The idea here is to allow "in frame" adjustment of the final Q
value used to encode each SB64, using segmentation.
There is also adjustment of the rd mult in regions of overspend.
Activated using aq_mode=2
Change-Id: I2f140cd898c9f877c32cd6d2e667f5e11ada4b1c
The decoder constructs the inter predictor using lazy border extension,
while the encoder, which runs motion search multiple times in the
rate-distortion optimization loop for each block, does border extension
at frame level. This commit separates the inter predictors for the
encoder and the decoder.
Change-Id: Ieca2fecba3a7201a6d64ef9f219e5d91e50559c3
When calling check_initial_width through vp9_set_size_literal
the function was defaulting to using non-subsampled chroma.
This patch changes the default to assume subsampled chroma as
an interim solution until complete support for other
color formats is added.
Change-Id: Id8e7e919b350e3473dfdf7551af6fd0716478b04
This commit takes out vp9_extend_frame_borders from
vp9_setup_scale_factors.
The refactoring prepares for the use of lazy border extension in the
decoder, which makes it necessary to handle border extension separately
in the encoder and the decoder. The use of vp9_extend_frame_borders
will be removed when lazy border extension is ready.
Change-Id: Ia3baba3d179d5f11eee1634f19b3b319d2a59186
The decoder ignored the display width & height
specified in the frame header.
This patch adds a control, VP9D_GET_DISPLAY_SIZE, to
allow the application to obtain the display width and
height from the frame header.
vpxdec has been modified to scale the output frame to
this size.
Should the request for the display size fail, vpxdec will
use the native width and height of the raw decoded
frame instead.
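The fallback described above could be modelled roughly as follows. This is
a standalone sketch, not code from vpxdec: query_display_size_fn is a
hypothetical stand-in for the VP9D_GET_DISPLAY_SIZE control call, and
query_ok / query_fail are made-up stubs used only to exercise both paths.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the decoder control call: returns 0 on success
 * and writes {width, height} into size; returns nonzero on failure. */
typedef int (*query_display_size_fn)(int size[2]);

/* Made-up stubs so both the success and the failure path can be exercised. */
static int query_ok(int size[2]) {
  size[0] = 320;
  size[1] = 240;
  return 0;
}

static int query_fail(int size[2]) {
  (void)size;
  return 1;
}

/* Use the display size from the frame header when the query succeeds,
 * otherwise fall back to the native size of the decoded frame. */
static void choose_output_size(query_display_size_fn query,
                               int native_w, int native_h,
                               int *out_w, int *out_h) {
  int size[2];
  assert(out_w != NULL && out_h != NULL);
  if (query != NULL && query(size) == 0) {
    *out_w = size[0];
    *out_h = size[1];
  } else {
    *out_w = native_w;
    *out_h = native_h;
  }
}
```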
Change-Id: I25db04407426dac730263720c75a7dd6400af68a
This patch followed "Add filter_selectively_vert_row2 to enable
parallel loopfiltering" commit, and added x86 SSE2 optimization
to do 16-pixel filtering in parallel. For other optimizations
(neon and dspr2), current 16-pixel functions were done by calling
8-pixel functions twice, and real 16-pixel functions could be added
later.
Decoder speedup:
tulip clip: 2% speed gain;
old_town_cross: 1.2% speed gain;
bus: 2% speed gain.
Change-Id: I4818a0c72f84b34f5fe678e496cf4a10238574b7
Moves all rate control variables to a separate structure,
removes some currently unused variables,
moves some rate control functions to vp9_ratectrl.c,
and splits the encode_frame_to_data_rate function.
Change-Id: I4ed54c24764b3b6de2dd676484f01473724ab52b
Using for loop based on max_tx_size instead of separate checks. Combining
build_coeff_contexts() with update_coef_probs().
Change-Id: Ie335a7db29830677fbc14478a9c190d3c1068665
Modifications are done to reduce the total clock cycle count.
Speedup: 1.2
Tested with: park_joy_420_720p50.y4m
Change-Id: Ia36b87e62e2f80a5fadaf5628729aedc80f38f3f
Both functions have no relation to motion vectors, so moving them from
vp9_findnearmv.h to vp9_blockd.h.
Change-Id: I74f524267886ab0fff4a2da793a10c906ed0f43a
Added filter_selectively_vert_row2 to prepare for parallel
loopfiltering in the vertical direction. This change filters 2 rows
at a time. If 2 vertically adjacent 8x8 blocks use the same type of
filtering, we can do 16-pixel filtering in parallel.
Next, we need to provide 16-pixel loopfiltering functions in C
and optimized versions for codec speedup.
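The pairing idea can be sketched as below. This is only an illustration
under assumptions: each bit of a mask marks one 8x8 column that needs a
given filter type in that row, the FilterCalls type and filter_two_rows
name are hypothetical, and the function counts which filter width would be
invoked rather than filtering any pixels.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
  int calls_16; /* 16-pixel filter invocations covering both rows */
  int calls_8;  /* 8-pixel filter invocations covering a single row */
} FilterCalls;

/* Walk the 8 columns of two vertically adjacent rows. When both rows need
 * the same filter at a column, one 16-pixel call handles them together;
 * when only one row needs it, an 8-pixel call is used. */
static FilterCalls filter_two_rows(uint8_t mask_row0, uint8_t mask_row1) {
  FilterCalls calls = { 0, 0 };
  int col;
  for (col = 0; col < 8; ++col) {
    const int r0 = (mask_row0 >> col) & 1;
    const int r1 = (mask_row1 >> col) & 1;
    if (r0 && r1) {
      ++calls.calls_16;
    } else if (r0 || r1) {
      ++calls.calls_8;
    }
  }
  assert(calls.calls_16 + calls.calls_8 <= 8);
  return calls;
}
```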
Change-Id: Idf97bbdd70566e55bd30e1fd25cb8544e33291be
Add support to do 16 pixel horizontal filtering in Neon.
Nexus devices saw about 0.5% decode speed increase.
Change-Id: I2993f6c2d49f31fa74976879eeaa289fd3f4e15d
This function is called from vp9_setup_past_independence() which is called
before the modified piece of code. Moving reset of inter_mode_probs into
vp9_init_mbmode_probs() for consistency.
Change-Id: Ib188e8798e1fbe15407fd501406761b746fdda95
Although no mismatch was indicated for 8/16 wide sub-pixel filters
in issue 661, they had similar problems that could potentially cause
a mismatch. This patch fixed the calculations in HORIZx8/16
and VERTx8/16.
Change-Id: I169961c9d40a20340995b7d22aafc89ccf30bfca
In commit "3d50da5397d20abc932d81453b26cde758293a40", the stack
pointer was modified while aligning the stack, and it needed to
be restored (popped) at the end.
Change-Id: I062971e195f1f2ab9d0ab5fb84dcf215a0fcaa67
There are many places in handle_inter_mode that need to restore the
dst buffer pointers, due to buffer pointer swap and early rd search
breakout. This commit wraps these operations into an inline function
for clean-up.
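A minimal sketch of such a helper, assuming a simplified per-plane
destination descriptor. PlaneDst, BlockCtx and NUM_PLANES are hypothetical
stand-ins for the encoder's real structures, and the signature is an
illustration rather than the one used in the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PLANES 3 /* Y, U, V */

/* Hypothetical, simplified stand-ins for the per-plane dst buffer state. */
typedef struct {
  uint8_t *buf;
  int stride;
} PlaneDst;

typedef struct {
  PlaneDst dst[NUM_PLANES];
} BlockCtx;

/* Put the saved destination pointers and strides back after a buffer
 * swap or an early breakout from the rd search. */
static inline void restore_dst_buf(BlockCtx *ctx,
                                   uint8_t *orig_buf[NUM_PLANES],
                                   const int orig_stride[NUM_PLANES]) {
  int i;
  assert(ctx != NULL);
  for (i = 0; i < NUM_PLANES; ++i) {
    ctx->dst[i].buf = orig_buf[i];
    ctx->dst[i].stride = orig_stride[i];
  }
}
```

Each early-exit path can then call the helper once instead of repeating
the per-plane assignments.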
Change-Id: I0462e8c41c8bc3cd8db07395489cac03d8e5be54
This patch fixed issue 661: "Decoder produces mismatched outputs
with ssse3 enabled and disabled." In sub-pixel filters, a pixel
value was multiplied by a filter coefficient, and the results
were added up. The order in which these products are summed had to
be arranged carefully to avoid overflow in the intermediate sums.
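Why the summation order matters can be shown with a scalar model. The
sketch below assumes the partial products are accumulated with saturating
signed 16-bit adds (as SIMD instructions such as paddsw do). The helper
names and the sample values are chosen for illustration and are not taken
from the actual filter code.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a saturating signed 16-bit add. */
static int16_t sat_add16(int16_t a, int16_t b) {
  const int32_t sum = (int32_t)a + (int32_t)b;
  if (sum > INT16_MAX) return INT16_MAX;
  if (sum < INT16_MIN) return INT16_MIN;
  return (int16_t)sum;
}

/* Accumulate three partial products strictly left to right. */
static int16_t sum_in_order(int16_t p0, int16_t p1, int16_t p2) {
  return sat_add16(sat_add16(p0, p1), p2);
}

/* Returns 1 if accumulating the products in the given order matches the
 * exact (wide) sum, 0 if an intermediate saturation corrupted the result. */
static int order_is_exact(int16_t p0, int16_t p1, int16_t p2) {
  const int32_t exact = (int32_t)p0 + p1 + p2;
  assert(exact >= INT16_MIN && exact <= INT16_MAX);
  return sum_in_order(p0, p1, p2) == exact;
}
```

With the illustrative products 30000, 10000 and -20000, adding the two
positive terms first saturates at 32767 and yields 12767, while adding
30000 and -20000 first gives the exact result 20000.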
Change-Id: Id08af4200fea9e1b896fc40157b8651c2c7e80f2
Reversing bit order of partition_context_lookup, and modifying accordingly
update_partition_context() and partition_plane_context().
Change-Id: I64a11f1a94962a3bf217de2f50698cb781db71a5
This rebase is a better implementation of the previous ones.
Modifications are done to reduce the total clock cycle count.
Speedup: 1.341
Compiled with -O3
Tested with: park_joy_420_720p50.y4m
Change-Id: I940eaf283f60597ca0d9d2e13d518878d55ff02d
Commit a4a5a210 enabled lossless coding, but the commit incorrectly
disabled the usage of skip in encoder even when skip should be used.
This commit makes sure that skip is enabled even in lossless mode.
Change-Id: I276954f952c6ac68f17a316ebc72f09001228a08
Since they are used in the encoder only. This commit also reorders the
includes for the files that include vp9_extend.h.
Change-Id: I929fc113f2135d3198cd1fc6a17434e5a2f8a459
Explicitly constrain the motion search range (in full-pixel units)
to [-1023, +1023]. This is intended to control the effective motion
search range for 4K sequences.
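A minimal sketch of the constraint, assuming the search window is held as a
min/max pair per direction in full-pixel units. SearchRange,
limit_search_range and the FULL_PEL_LIMIT name are hypothetical; only the
1023 bound comes from the text above.

```c
#include <assert.h>

#define FULL_PEL_LIMIT 1023 /* bound from the commit text; name is assumed */

typedef struct {
  int min;
  int max;
} SearchRange;

static int clamp_int(int value, int low, int high) {
  return value < low ? low : (value > high ? high : value);
}

/* Clamp both ends of a full-pixel search window to
 * [-FULL_PEL_LIMIT, +FULL_PEL_LIMIT]. */
static SearchRange limit_search_range(int min, int max) {
  SearchRange r;
  assert(min <= max);
  r.min = clamp_int(min, -FULL_PEL_LIMIT, FULL_PEL_LIMIT);
  r.max = clamp_int(max, -FULL_PEL_LIMIT, FULL_PEL_LIMIT);
  return r;
}
```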
Change-Id: I645539c70885eec0f155781f439d97d333336e88
This patch followed "Rewrite filter_selectively_horiz for parallel
loopfiltering" commit, and added x86 SSE2 optimization to do
16-pixel filtering in parallel. Also, corrected the declaration
of aligned arrays. For 8-pixel-in-parallel case, improved the
calculation of the masks and filters. Updated the threshold loading
since the thresholds were already duplicated. Updated neon C functions
to call neon loopfilters twice.
Using tulip clip, tests showed it gave a ~1.5% decoder speed gain.
Change-Id: Id02638626ac27a4b0e0b09d71792a24c0499bd35
Separate the rounding and right shift operations of forward transform
from those of inverse transform. Take out the assertion check from
inverse transforms. If the transform coefficients were constructed to
cause intermediate steps of the inverse transform to overflow, the codec
will just let them overflow without breaking the decoding flow.
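The separation could look roughly like the sketch below. It assumes a
14-bit rounding precision and uses hypothetical function names; the point
is only that the forward path keeps a range assertion while the inverse
path, whose input comes from the bitstream, does not.

```c
#include <assert.h>
#include <stdint.h>

#define ROUND_BITS 14 /* assumed precision of the transform constants */

/* Round to nearest, then shift right by ROUND_BITS. Right-shifting a
 * negative value is implementation-defined in C; this sketch relies on
 * the arithmetic shift that common compilers provide. */
static int round_shift(int value) {
  return (value + (1 << (ROUND_BITS - 1))) >> ROUND_BITS;
}

/* Forward transform: the encoder controls the input, so an out-of-range
 * result indicates a bug and is asserted on. */
static int forward_round_shift(int value) {
  const int rv = round_shift(value);
  assert(INT16_MIN <= rv && rv <= INT16_MAX);
  return rv;
}

/* Inverse transform: a crafted stream may push the result out of the
 * 16-bit range; no assertion, so decoding simply continues. */
static int inverse_round_shift(int value) {
  return round_shift(value);
}
```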
Change-Id: I73cfc3706c4e840fc543a77cbc4cdb0b05d07730
on arm until we implement real vp9_idct32x32_34_add_neon.
This issue is due to commit 47665452f0
Merge "Add 32x32 idct function for eob<=34 case".
Change-Id: I56b5f0abc20e7dd1bba521f78a995e85d65ea296
Simplifies the code by implementing band mapping with static arrays.
A lot of the code complexity introduced in a previous patch
disappears.
Change-Id: Ia3fac36e594fb5ad2d55ae141c58bba4c55c2d28
The switch to the rate-correction damping factor
in https://gerrit.chromium.org/gerrit/#/c/67536/ was not conditioned on CBR mode.
Change-Id: I2326704e8ac030a4f7b592dd3fedb94c7dd0644d
The step that sums three input samples could potentially cause the
intermediate result to exceed the 16-bit limit when operating as the
second 1-D transform. This commit fixes the issue.
Change-Id: Iaf512449ac2d25ddd8a806d760afab362c62a516
Overall change (using dual buffer scheme for superblocks of both inter
and intra modes) reduces speed 2 runtime:
bluesky_1080p at 6000kbps: 263553ms -> 257441ms
riverbed_1080p at 8000kbps: 233230ms -> 225308ms.
Change-Id: Idf8d70f768a4b0d97b2a8506372c57b7b4022119
As Jim suggested, a 1D array is now used to store filter levels instead
of a 2D array. This uses shift_y in setup_mask directly and saves a
few cycles.
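One way a flattened layout can work is sketched below, assuming one filter
level per 8x8 unit of a 64x64 superblock (an 8x8 grid of units). The names
lfl_index and set_levels are hypothetical. The row offset is a single
shift, so a caller that already has that shifted offset can reuse it
directly.

```c
#include <assert.h>
#include <stdint.h>

#define SB_UNITS 8 /* a 64x64 superblock holds 8x8 units of 8x8 pixels */

/* Flattened index of the unit at (row, col): row * 8 + col. */
static int lfl_index(int row, int col) {
  assert(row >= 0 && row < SB_UNITS && col >= 0 && col < SB_UNITS);
  return (row << 3) + col;
}

/* Fill an h x w block of units, starting at (row, col), with one level. */
static void set_levels(uint8_t lfl[SB_UNITS * SB_UNITS], int row, int col,
                       int h, int w, uint8_t level) {
  int r, c;
  assert(h > 0 && w > 0);
  assert(row + h <= SB_UNITS && col + w <= SB_UNITS);
  for (r = row; r < row + h; ++r) {
    uint8_t *const dst = &lfl[lfl_index(r, col)];
    for (c = 0; c < w; ++c) dst[c] = level;
  }
}
```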
Change-Id: If61ab298784861f1806b1cd396d4e4e2e0f097b9
Implements scan order to band map with arrays in both the encoder
and decoder to remove conditional statements.
Encoding seems to be about 1% faster at speed 0, tested on football.
Decoding seems to be about 0.5-1% faster on a set of 25 videos.
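The idea of replacing conditionals with a lookup array can be sketched as
follows. The band boundaries and both function names are illustrative
assumptions for a 16-coefficient block, not values quoted from the codec.

```c
#include <assert.h>
#include <stdint.h>

/* Branchy version: map a scan position (0..15) to a band with a chain of
 * comparisons. The boundaries here are illustrative. */
static int band_from_branches(int pos) {
  assert(pos >= 0 && pos < 16);
  if (pos == 0) return 0;
  if (pos < 3) return 1;
  if (pos < 6) return 2;
  if (pos < 10) return 3;
  if (pos < 13) return 4;
  return 5;
}

/* Table version: the same mapping as a static array, so the lookup is a
 * single indexed load with no conditionals. */
static const uint8_t band_table[16] = {
  0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5
};

static int band_from_table(int pos) {
  assert(pos >= 0 && pos < 16);
  return band_table[pos];
}
```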
Change-Id: Idb233ca0b9e0efd790e30880642e8717e1c5c8dd
This commit enables the dual buffer rate-distortion optimization
and encoding scheme. It stacks the original transform coefficients,
quantized levels, and reconstructed coefficients during the
rate-distortion optimization search, and hence eliminates the need
to re-run residual generation, forward transform, and quantization
in the encoding stage.
Change-Id: I011bfad3a59a380a869ee552e91dae0394ec492e
Added loop filter mask checking, and made the caller function
ready for implementation of parallel loopfiltering in horizontal
direction.
Next, we need to go through the loopfilter functions (both C and
optimized versions), and provide 16-byte wide loopfiltering for
each filter type.
Change-Id: Ifef47e7ef9086ebc2fd6ca7ede8f27c9bbf79e66
Allocate memory space of dual buffer sets that store the coeff, qcoeff,
dqcoeff, and eobs. Connect the pointers of macroblock_plane and
macroblockd_plane to the actual buffer in use accordingly.
Change-Id: I2f0b5f482ca879fae39095013eaf8901db20a5a4
Make the macroblockd_plane contain dynamic buffer pointers instead
of static pointers to the memory space allocated therein. The decoder
uses the buffer allocated in pbi, while the encoder will use a dual
buffer approach for rate-distortion optimization search.
Change-Id: Ie6f24be2dcda35df7c15b4014e5ccf236fb3f76c
We only used "ib" to call get_scan() function, which in turn calls
get_tx_type_4x4() function. The latter one only needs block index if
bsize < BLOCK_8X8 -- under that condition raster_block == block.
Change-Id: I697306a0c3cf937acdd4f5e623d4367c5acc0b2f
This commit fixes the assignment of the mode_info pointer per tile. It
recognizes tiles in both row and column formats and arranges the use
of mode_info accordingly.
The bug was first introduced in
I6226456dd11f275fa991e4a7a930549da6675915
https://gerrit.chromium.org/gerrit/#/c/67492/
Change-Id: Ie12cd209f53241513728c461ee3d7b9599ddb860
Inlining set_contexts_on_border() into set_contexts(). The only difference
is the check that "has_eob != 0" in addition to
"xd->mb_to_right_edge < 0" and "xd->mb_to_bottom_edge < 0". If has_eob == 0
then memset does the right thing and works faster.
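The reasoning can be illustrated with a simplified, one-directional
sketch. The function name and parameters below are hypothetical: the
crosses_edge flag stands in for the frame-edge test, and blocks_in_frame
is the number of context entries that lie inside the frame.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Set tx_blocks context entries. Entries that fall outside the frame are
 * always 0. When has_eob is 0 every entry is 0 regardless of the edge,
 * so the per-entry loop is only needed when has_eob != 0. */
static void set_context(uint8_t *ctx, int tx_blocks, int blocks_in_frame,
                        int crosses_edge, int has_eob) {
  assert(has_eob == 0 || has_eob == 1);
  assert(tx_blocks > 0);
  if (has_eob && crosses_edge) {
    int i;
    for (i = 0; i < tx_blocks; ++i)
      ctx[i] = (uint8_t)(i < blocks_in_frame ? has_eob : 0);
  } else {
    memset(ctx, has_eob, (size_t)tx_blocks);
  }
}
```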
Change-Id: I5206f767d729f758b14c667592b7034df4837d0e
This patch continued the work done in "Rewrite loop_filter_info_n
struct"(commit:00dbd369c70270428d56da6d15ea5486fc821c52) to further
improve loopfilter function.
1. Instead of storing pointers to thresholds, store loopfilter
levels within 64x64 SB;
2. Since loopfilter levels are already calculated in setup_mask,
we don't need to call build_lfi to look them up again. Just save
loopfilter levels in setup_mask.
3. Reorganized and simplified filter_block_plane().
Tests showed a ~0.8% decoder speedup.
Change-Id: I723c7779738bbc2afcb9afa2c6f78580ee6c3af7
This is to make sure that the prediction residue always gets coded in
lossless mode.
This commit also fixed the lossless unit test.
Change-Id: I537726ee55328d4e4cf0a0196393a67e12bfcde1
The new expression is much more logical than previous one. Surprisingly
both expressions give exactly the same set of dependent values
-- have_top, have_left, have_right -- in vp9_predict_intra_block.
Change-Id: I63eb1b592b8c37883b3a0dbb1f3daa271e446109
This patch fixed the issue reported in "Issue 655: remove textrel's
from 32-bit vp9 encoder". The set of vp9_subpel_variance functions
that used x86inc.asm ABI didn't build correctly for 32bit PIC. The
fix was carefully done given that there were not
enough registers.
After the change, we got
$ eu-findtextrel libvpx.so
eu-findtextrel: no text relocations reported in 'libvpx.so'
Change-Id: I1b176311dedaf48eaee0a1e777588043c97cea82
The term x represents macroblock pointer across encode_block. Change
the two local variable names to avoid confusion.
Change-Id: Ic732e73023525d673c0a678ed2708ac1edf5a3f9
Now tile decoding consists of two stages:
1. Find tile buffer start and its size, put this info into tile_buffers.
2. Decode each tile based on information from tile_buffers.
It seems that stage 1 can also be reused by multithreaded tile decoder.
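The two stages can be sketched as below. This is an illustration under
assumptions, not the decoder's code: it assumes every tile except the last
is preceded by a 4-byte big-endian size field, the TileBuffer type and the
function names are hypothetical, and decode_tile merely sums the bytes as
a placeholder for real decoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
  const uint8_t *data;
  size_t size;
} TileBuffer;

static uint32_t read_be32(const uint8_t *p) {
  return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
         ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Stage 1: locate each tile's start and size. Returns num_tiles on
 * success or -1 if the data is truncated. */
static int find_tile_buffers(const uint8_t *data, size_t size, int num_tiles,
                             TileBuffer *tiles) {
  const uint8_t *const end = data + size;
  int i;
  assert(num_tiles > 0);
  for (i = 0; i < num_tiles; ++i) {
    size_t tile_size;
    if (i < num_tiles - 1) {
      if ((size_t)(end - data) < 4) return -1;
      tile_size = read_be32(data);
      data += 4;
      if (tile_size > (size_t)(end - data)) return -1;
    } else {
      tile_size = (size_t)(end - data); /* last tile takes the remainder */
    }
    tiles[i].data = data;
    tiles[i].size = tile_size;
    data += tile_size;
  }
  return num_tiles;
}

/* Stage 2 placeholder: "decode" one tile using only its TileBuffer. */
static unsigned decode_tile(const TileBuffer *tile) {
  unsigned sum = 0;
  size_t i;
  for (i = 0; i < tile->size; ++i) sum += tile->data[i];
  return sum;
}
```

Because stage 2 depends only on the TileBuffer entries, each entry could
in principle be handed to a separate worker.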
Change-Id: If0cdaefdd6d10bb41c63561346c9ae4cfac081dd
It is more logical to use the dqcoeff buffer to hold *dequantized*
transform coefficients (inside the inverse_transform_block and
decode_coefs functions). Dequantization happens inside the
WRITE_COEF_CONTINUE macro.
qcoeff buffer should be only used in the encoder for *quantized*
transform coefficients.
Change-Id: Ifd54bef272bbf5311ced6669c4f1079f998af5d7
SVC multiple layer per frame encoding is invoked with vpx_svc_init and
vpx_svc_encode. These interfaces are designed to be invoked from ffmpeg.
Additional improvements:
- make dummy frame handling a bit more explicit
- fixed bug with single layer encodes
- track individual frame sizes and psnrs instead of averages
- parameterized quantizer, 16th scalefactors, more logging
- enabled single layer encodes to generate baseline
- include new mode for 3 layer I frame with 5 total layers
Change-Id: I46cfa600d102e208c6af8acd6132e0cc25cda8d4