This commit adds an encoder workaround to support better
compatibility with a non-compliant hardware vp9 profile 2 decoder.
The known issue with this decoder is:
The decoder assumes a wrong value, 127 instead of the correct
values of 511 and 2047, for any assumed top-left corner pixel in
UV planes for 10 and 12 bit, respectively. Such an assumed
top-left corner pixel is used for INTRA prediction when a real
decoded/reconstructed pixel is not available, e.g. when it is
located inside the row above the top row or inside the column
left of the leftmost column of a video image.
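The correct values quoted above follow a simple pattern in the bit
depth. A minimal sketch, assuming the value is (1 << (bit_depth - 1)) - 1,
which reproduces 511 and 2047 and gives 127 for 8 bit. The function
name is hypothetical and is not taken from libvpx:

```c
#include <assert.h>

/* Hypothetical helper, not a libvpx function: value assumed for the
 * top-left corner pixel when no reconstructed neighbour is available.
 * Assumption: the value is one below the mid-point of the sample range. */
static int assumed_top_left_pixel(int bit_depth) {
  assert(bit_depth == 8 || bit_depth == 10 || bit_depth == 12);
  return (1 << (bit_depth - 1)) - 1;
}
```

The non-compliant decoder described above behaves as if this function
always returned 127 for the UV planes, whatever the bit depth.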
Change-Id: Ic15a938a3107e1b85e96cb7903a5c4220986b99d
This commit fixes issue 1141. The issue was triggered in multi-tile
encoding. The change properly saves and restores the block context
information in the real-time mode selection process. It removes
several redundant memcpy operations in sub8x8 intra block mode search.
Change-Id: I35c9ad197f4bd500ec39b5fc833f052f19eee010
Make this consistent with regular block size rate-distortion
optimization. It improves the compression performance:
derf 0.055%
hevcmr 0.129%
Change-Id: I112fe734f592c21bc7aa6efb7e3f269c4214ee7b
It improves the compression performance of VP9 by 0.1% across all
test sets. No speed change is observed.
Change-Id: I59338c5c9e67bae22188f35fc3afbfe2a6bba6b0
This is a pure-refactor in preparation to potentially raise the bit-cost
resolution.
Verified at good speed 0 and rt speed -6.
Change-Id: I5347e6e8c28a9ad9dd0aae1d76a3d0f3c2335bb9
This commit enables encoder to avoid 8x4 and 4x8 partitions for
scaled reference frames when libvpx is configured and built with
--enable-better-hw-compatibility
Change-Id: I02ad65c386f5855f4325d72570c49164ed52f413
This commit makes the sub8x8 block rate-distortion optimization
scheme use precise motion compensated prediction to compute the rd
cost. It fixes a potential buffer overflow issue related to sub8x8
motion search on scaled reference frame.
Change-Id: I4274992ef4f54eaacfde60db045e269c13aaa2de
This change alters the nature and use of exhaustive motion search.
Firstly any exhaustive search is preceded by a normal step search.
The exhaustive search is only carried out if the distortion resulting
from the step search is above a threshold value.
Secondly the simple +/- 64 exhaustive search is replaced by a
multi stage mesh based search where each stage has a range
and step/interval size. Subsequent stages use the best position from
the previous stage as the center of the search but use a reduced range
and interval size.
For example:
stage 1: Range +/- 64 interval 4
stage 2: Range +/- 32 interval 2
stage 3: Range +/- 15 interval 1
This process, especially when it follows on from a normal step
search, has shown itself to be almost as effective as a full range
exhaustive search with step 1 but greatly lowers the computational
complexity such that it can be used in some cases for speeds 0-2.
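The staged search described above can be sketched as follows. This is
an illustration under stated assumptions, not the libvpx code: the
types, the function names and the cost callback are hypothetical, and
the mesh is assumed to be centered on the step search result.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int row, col; } MotionVec;          /* hypothetical type */
typedef struct { int range, interval; } MeshStage;   /* hypothetical type */
typedef unsigned int (*CostFn)(MotionVec mv, const void *ctx);

/* Each stage scans a square grid around the best position found by the
 * previous stage, using that stage's range and interval. */
static MotionVec mesh_search(MotionVec center, const MeshStage *stages,
                             int num_stages, CostFn cost, const void *ctx) {
  MotionVec best = center;
  unsigned int best_cost = cost(center, ctx);
  for (int s = 0; s < num_stages; ++s) {
    const MotionVec stage_center = best;
    const int range = stages[s].range;
    const int step = stages[s].interval;
    assert(range >= 0 && step > 0);
    for (int dr = -range; dr <= range; dr += step) {
      for (int dc = -range; dc <= range; dc += step) {
        const MotionVec cand = { stage_center.row + dr, stage_center.col + dc };
        const unsigned int c = cost(cand, ctx);
        if (c < best_cost) {
          best_cost = c;
          best = cand;
        }
      }
    }
  }
  return best;
}

/* The mesh search only runs when the step search result is poor. */
static MotionVec search_with_mesh_fallback(MotionVec step_best,
                                           unsigned int step_cost,
                                           unsigned int threshold,
                                           const MeshStage *stages,
                                           int num_stages, CostFn cost,
                                           const void *ctx) {
  if (step_cost <= threshold) return step_best;
  return mesh_search(step_best, stages, num_stages, cost, ctx);
}

/* Toy cost for illustration: city-block distance to a target vector. */
static unsigned int toy_cost(MotionVec mv, const void *ctx) {
  const MotionVec *t = (const MotionVec *)ctx;
  const int dr = mv.row - t->row;
  const int dc = mv.col - t->col;
  return (unsigned int)((dr < 0 ? -dr : dr) + (dc < 0 ? -dc : dc));
}
```

With the three example stages, each axis is sampled 33 + 33 + 31 times
instead of once per integer position over the combined reach of the
stages, which is where the reduction in complexity comes from.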
This patch also removes a double exhaustive search for sub 8x8 blocks
which also contained a bug (the two searches used different distortion
metrics).
For best quality in my test animation sequence this patch has almost
no impact on quality but improves encode speed by more than 5X.
Restricted use in good quality speeds 0-2 yields significant quality gains
on the animation test of 0.2 - 0.5 dB with only a small impact on encode
speed. On most clips though the quality gain and speed impact are small.
Change-Id: Id22967a840e996e1db273f6ac4ff03f4f52d49aa
A new version of vp9_highbd_error_8bit is now available which is
optimized with AVX assembly. AVX itself does not buy us too much, but
the non-destructive 3 operand format encoding of the 128bit SSEn integer
instructions helps to eliminate move instructions. The Sandy Bridge
micro-architecture cannot eliminate move instructions in the processor
front end, so AVX will help on these machines.
Two further optimizations are applied:
1. The common case of computing block error on 4x4 blocks is optimized
as a special case.
2. All arithmetic is speculatively done on 32 bits only. At the end of
the loop, the code detects if overflow might have happened and if so,
the whole computation is re-executed using higher precision arithmetic.
This case however is extremely rare in real use, so we can achieve a
large net gain here.
The optimizations rely on the fact that the coefficients are in the
range [-(2^15-1), 2^15-1], and that the quantized coefficients always
have the same sign as the input coefficients (in the worst case they are
0). These are the same assumptions that the old SSE2 assembly code for
the non high bitdepth configuration relied on. The unit tests have been
updated to take this constraint into consideration when generating test
input data.
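The speculative 32-bit scheme in point 2 can be illustrated in plain C.
This is a sketch only, not the AVX assembly: the function name is
hypothetical, and the overflow check (a wrap flag tested after the
loop) is an assumption about one way to detect the rare case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of the idea: sum of squared differences between
 * coefficients and dequantized coefficients. */
static int64_t block_error_speculative(const int16_t *coeff,
                                       const int16_t *dqcoeff, size_t n) {
  assert(coeff != NULL && dqcoeff != NULL);

  /* Fast path: accumulate in 32 bits and remember whether any add wrapped. */
  uint32_t sum32 = 0;
  uint32_t wrapped = 0;
  for (size_t i = 0; i < n; ++i) {
    const int32_t d = (int32_t)coeff[i] - (int32_t)dqcoeff[i];
    const uint32_t ad = (uint32_t)(d < 0 ? -d : d); /* at most 65535 */
    const uint32_t next = sum32 + ad * ad;          /* unsigned, may wrap */
    wrapped |= (uint32_t)(next < sum32);
    sum32 = next;
  }
  if (!wrapped) return (int64_t)sum32;

  /* Rare slow path: redo the whole computation in 64 bits. */
  int64_t sum64 = 0;
  for (size_t i = 0; i < n; ++i) {
    const int64_t d = (int64_t)coeff[i] - (int64_t)dqcoeff[i];
    sum64 += d * d;
  }
  return sum64;
}
```

The slow path repeats all the work, so the scheme only pays off because
overflow is rare for real input.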
Change-Id: I57d9888a74715e7145a5d9987d67891ef68f39b7
If high bit depth configuration is enabled, but encoding in profile 0,
the code now falls back on optimized SSE2 assembler to compute the
block errors, similar to when high bit depth is not enabled.
Change-Id: I471d1494e541de61a4008f852dbc0d548856484f
Access scaled reference frame in the sub8x8 rate-distortion
optimization loop only when the current test mode is an inter mode.
This prevents an ioc warning triggered by sending intra_frame index
to fetch scaled reference frame.
Change-Id: I6177ecc946651dd86c7ce362e3f65c4074444604
This commit allows the encoder to include sub8x8 inter mode with
scaled reference frame in the rate-distortion optimization scheme.
Change-Id: Ibbe9678801592826ef22566566dcdeeb008350d5
Don't run rate_block (cost_coeffs) if distortion alone is enough to
surpass best_rd.
This decreases 2nd pass runtime on HD at speed 2 by about 2%. There is
zero effect on output if tx_cache is removed.
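A minimal sketch of this early exit. The cost formula and all names
below are simplified assumptions for illustration, not the libvpx
RDCOST macro or the rate_block signature:

```c
#include <assert.h>
#include <stdint.h>

typedef int (*RateFn)(void *ctx); /* hypothetical stand-in for rate_block */

/* Simplified, hypothetical rd cost: scaled rate plus distortion. */
static int64_t toy_rd_cost(int rdmult, int rate, int64_t dist) {
  return (((int64_t)rate * rdmult) >> 8) + dist;
}

/* Returns INT64_MAX without computing the rate when the distortion
 * alone already surpasses best_rd. Rate is non-negative, so the cost
 * at rate 0 is a lower bound on the full cost. */
static int64_t block_rd_with_early_exit(int64_t dist, int64_t best_rd,
                                        int rdmult, RateFn rate_fn,
                                        void *ctx) {
  assert(rdmult >= 0 && dist >= 0);
  if (toy_rd_cost(rdmult, 0, dist) > best_rd) return INT64_MAX;
  return toy_rd_cost(rdmult, rate_fn(ctx), dist);
}

/* Test helper: counts how often the expensive rate path is taken. */
static int counting_rate(void *ctx) {
  ++*(int *)ctx;
  return 100;
}
```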
Change-Id: Ia3b1cc77bfbe6ee988c395fde06c0eb92940b784
1. The RD scores obtained during the tx size selection were stored in the
tx cache, and used to help make the tx decision for the following frames.
This wasn't used anymore in VP9 encoder. Recovered the related decision
making code from 1.5+ years ago, and borg tests didn't show any quality
gain. This patch removed it to lower the complexity.
2. An optimization was done after the above refactoring. If the tx_mode
is not TX_MODE_SELECT, we only need to test the chosen tx size instead
of all possible tx sizes. This gave a 1.5% average speed gain at speed 2,
and a 1% average speed gain at speed 3.
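The optimization in point 2 amounts to narrowing the range of
transform sizes that get tested. A sketch under assumptions: the enums
are declared locally for illustration and only mirror the naming used
above, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TX_4X4, TX_8X8, TX_16X16, TX_32X32 } TxSize;
typedef enum {
  ONLY_4X4, ALLOW_8X8, ALLOW_16X16, ALLOW_32X32, TX_MODE_SELECT
} TxMode;

/* Largest transform size a fixed (non-select) mode permits. */
static TxSize largest_tx_for_mode(TxMode mode) {
  switch (mode) {
    case ONLY_4X4: return TX_4X4;
    case ALLOW_8X8: return TX_8X8;
    case ALLOW_16X16: return TX_16X16;
    default: return TX_32X32;
  }
}

/* Sets the inclusive range of sizes to test and returns how many
 * sizes that is. Only TX_MODE_SELECT tests more than one size. */
static int tx_sizes_to_test(TxMode mode, TxSize max_tx_size,
                            TxSize *smallest, TxSize *largest) {
  assert(smallest != NULL && largest != NULL);
  if (mode == TX_MODE_SELECT) {
    *smallest = TX_4X4;
    *largest = max_tx_size;
  } else {
    const TxSize mode_max = largest_tx_for_mode(mode);
    const TxSize chosen = mode_max < max_tx_size ? mode_max : max_tx_size;
    *smallest = chosen;
    *largest = chosen;
  }
  return (int)*largest - (int)*smallest + 1;
}
```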
Change-Id: Id8cd650e066a8cef33829d8c15388a8138adc78c
Separate the hybrid transform case from 2D-DCT case. This will
allow us to clear up cross dependency between c and SIMD
implementations later.
Change-Id: Iaa499e8b096850a1c5a0c50a3b6e63e15d0184bf
This commit simplifies the intra block boundary condition logic.
It removes the block index from the argument set.
Change-Id: If00142512eb88992613d6609356dfd73ba390138
Changes to allow more use of rectangular partitions at
speeds 1 and 2 for content classed by the first pass as
animation and for blocks near the active image edge.
This has quite a big impact in quality for the animated
test sequence but also hurts encode speed for speed 2.
For other content types the impact on both speed and
quality is small.
Added some plumbing for detection of internal vertical
image edges.
Change-Id: I3fc48de2349f8cb87946caaf0b06dbb0ea261a9a
Change speed features / behavior for split mode when there
is an internal active edge (e.g. formatting bars).
Remove some threshold constraints in rd code near the active
edge of the image.
Add some plumbing for left and right active edge detection.
Patch set 5. Limit rd pass through for sub 8x8 to internal active edges.
This takes away any speed penalty for most clips but keeps the enhanced
edge coding for the more critical case of internal image edges.
Change-Id: If644e4762874de4fe9cbb0a66211953fa74c13a5
to MB_MODE_INFO_EXT. This saves 36 bytes per 8x8 area for
both the decoder and encoder. (encoder has two MODE_INFO
buffers)
Change-Id: If006abb2224acaf326df3c2be09e77e967662107
Various header/test files had to be re-worked in order to
build "Remove cm parameter from vp9_decode_block_tokens()".
This patch reverts the "Remove cm" part and only contains
the re-worked header files.
Change-Id: I520958a88d1991fee988a3c784d0eac40e117a32
With the sad functions, and hopefully the variance functions soon,
moving to the vpx_dsp location, place the defines used in the
reference C code in a common location.
Change-Id: I4c8ce7778eb38a0a3ee674d2f1c488eda01cfeca
this macro was used inconsistently and only differs in behavior from
DECLARE_ALIGNED when an alignment attribute is unavailable. this macro
is used with calls to assembly, while generic c-code doesn't rely on it,
so in a c-only build without an alignment attribute the code will
function as expected.
Change-Id: Ie9d06d4028c0de17c63b3a27e6c1b0491cc4ea79
(see I3a05cf1610679fed26e0b2eadd315a9ae91afdd6)
For the test clip used, the decoder performance improved by ~2%.
This is also an intermediate step towards adding back the
mode_info streams.
Change-Id: Idddc4a3f46e4180fbebddc156c4bbf177d5c2e0d
This commit separates Hadamard transform/quantization operations
from rate and distortion computation in block_yrd. This allows one
to skip SATD computation when all transform blocks are quantized
to zero. It also uses a new block error function that skips
repeated computation of sum of squared residuals. It reduces the
CPU cycles spent on block error calculation in block_yrd by 40%.
Change-Id: I726acb2454b44af1c3bd95385abecac209959b10
To enable us to use the scale-invariant motion estimation
code during mode selection, each of the reference
buffers is scaled to match the size of the frame
being encoded.
This fix ensures that a unit scaling factor is used in
this case rather than the one calculated assuming that
the reference frame is not scaled.
Change-Id: Id9a5c85dad402f3a7cc7ea9f30f204edad080ebf
Revised adjustment for rd based on source complexity.
Two cases:
1) Bias against low variance intra predictors
when the actual source variance is higher.
2) When the source variance is very low to give a slight
bias against predictors that might introduce false texture
or features.
The impact on metrics of this change across the test sets is
small and mixed.
derf   -0.073%, -0.049%, -0.291%
std hd -0.093%, -0.1%,   -0.557%
yt     +0.186%, +0.04%,  -0.074%
ythd   +0.625%, +0.563%, +0.584%
Medium to strong psycho-visual improvements in some
problem clips.
This feature and intra weight on GF group length are now
turned on by default.
Change-Id: Idefc8b633a7b7bc56c42dbe19f6b2f872d73851e
This experiment biases the rd decision based on the impact
a mode decision has on the relative spatial complexity of the
reconstruction vs the source.
The aim is to better retain a semblance of texture even if it
is slightly misaligned / wrong, rather than use a simple rd
measure that tends to favor use of a flat predictor if a perfect
match can't be found.
This improves the appearance of texture and visual quality
on specific test clips but is hidden under a flag and currently
off by default pending visual quality testing on a wider Yt set.
Change-Id: Idf6e754a8949bf39ed9d314c6f2daaa20c888aad
The joint_motion_search function alternates prediction
between two reference frames. In order to reuse existing
code, a pointer to the appropriate reference frame is
written into xd->plane[0].pre[0], that the motion
estimation code assumes points to the reference frame.
If this first reference frame was scaled then the
pointer was incorrectly being reset to point to the
unscaled reference frame rather than the scaled
version.
Change-Id: I76f73a8d8f4f15c1f3a5e7e08a35140cdb7886ab
It was tiny when it was originally marked INLINE. Forcing this function
to be inlined prevents the compiler from inlining its much smaller
callers.
No measurable speed impact, 28320 byte smaller libvpx.a
Change-Id: I6bf4c917157d15cbadb3cd3e20a9e82d35dc7d6f
Frame buffers are now allocated dynamically on-demand.
Entries in the reference frame map, cm->ref_frame_map,
may now be set to -1 (INVALID_IDX) to indicate that
there is not a valid reference buffer in that "slot".
All slots in the reference frame map are now initialized
to the empty state (-1) and each buffer is initialized
to have a reference count of 0.
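The initial state described above can be sketched as follows. The
struct layout, the array sizes and the function names are assumptions
made for illustration; only ref_frame_map and INVALID_IDX come from
the description.

```c
#include <assert.h>
#include <stddef.h>

#define INVALID_IDX (-1)
#define NUM_REF_SLOTS 8      /* assumed size, for illustration only */
#define NUM_FRAME_BUFS 12    /* assumed size, for illustration only */

typedef struct { int ref_count; } FrameBuf;   /* hypothetical */
typedef struct {
  int ref_frame_map[NUM_REF_SLOTS];
  FrameBuf frame_bufs[NUM_FRAME_BUFS];
} CommonState;                                /* hypothetical */

/* Every slot starts empty and every buffer starts unreferenced. */
static void init_ref_frame_state(CommonState *cm) {
  assert(cm != NULL);
  for (int i = 0; i < NUM_REF_SLOTS; ++i) cm->ref_frame_map[i] = INVALID_IDX;
  for (int i = 0; i < NUM_FRAME_BUFS; ++i) cm->frame_bufs[i].ref_count = 0;
}

/* Callers must check the slot before indexing the buffer array. */
static int slot_has_buffer(const CommonState *cm, int slot) {
  assert(cm != NULL && slot >= 0 && slot < NUM_REF_SLOTS);
  return cm->ref_frame_map[slot] != INVALID_IDX;
}
```

Since -1 is not a valid array index, any code that previously used a
map entry directly as a buffer index needs a check of this kind first.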
Change-Id: Id1afe98de98db4ae8b2dfefed7889c3b28c68582
Note: This feature is still in development.
Add an option for the encoder to decide the resolution
at which to encode each frame.
Each KF/GF/ARF group is tested to see if it would be
better encoded at a lower resolution. At present, each
KF/GF/ARF is coded first at full-size and if the coded
size exceeds a threshold (twice target data rate) at
the maximum active Q then the entire group is encoded
at lower resolution.
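The current test can be written as a single predicate. This is a
sketch with hypothetical names and parameters, not the encoder code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical predicate: after coding the KF/GF/ARF at full size,
 * decide whether the whole group should be coded at lower resolution.
 * The threshold is twice the target size, and it only applies when the
 * frame was already coded at the maximum active Q. */
static int should_code_group_at_lower_res(int64_t coded_bits,
                                          int64_t target_bits,
                                          int q, int max_active_q) {
  assert(coded_bits >= 0 && target_bits >= 0);
  return q >= max_active_q && coded_bits > 2 * target_bits;
}
```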
This feature is enabled in vpxenc by setting:
--resize-allowed=1
In addition, if the vpxenc command line also specifies
valid frame dimensions using:
--resize-width=XXXX & --resize-height=YYYY
then *all* frames will be encoded at this resolution.
Change-Id: I13f341e0a82512f9e84e144e0f3b5aed8a65402b
In frame parallel decode, the libvpx decoder decodes several frames on
all cpus in parallel. If it is not flushed, it only returns a frame
when all the cpus are busy. If it is flushed, it returns all the
frames held in the decoder. Unlike the current serial decode mode, in
which the libvpx decoder is idle between decode calls, the frame
parallel decoder stays busy between decode calls.
Current frame parallel decode will only speed up the decoding for frame
parallel encoded videos. For non frame parallel encoded videos, frame
parallel decode is slower than serial decode due to lack of loopfilter
worker thread.
There are still some known issues that need to be addressed. For example,
decoding frame parallel videos with segmentation enabled is sometimes
incorrect.
* frame-parallel:
Add error handling for frame parallel decode and unit test for that.
Fix a bug in frame parallel decode and add a unit test for that.
Add two test vectors to test frame parallel decode.
Add key frame seeking to webmdec and webm_video_source.
Implement frame parallel decode for VP9.
Increase the thread test range to cover 5, 6, 7, 8 threads.
Fix a bug in adding frame parallel unit test.
Add VP9 frame-parallel unit test.
Manually pick "Make the api behavior conform to api spec." from master branch.
Move vp9_dec_build_inter_predictors_* to decoder folder.
Add segmentation map array for current and last frame segmentation.
Include the right header for VP9 worker thread.
Move vp9_thread.* to common.
ctrl_get_reference does not need user_priv.
Seperate the frame buffers from VP9 encoder/decoder structure.
Revert "Revert "Revert "Revert 3 patches from Hangyu to get Chrome to build:"""
Conflicts:
test/codec_factory.h
test/decode_test_driver.cc
test/decode_test_driver.h
test/invalid_file_test.cc
test/test-data.sha1
test/test.mk
test/test_vectors.cc
vp8/vp8_dx_iface.c
vp9/common/vp9_alloccommon.c
vp9/common/vp9_entropymode.c
vp9/common/vp9_loopfilter_thread.c
vp9/common/vp9_loopfilter_thread.h
vp9/common/vp9_mvref_common.c
vp9/common/vp9_onyxc_int.h
vp9/common/vp9_reconinter.c
vp9/decoder/vp9_decodeframe.c
vp9/decoder/vp9_decodeframe.h
vp9/decoder/vp9_decodemv.c
vp9/decoder/vp9_decoder.c
vp9/decoder/vp9_decoder.h
vp9/encoder/vp9_encoder.c
vp9/encoder/vp9_pickmode.c
vp9/encoder/vp9_rdopt.c
vp9/vp9_cx_iface.c
vp9/vp9_dx_iface.c
This reverts commit a18da9760a.
Change-Id: I361442ffec1586d036ea2e0ee97ce4f077585f02
In frame parallel decode, the libvpx decoder decodes several frames on
all cpus in parallel. If it is not flushed, it only returns a frame
when all the cpus are busy. If it is flushed, it returns all the
frames held in the decoder. Unlike the current serial decode mode, in
which the libvpx decoder is idle between decode calls, the frame
parallel decoder stays busy between decode calls. VP9 frame parallel
decode is >30% faster than serial decode with tile parallel threading,
which will make it easier for devices to play 1080P VP9 videos.
* frame-parallel:
Add error handling for frame parallel decode and unit test for that.
Fix a bug in frame parallel decode and add a unit test for that.
Add two test vectors to test frame parallel decode.
Add key frame seeking to webmdec and webm_video_source.
Implement frame parallel decode for VP9.
Increase the thread test range to cover 5, 6, 7, 8 threads.
Fix a bug in adding frame parallel unit test.
Add VP9 frame-parallel unit test.
Manually pick "Make the api behavior conform to api spec." from master branch.
Move vp9_dec_build_inter_predictors_* to decoder folder.
Add segmentation map array for current and last frame segmentation.
Include the right header for VP9 worker thread.
Move vp9_thread.* to common.
ctrl_get_reference does not need user_priv.
Seperate the frame buffers from VP9 encoder/decoder structure.
Revert "Revert "Revert "Revert 3 patches from Hangyu to get Chrome to build:"""
Conflicts:
test/codec_factory.h
test/decode_test_driver.cc
test/decode_test_driver.h
test/invalid_file_test.cc
test/test-data.sha1
test/test.mk
test/test_vectors.cc
vp8/vp8_dx_iface.c
vp9/common/vp9_alloccommon.c
vp9/common/vp9_entropymode.c
vp9/common/vp9_loopfilter_thread.c
vp9/common/vp9_loopfilter_thread.h
vp9/common/vp9_mvref_common.c
vp9/common/vp9_onyxc_int.h
vp9/common/vp9_reconinter.c
vp9/decoder/vp9_decodeframe.c
vp9/decoder/vp9_decodeframe.h
vp9/decoder/vp9_decodemv.c
vp9/decoder/vp9_decoder.c
vp9/decoder/vp9_decoder.h
vp9/encoder/vp9_encoder.c
vp9/encoder/vp9_pickmode.c
vp9/encoder/vp9_rdopt.c
vp9/vp9_cx_iface.c
vp9/vp9_dx_iface.c
Change-Id: Ib92eb35851c172d0624970e312ed515054e5ca64
This commit enables sub8x8 inter block coding for RTC mode. The
use of sub8x8 blocks can be turned on by allowing
choose_partitioning function to select 4x4/4x8/8x4 block sizes.
Change-Id: Ifbf1fb3888fe4c094fc85158ac3aa89867d8494a
This commit explicitly sets the second reference frame type to be
NONE in key frame coding mode. This fixes a subtle dependency of
reference motion vector used by next inter frame on mode_info
reset before key frame coding.
Change-Id: I5ff0359753fdc9992b0bfe889490f7a32d7d5f6a
Where there is very subtle motion, especially when combined
with low spatial complexity, the codec sometimes fails to quickly
pick up the ambient motion field.
Once it has been established though the field propagates well using
Nearest and Near MV.
This patch looks specifically at the case where the Nearest and Near
have not been established as non zero vectors and in this case
discounts the cost of searching for a new vector in the rd code.
This will almost certainly have some implications in terms of encode
speed but it should be possible to mitigate the impact in a subsequent
patch using first pass stats and the local spatial complexity.
Average results for test sets approximately neutral.
Change-Id: I44a29e20f11f7ab10f8c93ffbdc50183d9801524