This patch allows the encoder to skip the search for partition
SPLIT, HORZ, VERT after the search for partition NONE is done
in RD optimization. It uses the first pass block-wise statistics
to make the decision. If all 16x16 blocks in the current partition
have zero motions and small residues from the frist pass statistics,
and it has small difference variance, further partition search is
skipped.
For speed 2 setting, experiments on general youtube clips show that
the speedup varies from 1% - 10%, 5% on average. On the performance
side in PSNR, derf 0.004%, yt -0.059%, hd -0.106%, stdhd 0.032%.
For hard stdhd clips:
park_joy_1080p, 502952 ms -> 503307 ms (-0.07%)
pedestrian_area_1080p, 227049 ms -> 220531 ms (+3%)
This feature is under the compilation flag CONFIG_FP_MB_STATS and
it is off in current setting.
Change-Id: I554537e9242178263b65ebe14a04f9c221b58bae
Remove the variable that indicates the relative block index. This
is explicitly covered by the use of pc_tree.
Change-Id: Ib13142582fff926c85e375bde656aa050add8350
This commit enables a chessboard pattern constrained partition
search for 720p and above resolutions. The scheme applies stricter
partition search to alternative blocks based on its above/left
neighboring blocks' partition range, as well as that of the
collocated blocks in the previous frame. It is currently turned
on at 16x16 block size level. The chessboard pattern is flipped
per coding frame.
The speed 3 runtime is reduced:
park_joy_1080p, 652832 ms -> 607738 ms (7% speed-up)
pedestrian_area_1080p, 215998 ms -> 200589 ms (8% speed-up)
The compression performance is changed:
hd -0.223%
stdhd -0.295%
Change-Id: I2d4d123ae89f7171562f618febb4d81789575b19
vp9_variance16x16(), and vp9_get16x16var().
On a Nexus 7, vpxenc (in realtime mode, speed -12)
reported a performance improvement of ~16.7%.
Change-Id: Ib163aa99f56e680194aabe00dacdd7f0899a4ecb
currently the only way to know if multiple alt-refs are enabled is to
inspect the encoder instance.
this reduces the size of the allocation by 75% when not using multiple
alt-refs
Change-Id: Ie4baa240c2897e64b766c6ad229674884b5a65b6
This commit replace the repetitive retrieve of max and min allowed
partition from speed_feature with local variables max_size and
min_size.
Change-Id: Ib06f11f16615e4876e4dd5fb6a968c6bf5f7b216
The get_chessboard_index() used to call the entire VP9_COMMON
struct pointer to retrieve the chessboard pattern index. This cl
makes it call the frame index directly.
Change-Id: I3cad9d209ea2e77a358085a04fe1ff0ddec5ba03
Remove the redundant index computation when store the first
pass block-wise statistics. Currently, a single byte is
allocated for a 16x16 blocks, and all the frame statistics
saved during the first pass will be kept in memory for use
in the second pass. For a 1920x1080 300-frame clip, it will
take about 2.3 MB memory. This feature is off in current
setting.
Change-Id: I135a95b348ec093d54c6a07e1e8237626909e3bd
Remove all the redundant dct functions (dct4x4, dct8x8)
in avx2 except dct32x32 those functions were copied originally from dct_sse2
Change-Id: I742576fbf5175f3ac09f2076976a9247b259323e
The issue was introduced by commit g9f37d14 with adding explicit
restrictions on reference-frame scale factors. The restriction
is checked against aligned-by-8 frame dimensions, not against
original ones. So, for example, frame of 35×35 actually can refer
to frame of 70×70, but the new check won't allow this. It will
compare 35 vs 72 (not 70), so 2x downscale limit will be exceeded.
Change-Id: Ic663693034440f64ac8312cbff9e1e773a921060
The partition search for 4x4 blocks takes unnecessary steps to
reconstruct pixels and an extra partition type update. This commit
removes such operations. No visible compression/speed difference.
Thanks to Yue (yuec@) for finding this issue.
Change-Id: I3f83824aa3fd3717d63be0b280fa57258939a70a
This commit turns on the existing vp9_get_prob function using
64 bit in the intermediate step. It fixes the ioc issue for 4K
above frame sizes (issue 828).
Change-Id: I9f627f3beca2c522f73b38fd2a3e7eefdff01a7c
The assignment of the variable mode_excluded in
vp9_rd_pick_inter_mode_sub8x8 takes redundant conditional jump.
This commit removes it.
Change-Id: Ie195fbe6e54ec2ade7093d562c456a2e93143704
A previous change, https://gerrit.chromium.org/gerrit/#/c/70632,
introduced a size validation for reference frames to insuare the
input stream is a valid VP9 stream. However, the logic requiring
all reference frames have valid size turned out to be too strict.
In this commit, we modify the validation to require one of the
reference frame has valid dimension. In addition, the decoder
reports error whenever it detects the use of reference frame
with invalid scalig ratio.
Change-Id: If8efc312244087556cfe00f1fcbdff811268ebad
The patch:
https://gerrit.chromium.org/gerrit/#/c/70814/
changed the test that determined whether the context
frame buffers needed to be reallocated or not.
The code checked for a change in total frame area
to signal the need to reallocate context buffers.
However, the above_context buffer needs to be
resized i:xf only the width of the frame has increased.
Change-Id: Ib89d75651af252908144cf662578d84f16cf30e6
1. Remove last reference flag for first frame upper layers in one pass mode.
2. Disable refresh golden frame flag for key frames.
Change-Id: I44ac1bd2c795169e4fbfdd078ea79a1d33a204d6
The value of mode_excluded has been properly set in
vp9_rd_pick_inter_mode_sb(). It is redundant to send it in
handle_inter_mode() and re-set the value again.
Change-Id: I408d4731f2f42e0bcf3ae62e85757717bb410471
This commit extends the chessboard pattern prediction filter search.
If the above and left blocks have the same prediction filter type,
the encoder will skip the prediction filter type search and use the
reference one.
The overall chessboard pattern prediction filter type search reduces
speed 3 runtime for hard clips. Experiments on park joy at 1080p
and 15000 kbps show that the runtime goes from 723265 ms to 65832 ms,
i.e., about 10% speed-up. Compression performance wise, it affects
the coding quality by
Change-Id: I880975497c7ad166532e9eea9bf46684d77ff327
derf: -0.326%
yt: -0.257%
hd: -0.241%
stdhd: -0.417%
This commit enables a chessboard pattern prediction filter type
search scheme for rate-distortion optimization speed-up. For the
inferred motion vector modes, the encoder can re-use its above/left
neighbor blocks' prediction filter type and skip a full test on
all possible filter types. Such operation is turned on/off
alternatively in a chessboard manner.
It is turned on in speed 3. For test clip pedestrian 1080p, the
runtime is reduced from 231500 ms -> 221700 ms. The compression
performance is changed:
derf: -0.147%
yt: -0.134%
hd: -0.079%
stdhd: -0.220%
Change-Id: I1912f278e7576c2dc632688e3ad7a257410c605a
When OUTPUT_YUV_DENOISED is enabled the encoder outputs the uncompressed,
denoised video to a separate file. Moved the point at which the file is
written to in order to avoid an extra blank frame at the beginning of the video.
Change-Id: I805f6a912b18b3d9cae59b13c5b8108279439ce3
This should be a local variable. Move the definition from
vp9_rd_pick_inter_mode_sb to handle_inter_mode.
Change-Id: I14f4168bb1c896ed04e8f6d4cd89fbf4c9839944
For sequences of resolution below 720p, the encoder will check
intra prediction modes and inter prediction modes from LAST_FRAME.
This commit turns on adaptive prediction filter scheme for sub8x8
blocks, where inter prediction modes are enabled. For the test
sequence bus at CIF, the speed 2 runtime goes down from 17879 ms
to 16783 ms, i.e., 6% speed up. The compression performance of
derf set is down by -0.128%.
Change-Id: I01d5321a5ceab4e0666ac5be56c52d896c7a8d45
The commit moved a call to vp9_clear_system_state() to a correct
location, i.e. prior function calls using floating point numbers.
This was to fix a mismatch mmx code and sse2 version, where a
floating point number used in adjust_frame_rate(cpi) gets NAN due
to mmx registers being in wrong state.
Change-Id: I40e0a6de98812000ccee6a729badb630604fd7e6
For gcc, when libvpx config option debug is disabled, added the
flag -DNDEBUG to disable the assertions in libvpx for some speedup.
Change-Id: Ifcb7b9e8ef5cbe5d07a24407b53b9a2923f596ee
This patch adds back in code that checks that the frame
size lies within defined bounds was inadvertantly removed
by a previous patch:
https://gerrit.chromium.org/gerrit/#/c/70814/
Change-Id: If526570ba559260c4b7e98098bc75f7700ae7f97
Separates HBD profile int two profiles (2 and 3) consistent with the
highbitdepth branch. This patch is ported from the original highbitdepth
branch patch: https://gerrit.chromium.org/gerrit/#/c/70460/
Two of the invalid file tests needed to be updated.
Change-Id: I6a4acd2f7a60b1fb4cbcc8e0dad4eab4248431e3
The bug sets the wrong pointer to the first pass mb stats
if the encoder does the re-coding in the second pass.
Change-Id: I8a11f45dd7dceb38de814adec24cecccae370d00
This patch is the first step toward simplifying the
frame buffer handling.
The final goal is to have a common frame buffer handling
framework for both encoder and decoder that incorporates
the existing ability to use externally allocated memory.
Change-Id: I2c378a4f54a39908915f46c4260e17a080db7ff1
This is a practical concern to allow us to fail in a decoder instance
if the size of a file is bigger than we can reasonably handle.
Change-Id: I0446b5502b1f8a48408107648ff2a8d187dca393
This commit changed the hard-coded DEFAULT_INTERP_FILTER to a speed
feature with the same default value: SWITCHABLE.
Change-Id: I7f54f40f1bd3f5277841d04b85db7a84e47313f1
and vp9_sad16x16_neon()
On a Nexus 7, vpxenc (in realtime mode, speed -6)
reported a performance improvement of ~17%.
Change-Id: I91e070cde2973451083d3f3d63b49b7886de9a85
2 pass only change to calculation of rd mult based on Q.
Make a small adjustment based on frame type and also
replace adjustment based on iifactor with an one based
on the ambient GF/ARF boost level.
Also fix multi arf bug / issue.
Overall these change give an slight improvement in ssim
but hurt psnr a little.
Change-Id: I5e1751e3ff5390a26f543d7855059e6fbcce105e
We target this speed to achieve similar encoding speed and better
compression than vp8 rt mode with cpu-used at -12.
Change-Id: Ic1bb4371c81a17ea80e83459c1cbf4c09a3498e8
The issue was introduced by commit g7c43fb6. If current frame
is repeated from existing-ref pool, frame buffer ref counter
is not decreased, so buffer isn't released. Decoder fails being
unable to allocate new frame buffer at some point.
Added a test vector to verify that the condition will not
recur later. Test vector was generated by the code in this patch:
https://gerrit.chromium.org/gerrit/#/c/70862/
Change-Id: I8af96eb5b9670176e01a281d2e18bd458712cf78
In vp8, statistics are collected about the different modes as they are searched.
This process is more complicated due to the variable block size. Fields were
added to the PICM_MODE_CONTEXT struct to hold this information for each point in
the search. The information is then taken from the appropriate part of the tree
during denoising.
Change-Id: I89261ab77ad637821287ae157dfdf694702b8e77
All changes are for spatial svc only.
1. Enable encoding hidden frames in each layer and use alt reference idex to reference the hidden frame in each layer
2. Use golden reference idx for spatial reference
3. For those layers that don't have hidden frames (caused by lack of frame buffers), reference a hidden frame in lower layers
4. Add "auto-alt-refs" in svc options
Change-Id: Idf27d1fd2fb5f3ffd9e86d2119235e3dad36c178
This commit fixes a potential out-of-boundary memory access due to
the use of reuse_inter_pred_sby in the non-RD coding flow. It
resolves the corresponding asan error.
Change-Id: Iff605f5921230966990013541cd855d698810922
This commit fixes a mismatched use case of block size in non-RD
intra prediction check. The residual SSE and variance should be
calculated per transform block size, instead of operating block
size, which caused chrome valgrind warning on conditional jump
based on uninitialized value (webm issue 823). This commit
resolves this issue.
Change-Id: I595c06599c7e0fd0e4a08736519ba68fc14bc79a
Also fix bugs related with corrupted frame handling.
Return VPX_CODEC_CORRUPT_FRAME when getting corrupted
block.
Change-Id: I7207ccc7c68c4df2b40b561315d16e49ccf7ff41
Refactoring to remove some duplication of probability
tables between tokenization and detokenization.
Change-Id: I2fc6a6497f9c0410021a9b41f828bc58a864e466
Use a weaker filter for second level arf frames.
Average gain across all sets and metrics ~0.3%
Remove code for arnr_type which is no longer
supported in VP9 which always uses a centered blur.
Re-factor and some cleanup.
Change-Id: Ieb4b8940e99e4e02b3fcc9fca6f2d4109e6ed639
pull the latest from libwebp.
Original source:
http://git.chromium.org/webm/libwebp.git
100644 blob 264210ba2807e4da47eb5d18c04cf869d89b9784 src/utils/thread.c
commit 46fd44c1042c9903b2f1ab87e9f200a13c7e702d
Author: James Zern <jzern@google.com>
Date: Tue Jul 8 19:53:28 2014 -0700
thread: remove harmless race on status_ in End()
if a thread was still doing work when End() was called there'd be a race
on worker->status_. in these cases, however, the specific value is
meaningless as it would be >= OK and the thread would have been shut
down properly, but we'll check 'impl_' instead to avoid any potential
TSan/DRD reports.
Change-Id: Ib93cbc226a099f07761f7bad765549dffb8054b1
Change-Id: Ib0ef25737b3c6d017fa74822e21ed58508230b91
Currently, vp9_diamond_search_sadx4() is only called when sse3 is
enabled, which is improper since sse2 optimization of sdx4df
functions are available. Changed to always use
vp9_diamond_search_sadx4().
Change-Id: I4b95d6b7a3c6c645783c373f0ba8d645ece24717
- fix indent, spelling
- drop some whitespace in some comments
- add an assert in vp9_setup_mask, it shouldn't be called on decode
error
Change-Id: Ic312a815e977a6f9cb81ceb7b039eeada76c5aa0
There are sse2 optimization of sdx4df functions. Instead of calling
vp9_refining_search_sadx4 only when sse3 is enabled, call it always.
Change-Id: I24f93818f7d4209d1425039e0eb099ff9ff08fe9
Use FAST_HEX in speed 5 and 6, which covers more points than
FAST_DIAMOND and improves motion search quality.
At speed 6, RTC set borg tests showed slight quality gain (psnr
gain: 0.143%, ssim gain: 0.226%). No noticeable encoding speed
change.
Change-Id: Ifa62875d9a52ee382ec494f271382bb77d8c67bf
This commit combined the full pel and sub pel motion search into a
single function to avoid code duplication. The commit does not change
encoder outputs.
Change-Id: Ibe18342c4f64073bef20f9cf6c6ca0a20d01bf0d
This commit enables a new quantization process for 32x32 2D-DCT
transform coefficient blocks. It improves the compression
performance of speed 5 by 1.4%. The overall compression gains of
speed 5 due to the new quantization scheme is 4.7%. It also includes
the SSSE3 implementation of the 32x32 quantization process.
Change-Id: I0855b124fd6462418683f783f5bcb44255c9993b
This patch fixes bug 633:
https://code.google.com/p/webm/issues/detail?id=633
The first decoded frame does not have to be a keyframe,
it could be an inter-frame that is coded intra-only.
This patch fixes the handling of intra-only frames.
A test vector has also been added that encodes 3
intra-only frames at the start of the clip. The
test vector was generated using the code in the
following patch:
https://gerrit.chromium.org/gerrit/#/c/70680/
Change-Id: Ib40b1dbf91aae2bc047e23c626eaef09d1860147
In the previous version, only certain buffers in the macroblockd were saved and
the restored. In this version, all of the buffers are saved and restored. The
code was then rolled into a loop for readability.
Also contains a tiny fix for when the -DOUTPUT_YUV_DENOISED flag is used.
Change-Id: Id925ef8b3fa122ae88acfa1d9a1e4df45df83518
Prepare for frame parallel decoding, the reference count buffers
need to be protected by mutex. Move vp9_thread.* to common
folder so that those buffers could use cross-platform mutex
from vp9_thread.*.
Change-Id: I541277cf15eefed6641555944f67f4a0bcdc8154
* Replace max_step_search_steps with constant MAX_MVSEARCH_STEPS
* Fold (reduce_first_step_size + speed > 5) into reduce_first_step_size
replacing uses of reduce_first_step_size that don't add the speed
check with zero.
Change-Id: Iae46395dbf3eaca138bf4d18b838a9e364b5a198
The y4m extension used is the same as the one used in ffmpeg/x264.
The patch is adapted from the highbitdepth branch.
Also adds unit tests for y4m header parsing and md5 check
of the raw frame data, as well as y4m writing.
[build fix for Mac/VS by not using tuples with strings]
Change-Id: I40897ee37d289e4b6cea6fedc67047d692b8cb46
vp9_rdopt is for making rd optimal mode decisions. vp9_rd is for all
other rd related routines. Anything used outside of making an rd optimal
decision belongs in rd.
Change-Id: I772a3073f7588bdf139f551fb9810b6864d8e64b
Moved the threshold adjustment before reference flag checking,
which could set the threshold to INT_MAX for disabled reference
frame, and cause overflow if the adjustment is done after that.
Change-Id: I85e94f8726d5e3ae93f65965aa978721dddc9957
Renamed updating_running_avg() to filter(). Extended function with the rest of
the filter procedure. Made all of the empirically-determined constants used in
VP8 into functions so they can be tweaked more easily.
Change-Id: I41730c8c92370c76885950a43742347477ca4e7e
As in VP8.
Currently, this parameter is set with the VP8E_SET_NOISE_SENSITIVITY flag.
The flag was not renamed so that we don't break the interface for webrtc. This
should probably be changed at some point in the future.
Change-Id: Ic73fcb0dde9d1d019e9d042050b617333ac65472
Add test code to turn multi-arf on and off depending
on group length and zero motion.
Changes to active max group length for mult-arf.
Fund second arf only from normal frame bits.
Change-Id: I920287fac1c886428c15a39f731a25d07c2b796c
This commit added a speed feature to control the step_param used in
full pixel motion search. The intention is to reduced the search
steps for high speed real time coding.
Change-Id: I21d2f0105c2b647783a6688615da7fcf2b6d670b
Adapt the use of segmentation in AQ mode 2 based on
the ambient kf/arf/gf Q.
Disable segmentation where the rate per SB is very
low and overheads are likely to outweigh the benefits.
This patch reduces the -ve average metrics impact
of AQ mode 2 while allowing stronger 3 segment AQ
in some cases. Average improvement ~0.5-1.0%.
Change-Id: I5892dfcc7507c5cc6444531cc7fe17554cf8d0c7
The y4m extension used is the same as the one used in ffmpeg/x264.
The patch is adapted from the highbitdepth branch.
Also adds unit tests for y4m header parsing and md5 check
of the raw frame data, as well as y4m writing.
Change-Id: Ie2794daf6dbafd2f128464f9b9da520fc54c0dd6
This commit re-designs the quantization process for transform
coefficient blocks of size 4x4 to 16x16. It improves compression
performance for speed 7 by 3.85%. The SSSE3 version for the
new quantization process is included.
The average runtime of the 8x8 block quantization is reduced
from 285 cycles -> 255 cycles, i.e., over 10% faster.
Change-Id: I61278aa02efc70599b962d3314671db5b0446a50
Add a conditional compile flag for this feature. Also add a
switch to enable the encoder to use these statistics in the
second pass. Currently, the switch is turned off.
Change-Id: Ia1c858c35ec90e36f19f5cffe156b97ddaa04922
The current threshold is knid of low, and in many cases NEWMV
mode is checked but not picked as the best mode. This patch
added a speed feature to increase NEWMV threshold, so that
less partition mode checking goes to check NEWMV. This feature
is enabled for speed 6 and 7.
Rtc set borg tests showed:
1. Speed 6, overall psnr: -0.088%, ssim: -1.339%;
Average speedup on rtc set is 11.1%.
2. Speed 7, overall psnr: -0.505%, ssim: -2.320%
Average speedup on rtc set is 12.9%.
Change-Id: I953b849eeb6e0d5a1f13eacba30c14204472c5be
pull the latest from WebP, which adds a worker interface abstraction
allowing an application to override init/reset/sync/launch/execute/end
this has the side effect of removing a harmless, but annoying, TSan
warning.
Original source:
http://git.chromium.org/webm/libwebp.git
100644 blob 08ad4e1fecba302bf1247645e84a7d2779956bc3 src/utils/thread.c
100644 blob 7bd451b124ae3b81596abfbcc823e3cb129d3a38 src/utils/thread.h
Local modifications:
- s/WebP/VP9/g
- camelcase functions -> lower with _'s
- associate '*' with the variable, not the type
Change-Id: I875ac5a74ed873cbcb19a3a100b5e0ca6fcd9aed
The caller should reset the state instead of letting worker
to reset.
This reverts commit 34b2ce15f9.
Change-Id: Idb546ea6386cffc44e98dee772900d21ab79710f
Encoder still uses SWITCHABLE as default via DEFAULT_INTERP_FILTER,
but does not override the default if it is not SWITCHABLE.
Change-Id: I3c0f6653bd228381a623a026c66599b0a87d01d5
When the frame is intra coded only, the encoder takes the RD
coding flow. Hence the function set_mode_info is not practically
in use. This commit removes it and the associated conditional
branches.
Change-Id: I1e42659ceb55b771ba712d1cdecacb446aa6460d
This fixes the hang in VP9/InvalidFileTest.ReturnCode/3
due to worker->had_error has not been reset after getting
error.
Change-Id: Ia3608225094758a2bd88f6ae4dd9dfd93bbaad27
For real time speed 7, once encode breakout is on(i.e. encoding
setting --static-thresh=1), a proper encode breakout threshold
is set to speed up the encoder.
Set --static-thresh=1, RTC set borg test showed a slight overall
psnr loss of 0.162%, but ssim gain of 0.287%. The average speedup
on RTC set is 6%, and for some clips, the speedup can be 10+%.
Change-Id: Id522d9ce779ff7c699936d13d0c47083de4afb85
Before encoding a frame, calculate and store each 16x16 block's
variance of source difference between last and current frame.
Find partitioning threshold T for the frame from its variance
histogram, and then use T to make partition decisions.
Comparing with fixed 16x16 partitioning, rtc set test showed an
overall psnr gain of 3.242%, and ssim gain of 3.751%. The best
psnr gain is 8.653%.
The overall encoding speed didn't change much. It got faster for
some clips(for example, 12% speedup for vidyo1), and a little
slower for others.
Also, a minor modification was made in datarate unit test.
Change-Id: Ie290743aa3814e83607b93831b667a2a49d0932c
the buffer is only used in encoding and only when
CONFIG_INTERNAL_STATS or CONFIG_VP9_POSTPROC is enabled.
a future change should decouple this from the frame buffer allocation
and make it conditional based on runtime flags when the above config
options are enabled.
reduces decode heap usage by at least 12%
Change-Id: Id0b97620d4936afefa538d3aadf32106743d9caf
This reverts commit b336356198.
This causes a hang in:
VP9/InvalidFileTest.ReturnCode/3
the change to test/user_priv_test.cc remains with a minor update
Change-Id: I4a8a272ca37ea329b0f413f0b1cd827a238bd9fd
This patch checks that a decoder never tries to reference frame that's
outside the range of 2x to 1/16th the size of this frame. Any attempt
to do so causes a failure.
Change-Id: I5c98fa7bb95ac4f29146f29dd92b62fe96164e4c
Bug introduced in I930dced169c9d53f8044d2754a04332138347409. If
svc.number_temporal_layers == 1 and svc.number_spatial_layers == 1, the system
attempt to do spatial SVC. It no longer does that.
Change-Id: Ie6b130a72b1eea40c547c9a64447e40695f811c5
For the primary arf in a group, if multiple arfs
are enabled and we were using arfs in the previous
group, then allow the second arf from the previous
group to be used as an additional reference.
Change-Id: Iaf41706a52f54ef21548026851cd77100d6aebda
This commit enables an adaptive transform size selection method
for speed -6. It uses largest transform size when the sse is more
than 4 times of variance, i.e., most energy is compacted in the
DC coefficient. Otherwise, use the default TX_8X8. It improves
the compression efficiency for rtc set of speed -6 by 0.8%, no
speed change observed.
Change-Id: Ie6ed1e728ff7bf88ebe940a60811361cdd19969c
This patch allows the encoder to skip the partition search for the
frame if it is an inter frame and only zero motion vectors have
been detected in the first pass. The partition size is directly
assigned according to the difference variance.
Borg tests show overall little performance changes in term of PSNR
(derf -0.027%, yt 0.152%, hd 0.078%, stdhd 0%). The worst case of
PSNR loss is -0.514% from yt. The best PSNR gain is 4.293% from yt.
The second pass encoding speedup for slideshow clips is 15%-40%.
Change-Id: I881f347d286553ee5594a9ea09ba1a61ac684045
This commit enables a fast reference motion vector search scheme.
It checks the nearest top and left neighboring blocks to decide the
most probable predicted motion vector. If it finds the two have
the same motion vectors, it then skip finding exterior range for
the second most probable motion vector, and correspondingly skips
the check for NEARMV.
The runtime of speed -5 goes down
pedestrian at 1080p 29377 ms -> 27783 ms
vidyo at 720p 11830 ms -> 10990 ms
i.e., 6%-8% speed-up.
For rtc set, the compression performance
goes down by about -1.3% for both speed -5 and -6.
Change-Id: I2a7794fa99734f739f8b30519ad4dfd511ab91a5
Bug introduced during multiple iterations on: I3831*
gf_group->arf_update_idx[] cannot currently be used
to select the arf buffer index if buffer flipping on overlays
is enabled (still currently the case when multi arf OFF).
Change-Id: I4ce9ea08f1dd03ac3ad8b3e27375a91ee1d964dc
This commit fixes the potential issue in the non-RD mode decision
flow that only checks part of the block to estimate the cost. It
was due to the use of fixed transform size, in replacing the
largest transform block size. This commit enables per transform
block cost estimation of the intra prediction mode in the non-RD
mode decision.
Change-Id: I14ff92065e193e3e731c2bbf7ec89db676f1e132
same reasoning as:
9f3a0db vp9_rtcd: correct avx2 references
these are all intrinsics, so don't depend on x86inc.asm
Change-Id: I915beaef318a28f64bfa5469e5efe90e4af5b827
This patch reverts the previous revert from Jim and also add a
variable user_priv in the FrameWorker to save the user_priv
passed from the application. In the decoder_get_frame function,
the user_priv will be binded with the img. This change is needed
or it will fail the unit test added here:
https://gerrit.chromium.org/gerrit/#/c/70610/
This reverts commit 9be46e4565.
Change-Id: I376d9a12ee196faffdf3c792b59e6137c56132c1
Cosmetic patch only in response to comments on
previous patches suggesting a couple of name changes
for consistency and clarity.
Change-Id: Ida3a359b0d5755345660d304a7697a3a3686b2a3
the max is 6. there are assumptions throughout the decode regarding
this; fixes a crash with a fuzzed bitstream
$ zzuf -s 5861 -r 0.01:0.05 -b 6- \
< vp90-2-00-quantizer-00.webm.ivf \
| dd of=invalid-vp90-2-00-quantizer-00.webm.ivf.s5861_r01-05_b6-.ivf \
bs=1 count=81883
Change-Id: I6af41bb34252e88bc156a4c27c80d505d45f5642
This commit replaces a few use cases of cpi->common with preset
variable cm, to avoid unnecessary pointer fetch in the non-RD
coding mode.
Change-Id: I4038f1c1a47373b8fd7bc5d69af61346103702f6
In real-time speed 6, no partition search is done. The inter
prediction results got from picking mode can be reused in the
following encoding process. A speed feature reuse_inter_pred_sby
is added to only enable the resue in speed 6.
This patch doesn't change encoding result. RTC set tests showed
that the encoding speed gain is 2% - 5%.
Change-Id: I3884780f64ef95dd8be10562926542528713b92c
There is a normative scaling range of (x1/2, x16)
for VP9. This patch fixes the maximum downscaling
tests that are applied in the convolve function.
The code used a maximum downscaling limit of x1/5
for historic reasons related to the scalable
coding work. Since the downsampling in this
application is non-normative it will revert to
using a separate non-normative scaler.
Change-Id: Ide80ed712cee82fe5cb3c55076ac428295a6019f
Add indirection to the section of buffer indices.
This is to help simplify things in the future if we
have other codec features that switch indices.
Limit the max GF interval for static sections to fit
the gf_group structures.
Change-Id: I38310daaf23fd906004c0e8ee3e99e15570f84cb
Fix some bugs relating to the use of buffers
in the overlay frames.
Fix bug where a mid sequence overlay was
propagating large partition and transform sizes into
the subsequent frame because of :-
sf->last_partitioning_redo_frequency > 1 and
sf->tx_size_search_method == USE_LARGESTALL
Change-Id: Ibf9ef39a5a5150f8cbdd2c9275abb0316c67873a
This patch implements a mechanism for inserting a second
arf at the mid position of arf groups.
It is currently disabled by default using the flag multi_arf_enabled.
Results are currently down somewhat in initial testing if
multi-arf is enabled. Most of the loss is attributable to the
fact that code to preserve the previous golden frame
(in the arf buffer) in cases where we are coding an overlay
frame, is currently disabled in the multi-arf case.
Change-Id: I1d777318ca09f147db2e8c86d7315fe86168c865
The encoder currently allocates frame buffers before
it establishes what the chroma sub-sampling factor is,
always allocating based on the 4:4:4 format.
This patch detects the chroma format as early as
possible allowing the encoder to allocate buffers of
the correct size.
Future patches will change the encoder to allocate
frame buffers on demand to further reduce the memory
profile of the encoder and rationalize the buffer
management in the encoder and decoder.
Change-Id: Ifd41dd96e67d0011719ba40fada0bae74f3a0d57
This patch insures that the last byte of a chunk that contains a
valid superframe marker byte, actually has a proper superframe index.
If not it returns an error.
As part of doing that the file : vp90-2-15-fuzz-flicker.webm now fails
to decode properly and moves to the invalid file test from the test
vector suite.
Change-Id: I5f1da7eb37282ec0c6394df5c73251a2df9c1744
Avoids failures:
MSE_ClearKey/EncryptedMediaTest.Playback_VP9Video_WebM/0
MSE_ClearKey_Prefixed/EncryptedMediaTest.Playback_VP9Video_WebM/0
MSE_ExternalClearKey_Prefixed/EncryptedMediaTest.Playback_VP9Video_WebM/0
MSE_ExternalClearKey/EncryptedMediaTest.Playback_VP9Video_WebM/0
MSE_ExternalClearKeyDecryptOnly/EncryptedMediaTest.Playback_VP9Video_WebM/0
MSE_ExternalClearKeyDecryptOnly_Prefixed/EncryptedMediaTest.Playback_VP9Video_WebM/0
SRC_ExternalClearKey/EncryptedMediaTest.Playback_VP9Video_WebM/0
SRC_ExternalClearKey_Prefixed/EncryptedMediaTest.Playback_VP9Video_WebM/0
SRC_ClearKey_Prefixed/EncryptedMediaTest.Playback_VP9Video_WebM/0
Patches are
This reverts commit 9bc040859b
This reverts commit 6f5aba069a
This reverts commit 9bc040859b
I1f250441 Revert "Refactor the vp9_get_frame code for frame parallel."
Ibfdddce5 Revert "Delay decreasing reference count in frame-parallel decoding."
I00ce6771 Revert "Introduce FrameWorker for decoding."
Need better testing in libvpx for these commits
Change-Id: Ifa1f279b0cabf4b47c051ec26018f9301c1e130e
When decoding in serial mode, there will be only
one FrameWorker doing decoding. When decoding in
parallel mode, there will be several FrameWorkers
doing decoding in parallel.
Change-Id: If53fc5c49c7a0bf5e773f1ce7008b8a62fdae257
See: https://code.google.com/p/chromium/issues/detail?id=362697
The code properly catches an invalid stream but seg faults instead of
returning an error due to a buffer not having been initialized. This
code fixes that.
Change-Id: I695595e742cb08807e1dfb2f00bc097b3eae3a9b
s/stdint.h/vpx\/vpx_int.h
Added missing 'break;'s
Also included other minor changes, mostly cosmetic.
Change-Id: I852bba3e85e794f1d4af854c45c16a23a787e6a3
The test for this is in test vector code ( show existing frames will
fail ). I can't check it in disabled as I'm changing the generic
test code to do this:
https://gerrit.chromium.org/gerrit/#/c/70569/
Change-Id: I5ab324f0cb7df06316a949af0f7fc089f4a3d466
This commit allows the key frame to search through more prediction
modes and more flexible block sizes. No speed change observed. The
coding performance for rtc set is improved by 1.7% for speed -5 and
3.0% for speed -6.
Change-Id: Ifd1bc28558017851b210b4004f2d80838938bcc5
A superframe is a bunch of frames that bundled as one frame. It is mostly
used to combine one or more non-displayable frames and one displayable frame.
For frame parallel decoding, libvpx decoder will only support decoding one
normal frame or a super frame with superframe index.
If an application pass a superframe without superframe index or a chunk
of displayable frames without superframe index to libvpx decoder, libvpx
will not decode it in frame parallel mode. But libvpx decoder still could
decode it in serial mode.
Change-Id: I04c9f2c828373d64e880a8c7bcade5307015ce35
This breaks the profile 1 bitstream.
Don't force non420 uv transform size to 1/4 y size. In the 4:2:0 case the
chroma corresponding to a luma block is 1/4 its size. In the 4:4:4 case
chroma and luma planes are the same size. Disallowing larger transforms
can result in a loss of compression efficiency and is inconsistent.
For sub-8x8 blocks only average corresponding motion vectors.
4:2:0 and profile 0 behavior remains unchanged.
Change-Id: I560ae07183012c6734dd1860ea54ed6f62f3cae8
Speed 6 uses small tx size, namely 8x8. max_intra_bsize needs to
be modified accordingly to ensure valid intra mode checking.
Borg test on RTC set showed an overall PSNR gain of 0.335% in speed
-6.
This also changes speed -5 encoding by allowing DC_PRED checking
for block32x32. Borg test on RTC set showed a slight PSNR gain of
0.145%, and no noticeable speed change.
Change-Id: I1502978d8fbe265b3bb235db0f9c35ba0703cd45
This is the first step to rework the rate-distortion modeling used
in rtc coding mode. The overall goal is to make the modeling
customized for the statistics encountered in the rtc coding.
This commit makes encoder to perform rate-distortion modeling for
DC and AC coefficients separately. No speed changes observed.
The coding performance for pedestrian_area_1080p is largely
improved:
speed -5, from 79558 b/f, 37.871 dB -> 79598 b/f, 38.600 dB
speed -6, from 79515 b/f, 37.822 dB -> 79544 b/f, 38.130 dB
Overall performance for rtc set at speed -6 is improved by 0.67%.
Change-Id: I9153444567e5f75ccdcaac043c2365992c005c0c
This patch allows the VP9 encoder to skip the un-necessary
motion search in the first pass. It computes the motion error
of 0,0 motion using the last source frame as the reference,
and skips the further motion search if this error is small.
Borg test shows overall the patch gives PSNR gain (derf -0.001%,
yt 0.341%, hd 0.282%). Individual clips may have PSNR gain or
loss. The best PSNR performance is 7.347% and the worst is -0.662%.
The first pass encoding speedup for slideshow clips is over 30%.
Change-Id: I4cac4dbd911f277ee858e161f3ca652c771344fe
This commit fixes frame header decoding for superframe index, to
prevent out of boundary memory read triggered by fuzz test
vector. It resolves a chromium security violation issue
crbug.com/376802.
The issue was introduced in the change:
Add VPXD_SET_DECRYPTOR support to the VP9 decoder.
cl-id I88f86c8ff9af34e0b6531028b691921b54c2fc48
where the buffer was read before validation check on index offset
applied.
A test vector is added accordingly.
Change-Id: I41c988e776bbdd1033312a668e03a3dbcf44ca99
This patch appears to have introduced non-determinism and/or
mismatch from debug vs release.
This reverts commit 5daef90efc.
Change-Id: I80081e55cfeaaa821b510b58a4e6e6328003c7da
The current decoding scheme will decrease the reference count
of the output frame when finish decoding. Then the application
could copy the frame from the decoder buffer to application buffer.
In frame-parallel decoding, a decoded frame will not be outputted
until several frames later which depends on thread numbers. So
the decoded frame's reference count should be decreased only
after application finish copying the frame out. But due to the
limitation of vpx_codec_get_frame, decoder could not know when
application finish decoding. So use a index last_show_frame to
release the last output frame's reference count.
Change-Id: I403ee0d01148ac1182e5a2d87cf7dcc302b51e63
This commit enables a fast path computational flow for forward
transformation. It checks the sse and variance of prediction
residuals and decides if the quantized coefficients are all
zero, dc only, or more. It then selects the corresponding coding
path in the forward transformation and quantization stage.
It is currently enabled in rtc coding mode. Will do it for rd
coding mode next.
In speed -6, the runtime for pedestrian_area 1080p at 1000 kbps
goes down from 14234 ms to 13704 ms, i.e., about 4% speed-up.
Overall coding performance for rtc set is changed by -0.18%.
Change-Id: I0452da1786d59bc8bcbe0a35fdae9f623d1d44e1
This patch allows the encoder to skip the
un-neccessary motion search in the first pass. It
calculates the error of the zero motion vector using
the last source frame as reference and skips the
further motion search in the first pass if the error
is small.
The encoding speedup of the first pass for slideshow
videos is over 30%. Borg test shows the overall PSNR
performance remain approximately the same (derf -0.009,
hd 0.387, yt 0.021, stdhd 0.065). Individual clips may
have either PSNR gain or loss. The worst PSNR perfomance
is from yt set, with a PSNR loss of -1.1.
Change-Id: I08b2ab110b695e4689573b2567fa531b6457616e
* Only use ZEROMV, disalowing the intra modes that were previously
tested.
* Score rate and distortion as zero.
Change-Id: Ifcf99e272095725f11da1dcd26bd0f850683e680
In non frame-parallel decoding, this works the same way as
current decoding scheme. Every time after decoder finish
decoding a frame, it will swap the current mode info pointer
and previous mode info pointer if the decoded frame needs
to be shown. Both mode info pointer and previous mode info
pointer are from mode info arrays.
In frame-parallel decoding, this will become more complicated
as current frame's mode info pointer will be shared with next
frame as previous mode info pointer. But when one decoder
thread finishes decoding one frame and starts to work on next
available frame, it needs to retain the decoded frame's mode
info pointers until next frame finishes decoding. The mode info
index will serve this purpose. The decoder will use different
buffer in the mode info arrays and use the other buffer to save
previous decoded frame’s mode info.
Change-Id: If11d57d8eb0ee38c8876158e5482177fcb229428
tests failing under Win32/Win64
+ dct16x16_test: add missing avx2 functions (partially disabled)
exercises the forward transforms
no idct/iht implementations, so the c-code is used
Change-Id: I04f64a457fa0828a00f32b5c9fe4f55294f21f61
In non-rd real-time mode, choosing smaller transform size in
encoding gives better video quality and good speed gain than
choosing larger transform size. This patch set tx size search
method to ALLOW_8X8, which is better than using 4x4 or other
larger sizes.
Borg tests on rtc set at speed 6 showed significant gain on quality.
PSNR gain: 11.034% and SSIM gain: 15.466%.
The speed gain is 5% - 12% for <720p clips, and 2% - 7% for
720p clips.
Change-Id: If4dc74ed2df359346b059f47fb73b4a0193ec548
Use of stack frame variable "fps" beyond the lifetime of the function.
fps is sent as a paremeter to output_stats and stored in the
packet holding this encoded frame. This has scope beyond the
lifetime of the calling function.
This reverts commit 3f95a230c7
Change-Id: Icd8e14b3d7dd733590ada12e619b9dce95b6b0f5
The SSSE3 implementation might find a potential overflow issue in
its second 1-D transform, if all input residual pixels are close to
255. This commit fixes the issue and re-enables the unit test on
the SSSE3 version.
Change-Id: I0520478abdab7afd3ff2842516bec951111e9b3c
Right now there is just one place to check: xd->lossless and for the first
pass there is a function is_lossless_requested().
Change-Id: I949a6834e64ce51e422e2892f097f2b871b5429a
In Aq mode 2 for kf/arf/gf the segment q delta
is calculated and then applied by re-quantization without
going through the rd loop again. If the base Q != 0
but the segment Q == 0 (lossless) this can could give rise
to a situation where we have an illegal combination of
transform size and Q. (Q == 0 requires that all blocks
are coded 4x4 WHT).
Change-Id: I241a58c6494ed442e9e4630070b0cde0fb99ae45
As a side-effect, the sad unit tests for VP8 and VP9
had to be separated.
Fixes a bug in original patch:
(https://gerrit.chromium.org/gerrit/#/c/70163/8)
that was reverted due to a nightly test failure.
Change-Id: Ia2a4e9e278fd3c89d6c3c82fcc6381320ec2a8a6
In frame parallel decoding mode, there will be still several frames inside
the decoder when application stop calling vpx_codec_decode to decode frames.
The application will need to keep calling vpx_codec_get_frame to get all the
remaining decoded frames in the decoder.
Change-Id: I2ce8260a91282f045bb9a6093ff8a606b1990f14
This commit added a call to set speed feature before initializing
motion search, fixed the problem where unintialized search method
is used before its value being set.
Change-Id: I537e4612bf0d00fd6f51396fd222d4b3bd6fde58
Making this consistent with intra mode masks: you need to specify
allowed inter/intra modes to use.
Change-Id: Iaecd28bf79047259707d8e7a59a57bb7b856383e
SEG_LEVEL_SKIP requires the block size to be at least 8x8. Attempting to
use it on smaller partitions causes the decoder to reject the bitstream.
Change-Id: Ia7188cdf8ae5ac1df6bd29f3f80dbb0610e1f7b1
An overflow issue could potentially happen in the second round 1-D
transform of the SSSE3 full inverse 16x16 2D-DCT. This commit fixes
this issue.
Change-Id: Ia19e4888fda1cc929a28a5f89a5beec612d628dc
This commit enables SSSE3 implementation of the inverse 2D-DCT
with only first 10 coefficients non-zero. It reduces the runtime
of SSE2 version from 745 cycles to 538 cycles, i.e., 27% speed-up.
Change-Id: I18ba4128859b09c704a6ee361d69a86c09fe8dfe
This code dates from the ancient past and
applied an error score weighting based on pixel
brightness. This not seem to be providing any
benefit metrics wise and could be making some
visual issues in dark frames worse.
The field is left in place in the FIRSTPASS_STATS data
structure in this patch, pending changes to unit tests that
use a pre-defined first pass file.
Change-Id: Id50f04205230234858e7548ce523f11acaf3567d
The subpixel SSSE3 was fixed in this patch:
https://gerrit.chromium.org/gerrit/#/c/70283/
So the equivalent AVX2 is fixed accordingly.
Change-Id: Ieebbc1949c99d34b12b8b47692df71aca5001f3a
This commit enables the SSSE3 implementation of full inverse 16x16
2D-DCT. The unit runtime goes down from 1642 cycles to 1519 cycles,
about 7% speed-up.
Change-Id: I14d2fdf9da1fb4ed1e5db7ce24f77a1bfc8ea90d
Further changes to first pass allocation for gf/arf groups.
Three variables removed from TWO_PASS structure as only
now used locally. Dont adjust gf_group_bits in the post
encode update as this will no longer have any effect.
Change-Id: Iff89b225db923fc856f5d2aedbc899f1d7d68b55
In 8-tap filtering, to guarantee the intermediate results fit in
16 bits, the order of accumulating the products needs to be done
correctly, and the largest product should be added last. This
patch fixed the problem using the method in commit "Correct ssse3
8/16-pixel wide sub-pixel filter calculation".
Change-Id: I79d0ad60c057b15011ece84cda9648eee0809423
Restructuring to allocate the bits for each frame in
a GF group at the time the group is defined.
At the moment the allocation closely mirrors what
we had before.
Also changes the default rate adjustment method to
LONG_TERM_VBR_CORRECTION.
Change-Id: Ie5793c46c6b9c888cead5d8790792efd7d60b7c1
As mismatchs were found between the intrinsic version and c only. The
commit temporarily revert to use the matching assembly version to
allow further investigation.
Change-Id: I08436c47d4888b562c0eac8e8856d90a831442df
Use the appropriate subblock offset mode info rather than the parent
block base, when filling mbmi in the pc tree in nonrd_use_partition.
This mimics what is done in the vertical case and what is done for
both cases in nonrd_pick_partition.
This change has little practical effect at the moment since in speed 5
rt horizontal and vertical partitions are currently only used unpaired
at edges of the picture.
Change-Id: I4632f66ca84086dac56c7d36b45ddbe38a06f42a
This did the same correction as the one in commit "Correct ssse3
8/16-pixel wide sub-pixel filter calculation" to avoid saturation
during filtering.
Change-Id: Ife9aa3f62daf9114eb24fe38f7baa3c3f361b2d6
If we are already saving a lot in bits from the target (maximum)
bitrate in the constrained quality mode, allow the quantizer
to go lower than the cq level. This hopefully will solve issues
with getting too low a bitrate and consequently poor quality for
certain videos in cq mode.
Change-Id: I1c4e8b0171fcf58f95198b3add85eea5f3c8f19f
Renames all x86_64 specific assembly files to consistently
end in _x86_64.asm. This will be useful for build systems to
handle these files differently.
All new 64-bit specific assembly files should use the new
naming convention.
Change-Id: I36c89584967c82ffc4088b1b5044ac15d2bb7536
This commit changed to enable the encoder to adjust motion dection
speed threshold based on picture size. In addition, cpu-used 1 now
does a partition search every other frame instead of every third
frame for low resolution inputs.
The change has no quality/speed impact for 720p and above. Test
showed the change increase encoding time by between 3% to 6% for
cpu-used 2 encodiong of 360p sequences. It also has a compression
gain about .3%.
For cpu-used 2, the change resolved some very disturbing visual
artifacts in certain sequences when large block partitionings and
transforms are used as a result of copying the partition from a
previous frame.
Change-Id: Ic7fd22508cdb811d4ca935655adbf20109286cfa
The final goal is eventually to get rid of both itxm_add and fwd_txm4x4.
This patch does it in the decoder.
Change-Id: Ibb3db57efbcbb1ac387c6742538a9fcf2c6f24a5
The current decode_tiles decodes the frame one tile by one tile
and then loopfilter the whole frame or use another worker thread to
do loopfiltering.
|------|------|------|------|
|Tile1-|Tile2-|Tile3-|Tile4-|
|------|------|------|------|
For example, if a tile video has one row and four cols, decode_tiles
will decode the Tile1, then Tile2, then Tile3, then Tile4.
And during decode each tile, decode_tile will decode row by row in
each tile.
For frame parallel decoding, decode_tiles will decode video in row order
across the tiles. So the order will be:
"Decode 1st row of Tile1" -> "Decode 1st row of Tile2"
-> "Decode 1st row of Tile3" -> "Decode 1st row of Tile4"
-> "Decode 2nd row of Tile1" -> "Decode 2nd row of Tile2"
-> "Decode 2nd row of Tile3" -> "Decode 2nd row of Tile4"-> "loopfilter 1st row"
Change-Id: I2211f9adc6d142fbf411d491031203cb8a6dbf6b
This commit adjusts the forward 16x16 DCT computation steps to
simplify the register level operations. It fixes the corresponding
sse2 version accordingly.
Change-Id: I72a9c25b8ca9442fc5e113f47cd701ae55aa7f08
Added a skipping test in non-rd inter-mode. After interpolation
prediction step, the residuals are tested to see if they will be
quantized to 0 based on modeling between spatial domain and
frequency domain.
Set static-thresh to 800 for >=720p and 300 for <720p, rtc set
tests showed
1. Speed 5, psnr: -0.514%; ssim: -1.748%;
speedup on related clips: 5% -11%
2. Speed 6, psbr: -0.628%; ssim: -1.637%;
speedup on related clips: 4% - 9%
Change-Id: I62fbf26bc043ecd2b584f255f1a4ee5ab52bfcf3