This commit integrates the non-RD mode decision process and the
encoding process into a single recursion scheme.
Change-Id: I6a7e72a0b84d567554801ebbe01ec75d54c1f77d
This patch was to fix the vpxdec fuzzing3 test failure. When an
error occurs, setjmp() is invoked, which calls the decoder
removing routine. In multiple thread situation, other threads
could try to access the frame context memory that is already
deallocated, thus causing a segfault.
An invalid unit test was added for this issue.
Change-Id: Ida7442154f3d89759483f0f4fe0324041fffb952
The aim of this patch is to apply a positive weighting to
frames that have a significant number of blocks that are
of low spatial complexity and are dark. The rationale behind
this is that artifacts tend to be more visible in such frames.
In this patch the weight is only applied in regard to the distribution
of bits between frames. Hence if all the frames share similar
characteristics (as is the case for most of our short test clips) there
will be little or no net effect.
However, the effect can be seen on some longer form test content.
For example Tears of steel baseline test:
2323.09 Kbit/s opsnr 39.915 ssim 74.729
With this patch:-
2213.34 Kbit/s opsnr 39.963 ssim 74.808
(Sligtly better metrics and about 5% smaller)
The weighting may well need some further tuning along side changes
to the aq modes.
Change-Id: Ieced379bca03938166ab87b2b97f55d94948904c
This commit removes the cyclic aq mode dependency on
in_static_area and reworks the corresponding cut-off thresholds.
It improves the compression performance of speed -5 by 1.47% in
PSNR and 2.07% in SSIM, and the compression performance of speed
-6 by 3.10% in PSNR and 5.25% in SSIM. Speed wise, about 1% faster
in both settings at high bit-rates.
Change-Id: I1ffc775afdc047964448d9dff5751491ba4ff4a9
This will save the memory and improve the decode speed due to
removing unnecessary memset of big prev_mi array for
all the key frames.
Decoding a all key frames 1080p video shows speed improve around 2%.
Change-Id: I6284a445c1291056e3c15135c3c20d502f791c10
This commit makes the RTC coding mode to conditionally skip the
reference frame mode search, when the predicted motion vector of
the current reference frame gives more than two times sum of
absolute difference compared to that of other reference frames.
It reduces the runtim by 1% - 4% for speed -5 and -6. The average
compression performance is improved by about 0.1% in both settings.
It is of particular benefit to light change scenarios. The
compression performance of test clip mmmovingvga.y4m is improved by
6.39% and 15.69% at high bit rates for speed -5 and -6, respectively.
Speed -5
vidyo1 16555 b/f, 40.818 dB, 12422 ms ->
16552 b/f, 40.804 dB, 12100 ms
nik 33211 b/f, 39.138 dB, 11341 ms ->
33228 b/f, 39.139 dB, 11023 ms
mmmoving 33263 b/f, 40.935 dB, 13508 ms ->
33256 b/f, 41.068 dB, 12861 ms
Speed -6
vidyo1 16541 b/f, 40.227 dB, 8437 ms ->
16540 b/f, 40.220 dB, 8216 ms
nik 33272 b/f, 38.399 dB, 7610 ms ->
33267 b/f, 38.414 dB, 7490 ms
mmmoving 33255 b/f, 40.555 dB, 7523 ms ->
33257 b/f, 40.975 dB, 7493 ms
Change-Id: Id2aef76ef74a3cba5e9a82a83b792144948c6a91
This commit unfolds the legacy macro definitions used in the
sub-pixel motion search and refactors the operational flow for
later optimizations.
Change-Id: I3e3f770cad961d03d1a6eb0b2a0186cc77eaf2b8
The current logic was allowing for disabling golden refresh only
for two pass svc encoding. This change disables it as long as
more than 1 layer encoding is used (for example temporal layers under 1pass CBR).
Change-Id: I4dc5204a7ad365c821ec7963e93b59da82e1826b
In the function mb_lpf_horizontal_edge_w_avx2_16 the usage of the intrinsic
_mm256_cvtepu8_epi16 cause a compiler bug in gcc 4.9.1.
until it will be fixed I created a workaround that create the up convert by
using broadcast128+shuffle.
The bug was reported here:
https://code.google.com/p/webm/issues/detail?id=867
Change-Id: I73452e6806f42e0fadcde96b804ea3afa7eeb351
A recent change has introduced big quality drops for speed 7 and 12
for --rt mode. The change reverted the big drop and improved quality
by 9.5% for speed 7 and 13.4% for speed 12.
Change-Id: I07b82e3bb6002a73af486a083458c88877bdad01
This will save a lot of memory for decoder due to removing of prev_mi,
but prev_mi is still needed in encoder. So this will increase a little bit
memory for encoder.
Change-Id: I24b2f1a423ebffa55a9bd2fcee1077dac995b2ed
This commit makes the inter prediction buffer system to support
hybrid partition search. It reduces the runtime of speed -5 by
about 3%. No compression performance change.
vidyo1 720p 1000 kbps
11831 ms -> 11497 ms
nik 720p 1000 kbps
10919 ms -> 10645 ms
Change-Id: I5b2da747c6395c253cd074d3907f5402e1840c36
Combined vp9_denoiser_8xM_sse2 and vp9_denoiser_4xM_sse2 into one
function vp9_denoiser_NxM_sse2_small and passed the bitexact testing.
Changed the name of the function vp9_denoiser_64_32_16xM_sse2 to
vp9_denoiser_NxM_sse2_big.
Change-Id: Ib22478df585994dd347ebae04202c0b701e7f451
This commit changes to allow the usage of golden reference frame in
VP9 CBR mode to improve quality. VP9 supports potentially up to 8
reference buffers, it has reference buffers available for this
purpose. This was not possible in VP8 as golden and alt-ref buffers
were used for temporal scalability purpose in CBR mode in WebRTC.
For frames that update golden frame, there can be a quality boost.
The amount of allowed bitrate boost can be controlled via parameter
rc_max_inter_bitrate_pct. The inital value of the boost ratior is
currently based on over_shoot_pct. Further experiments will work
out the adaption of this boost value.
Change-Id: I0c5f010c8fd8b7b598f69779c1b30e5b2ac30a4d
Added code to relax the active maximum Q in response
to extreme local overshoot to reduce bandwidth peaks.
The impact is small in metrics terms, but it this helps reduce
bandwidth spikes and overall overshoot in a number of
clips in our tests sets (especially the YT test set).
In particular this should help prevent very big spikes where a clip
is mainly easy but has a short hard section. In such a case a choice
of maximum Q for the clip as a whole may allow us to hit the overall
target rate but give some extreme spikes. The chunked encoding in YT
mitigates this problem but it can show up where a longer clip is
coded as a single chunk.
Change-Id: I213d09950ccb8489d10adf00fda1e53235b39203
The zero motion vector was effectively used in the subsampled pixel
based variance calculation. This commit makes it directly use zero
mv to generate prediction.
Change-Id: Ica83dc843e9f8da2f89c3ef451e50f16214c0def
0 means that golden boost is off, and uses average frame target rate,
a non-zero number means the percentage of boost over average frame
bitrate is given initially to golden frames in CBR mode.
Change-Id: If4334fe2cc424b65ae0cce27f71b5561bf1e577d
The point at which frames are scaled to their
coded dimensions is moved into the re-code loop.
This is in preparation for a further patch that
will add logic into the re-code loop to reduce
the coded frame size if the encoder is struggling
to hit the target data rate at the native frame
size.
Change-Id: Ie4131f5ec6fb93148879f6ce96123296442bf2d1
Add second level arf Q adjustment when using dual arfs
in constant Q mode.
Previously in constant Q mode enabling dual arf hurt by ~5%
but with this change the average benefit is ~1-1.5% with some
mid range data points up ~10%.
Note however that it still hurts on some clips including
some very low motion show content.
Change-Id: I5b7789a2f42a6127d9e801cc010c20a7113bdd9b
This patch allocated frame contexts outside VP9_COMMON. This allows
multiple threads to share the same copy of frame contexts, and
reduces the overhead. It also guarantees the correct update of
these contexts during bitstream packing. This patch doesn't change
encoding result.
Change-Id: Ic181a2460b891d1d587278a6d02d8057b9dbd353
The initialization of this_mode_pred does not work when the ref_frame
loop ever goes beyond LAST_FRAME. This commit fixes the subtle issue
and allows potentially expanding the loop to test GOLDEN_FRAME.
Change-Id: Ibbd427a22160d1d9eacb8ed0c87f88d6cef9c0f3
This commit refactors the rate distortion structure used in the
non-RD coding mode and saves a few RDCOST calculations.
Change-Id: I62c3416c300d2c5372f21b96d93a6b633a34ab3a
The existing speed features produce horrible encoding results, almost
30% worse than cpu-used=4, this commit adjust the speed features to
produce relatively resonable results to be within 3%-5% of cpu-used=4.
Change-Id: I0ca6ebafb33024d4a0cbcf04c78a4a00b8dd1ecf
Its functionality has been replaced with choose_partitioning and
threshold based control on split mode check.
Change-Id: Ic9bb321df06b524f5c38ea5874dc6f6a8f93c5e3
This speed feature has been deprecated in both yt and rtc coding
modes. This commit removes the related operations.
Change-Id: I079c79c6adafe45581af2ebf8b98faebcface1ce
This commit re-designs the recursive partition search scheme in
rtc speed -5. It first checks if the current block is under cyclic
refresh mode. If so, apply recursive partition search. Otherwise,
perform sub-sampled pixel based partition selection. When the
pre-selection finds the partition size should be 32x32 or above,
use the partition size directly. Otherwise, apply partition search
at nearby levels around the preset partition size.
It is enabled in speed -5. The compression performance of rtc
speed -5 is improved by 9.4%. Speed wise, the run-time goes slower
from 1% to 10%.
nik_720p, 1000 kbps
33220 b/f, 38.977 dB, 10109 ms -> 33200 b/f, 39.119 dB, 10210 ms
vidyo1_720p, 1000 kbps
16536 b/f, 40.495 dB, 10119 ms -> 16536 b/f, 40.827 dB, 11287 ms
Change-Id: I65adba352e3adc03bae50854ddaea1b421653c6c
Extend --auto-alt-ref from parameter so we can use it to
turn multi-arf on and off from the command line.
For now the range is 0-off, 1-on, 2-multi-arf on.
Rename play_alternate to enable_auto_arf
Change-Id: Id7b64407cfbe76ba0090a83b588a03e22a240386
All sad function that process above 32 consecutive elements are optimized
for AVX2:
vp9_sad64x64
vp9_sad64x32
vp9_sad32x64
vp9_sad32x32
vp9_sad32x16
vp9_sad64x64_avg
vp9_sad64x32_avg
vp9_sad32x64_avg
vp9_sad32x32_avg
vp9_sad32x16_avg
The functions that appeared as a hotspot is vp9_sad32x32 and vp9_sad64x64
vp9_sad32x32 was optimized by 68% and vp9_sad64x64 was optimized by 90%
both of them gave and overall ~2.3% user level gain
Change-Id: Iccf86b375a2b54c5fbbe685902ead0c9a561b9fd
Currently, the tokens for a tile are stored immediately after its
preceding tile, which causes a dependency. This is unnecessary
since we always allocate enough memory for tokens. Removing
the dependency allows token writing done in parallel. This patch
doesn't change encoding result.
Change-Id: I7365a6e5e2c2833eb14377c37e1503c9d0f26543
This should be set right after decoder really start to decode frame
instead setting at the end.
Even decoder does not have a displayable frame to show and return NULL
to application, this should be set too.
Change-Id: If0313a834bc64e3b0f05a84f4459d444d9eab0d8
When early termination is triggered, properly reset the rate cost
to invalid value to avoid potential ioc issue.
Change-Id: I3444390be2e49a34bb02cf8a74c33d5dbd96d88d
Delete gfboost_qadjust() and move Q based adjustment
into calc_frame_boost(). Also remove clamping. Making
the adjustment here means that it influences not just the
boost level but also the selection of the GF/ARF interval.
This change gives a small average gain in PSNR but
larger gains in SSIM, especially for harder std-hd set (1.5%)
Change-Id: I3aa81b8feccaeff93d915e19fb9cf5cd64c86327
This commit fixes an ioc issue that will happen when the cumulative
variables are not in effective use. The fix discards these
redundant additions.
Change-Id: Idbac5bfb989c0cedc5f8a323effce938519b2457
this removes an assumption that worker->data1 would be pointing to a
TileWorkerData allocation.
additionally, within the multi-threaded loopfilter pass VP9LfSync as a
parameter to the worker hook, removing the need for a shadow pointer in
LFWorkerData.
Change-Id: Ic7b2faa34e3eb59dbcb8a7c67f333448fa047c88
move them from VP9Worker::data[12] to allow the structure to be reused a
bit more naturally by the multi-threaded loopfilter.
Change-Id: I31b49c9e93ca744fd7f6d6ed8696671188fb2c1d
This removes an unnecessary restriction that causes
a problem (noticed by AWG) when the forced key frame
interval is set to a very small value, such as 10. In this case
we were being forced to code minimal length GF groups.
Change-Id: I76ef5861a09638ff51f61fea02359554184ada53
We encode a empty invisible frame in front of the base layer frame to
avoid using prev_mi. Since there's a restriction for reference frame
scaling factor, we have to make it smaller and smaller gradually until
its size is 16x16.
Change remerged.
Change-Id: I9efab38bba7da86e056fbe8f663e711c5df38449
This reverts commit 452dc21500.
This change has introduced a significant quality regression on content
with forced key frames. (e.g. the YT and yt-hd set). It is most
noticeable in static content where the kf bits dominate. Here, despite
key frames being apparently coded at the same Q, there is a drop in all
metrics of ~20% (e.g clXR and BFa0).
Change-Id: Iba14cc61778c0846fa0a59c33c55a9fc49512cb4
Compare the estimated rate and distortion to the thresholds scaled
according to the operating block size and determine if further
split partition search will be run. The compression performance of
speed -5 is changed by -0.074%. The encoding speed is 10% - 15%
faster.
vidyo1 720p
16545 b/f, 40.492 dB, 11475 ms -> 16535 b/f, 40.486 dB, 10100 ms
nik720p
16624 b/f, 36.310 dB, 10071 ms -> 16617 b/f, 36.313 dB, 8346 ms
Change-Id: Ic9197ab5761279ae55d2fb7813b2af0e0db497b8
Reduce the intra_cost_penalty for non-rd mode,
and some updates to VAR_BASED_PARTITION.
Visual tests show some improvement at Speed 6, for RTC clips.
Change-Id: If9090daf7aed14906a32d931a538ab544bbca606
This commit replaces the use of copy_partitioning with
choose_partitioning based on the sse of subsamped pixels, which
provides significantly better coding performance and runs at
similar speed, as compared to copy_partitioning. It improves rtc
speed 5 coding performance by 3%.
Change-Id: I52d3682a12dce0147f5e52383a594fc242ca3228
We encode a empty invisible frame in front of the base layer frame to
avoid using prev_mi. Since there's a restriction for reference frame
scaling factor, we have to make it smaller and smaller gradually until
its size is 16x16.
Change-Id: I60b680314e33a60b4093cafc296465ee18169c19
Move the point at which input frames are scaled
into the recode loop. This will allow us to change
the coded frame size dynamically in response
to previous attempts to encode the frame at a
higher resolution.
A following patch will implement a scheme for
resizing the frame in the recode loop.
Change-Id: I6a59c02d6ac1626512edad6de8b60063b79433e6
This commit makes a struct that contains rate value, distortion
value, and the rate-distortion cost. The goal is to provide a
better interface for rate-distortion related operation. It is
first used in rd_pick_partition and saves a few RDCOST calculations.
Change-Id: I1a6ab7b35282d3c80195af59b6810e577544691f
Add back clamp which ensures that the Q adaptation
is turned off when the over_shoot_pct and under_shoot_pct
parameters are set to 100.
Change-Id: Id0161b114d39a3029cd3eb28020caab0c3914922
Allow min and maxQ to creep when the undershoot
or overshoot exceeds thresholds controlled by the
command line under_shoot_pct and over_shoot_pct
values.
Default is 100%,100% which ~disables adaptation.
Derf results for example undershoot% / overshoot%:-
Head:- Mean abs (%rate error) = 14.4%
This check in:-
25%/25% - Mean abs (%rate error) = 6.7%
PSNR hit -1% SSIM -0.1%
5% / 5% - Mean abs (%rate error) = 2.2%
PSNR hit -3.3% SSIM - 1.1%
Most of the remaining error and most of the quality hit is
at extreme data rates. The adaptation code still has an
exception for material that is in effect static so that we
don't over adjust and over spend on YT slide show type
content.
(Rebase of If25a2449a415449c150acff23df713e9598d64c9
to resolve a auto-merge error)
Change-Id: Iec4e1613ef0d067454751d8220edb7058dfbd816
Allow min and maxQ to creep when the undershoot
or overshoot exceeds thresholds controlled by the
command line under_shoot_pct and over_shoot_pct
values.
Default is 100%,100% which ~disables adaptation.
Derf results for example undershoot% / overshoot%:-
Head:- Mean abs (%rate error) = 14.4%
This check in:-
25%/25% - Mean abs (%rate error) = 6.7%
PSNR hit -1% SSIM -0.1%
5% / 5% - Mean abs (%rate error) = 2.2%
PSNR hit -3.3% SSIM - 1.1%
Most of the remaining error and most of the quality hit is
at extreme data rates. The adaptation code still has an
exception for material that is in effect static so that we
don't over adjust and over spend on YT slide show type
content.
Change-Id: If25a2449a415449c150acff23df713e9598d64c9
Function will jump to error handler when ref buffer is corrupted.
So "xd->corrupted |= ref_buffer->buf->corrupted;" is useless.
Change-Id: I35353a0637ad0dbb682454e040ef69fa68280bfa
- Some fixes to surface fit.
- Returns variance function as cost rather than sad in the
pattern search and diamond search functions. Only
vp9_pattern_search_sad function used in bigdia search
uses sad as integer 1-away costs.
- Deploys SUBPEL_TREE_PRUNED_MORE for speed 4+.
Results:
derf [Speed 3]: About +0.036% in coding efficiency without any
discernible speed loss.
derf [Speed 4]: About 2-3% faster at -0.199% loss in coding efficiency.
derf [Speed 5]: About 3-4% faster at -0.149% loss in coding efficiency.
Change-Id: I8462f94f6adb46966ca964f2bd0400977357fd63
In model_rd_for_sb function, the spatial domain SSE and variance
are checked to see if transform coefficients are quantized to 0.
Besides that, this patch adds another set of thresholds that are
much more strict. These thresholds are used to conduct a partition
block level check to measure if all its TX blocks are skippable
for YUV planes. If it is true, x->skip is set for this partition
block, and thus its mode search is terminated.
This speeds up the encoding at very low prediction error case,
such as screen sharing application. This patch covers what
rd_encode_breakout_test() does, so that function is removed.
Borg test at speed 3 shows:
For stdhd set, psnr: +0.008%, ssim: +0.014%;
For derf set, psnr: +0.018%, ssim: +0.025%.
No noticeable speed change.
Change-Id: I4e5f15cf10016a282a68e35175ff854b28195944
For input source with size that is not multiple of 8, the size is
rounded to 8 and saved in width or height, the original source sizes
are saved in crop_width and crop_height. This commit corrects the
computation of bottom and right extension amounts to use the orignal
sizes, hence crop_width and crop_height.
In addition, this commit also adds the missed initialization for
uv_crop_width and uv_crop_height.
This addresses issue #834
Change-Id: I084543ca7645a4964b88f7cf8ff668f517d3a39b
The concept:
There's too much noise in source pixels for variance and at low bitrate
the reconstructed looks nothing like the source so we have problems
getting good partitionings with either. This skirts the issue by using
a box blur scaled down version for variance calculations. To compare
against source_var_ moved keyframe to be rd based like source_var.
Change-Id: Ie3babdbfadae324b7b5a76bea192893af27f0624
This commit breaks the overly broad header files into more
targeted and smaller ones, to help better structure the system
layout.
Change-Id: I7b24559d3ea6e582cf5d9bbe8f71459f9824d71b
Fixed an encoder crash. Set skip_txfm to 0 for cases that skip_txfm
isn't calculated. Put memcpy of skip_txfm at right place.
Change-Id: Ib3b6afc1b251a85b2a853c8138fb3393f48cfef6
The functions b_width_log2 and b_height_log2 only do direct
table fetch. This commit unifies such use cases by using the
table directly and removes these functions.
Change-Id: I3103fc6ba959c1182886a2799d21b8b77c8a7b6b
Add comments on the use case of these definitions. Further reduce
the scope of header file in vp9_context_tree.h.
Change-Id: Ic4a7638e838d0ac441b64abfc56e57354c059d75
The coefficient range checking is enabled when configured with
--enable-debug --enable-coefficient-range-checking
for vpxdec to detect ill-formed input stream. This addresses the
problem raised by issue #792.
Change-Id: I3f9ea541de4dc742dd64389d6c5f543fb1c4f052
This commit fixes a buffer pointer mis-use in store_coding_context.
The compression performance for stdhd set of speed 3 is improved by
0.097%. It fixes issue 869.
Change-Id: Idc59e22035eaf39f7133ca04174894374d647ff7
This SSE2 is based on VP8 denoiser's SSE2 code. In VP8, there are
only 16x16 blocks in denoiser, while in VP9, there are 13 different
block sizes.
By adding this SSE2 code, the improvement of encoder speed is around
20%(using C code vs using SSE2 code), vary for different clips.
The unit test for VP9 denoiser is to confirm that the SSE2 code is
bit-exact with the C code. The unit test covers all block size.
Change-Id: Ic8d8ac26db4ea40a5f146b5678a065af07eaaa3d
Adjustments to the GF interval choice and minimum boost.
Adjustment to the calculation of 2 pass worst q.
Compared to 09/29 head there is metrics hit on derf of
(-0.123%,-0.191%)
Compared to the September 29 head and a baseline on
September 18 baseline the accuracy of the VBR rate control
measured on the derf set is as follows:-
Mean error % / Mean abs(error %)
Sept 18 baseline (-7.0% / 14.76%)
Sept 29 head (-15.7%, 19.8%)
This check in (-1.5% / 14.4%)
The mean undershoot is reduced slightly but the
worst case overshoot on e.g. harbour/highway is
increased. This will be addressed in a later patch.
Change-Id: Iffd9b0ab7432a131c98fbaaa82d1e5b40be72b58
It is possible that the GOLDEN reference frame is not avaiable, in
which setting the predicted mv will be associated with a residual
value of INT_MAX. This commit checks this condition before
left shift and comparison with that of ALTREF frame, to avoid
overflow issue.
Change-Id: Ib98c3149dbdd016f2fe5beaafb13f67d469dd07c
This commit adds proper initialization of segment id for variance AQ
mode in non-rd coding path. It fixes the enc/dec mismatch issue of
rt=7 with --aq-mode=1, as reported in issue #816
Change-Id: I02fa41b96345bf2e66077d5ea553f85ba800f7bb
This commit enables the encoder to skip split partition search if
the bigger block size has all non-zero quantized coefficients in low
frequency area and the total rate cost is below a certain threshold.
It logarithmatically scales the rate threshold according to the
current block size. For speed 3, the compression performance loss:
derf -0.093%
stdhd -0.066%
Local experiments show 4% - 20% encoding speed-up for speed 3.
blue_sky_1080p, 1500 kbps
51051 b/f, 35.891 dB, 67236 ms ->
50554 b/f, 35.857 dB, 59270 ms (12% speed-up)
old_town_cross_720p, 1500 kbps
14431 b/f, 36.249 dB, 57687 ms ->
14108 b/f, 36.172 dB, 46586 ms (19% speed-up)
pedestrian_area_1080p, 1500 kbps
50812 b/f, 40.124 dB, 100439 ms ->
50755 b/f, 40.118 dB, 96549 ms (4% speed-up)
mobile_calendar_720p, 1000 kbps
10352 b/f, 35.055 dB, 51837 ms ->
10172 b/f, 35.003 dB, 44076 ms (15% speed-up)
Change-Id: I412e34db49060775b3b89ba1738522317c3239c8
Incorporates the WRAPLOW macro into the non-highbitdepth transforms
to aid hardware verification between a software C model and an
intended hardware implementation though the use of the configure
options: --enable-experimental --enable-emulate-hardware.
Note that to avoid further discrepancies between the sse/sse2
implementations of the transforms and the C implementation, when the
emulate hardware option is invoked, we also disable sse/sse2/etc.
Also incudes some minor cleanups/renaming etc.
Change-Id: Ib864d8493313927d429cce402982f1c8e45b3287