Both first pass and mbgraph search use a block size of 16x16 for motion
estimation. This commit puts a limit on the motion vector range. The
effective range allows the entire 16x16 block, with the required subpel
interpolation input, to be completely outside the image border, but
not any further away from the image border.
Change-Id: Id70a5ed08be49e70959f064859d72adc7d775d08
INT64_MAX may be assigned as RDCOST when the RDCOST computation is skipped
for speed. This commit prevents INT64_MAX from being used as a real
RDCOST in the transform size decision.
Change-Id: I89a945134191bbdea1f1431ade70424ac079eaac
After the change of MI context storage, the mi_8x8[] pointer may be null for
a block outside of the image border. This commit changes the code to access
the data only after validation of mi_row and mi_col.
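The validate-then-access order can be sketched as follows. The
ModeInfo type, the mi_at() helper and its parameters are stand-ins
chosen for illustration, not the encoder's actual definitions:

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int sb_type; } ModeInfo; /* stand-in for MODE_INFO */

/* Hypothetical accessor: check mi_row/mi_col first and only then read
 * the pointer grid, so out-of-image positions are never dereferenced. */
static const ModeInfo *mi_at(ModeInfo *const *mi_grid, int mi_rows,
                             int mi_cols, int mi_stride,
                             int mi_row, int mi_col) {
  assert(mi_grid != NULL);
  if (mi_row < 0 || mi_row >= mi_rows || mi_col < 0 || mi_col >= mi_cols)
    return NULL;
  return mi_grid[mi_row * mi_stride + mi_col];
}
```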
Change-Id: I039c4eb486a228ea9d8e5f35ab9ae6717d718bf3
39c7b01d accidentally reverted the row/col initialization, which broke
the mv clamps, which depend on them for the valid motion vector
range. This commit fixes the issue.
Change-Id: Ibcce0226e0360b1ef483fe760b2e33f1af4bf494
This commit enables forcing all coefficients zero per transformed
block, when its rate-distortion cost is lower than regular coeff
quantization.
The overall performance improvement (including its parent patch on
calculating rd cost per transformed block) at speed 1:
derf: 0.298%
yt: 0.452%
hd: 0.741%
stdhd: 0.006%
Change-Id: I66005fe0fd7af192c3eba32e02fd6d77952accb5
Adds modeled functions to decide the qp for altref frames in constant q
mode similar to other functions in use in bitrate mode.
Also turns on the constrained quality mode (end-usage=2) option which
was turned off before. Basic testing shows the mode works in principle,
to cap bitrate to the target-bitrate specified, while allowing lower
bitrate depending on the cq-level specified. The mode will need to be
improved over time.
Results for constant quality vs bitrate control mode:
derfraw300/fullderfraw: +3.0% at constant quality over bitrate control.
fullstdhdraw: +4.341%
stdhdraw250: +5.361%
Change-Id: If5027c9ec66c8e88d33e47062c6cb84a07b1cda9
This commit makes the rate-distortion optimization loop evaluate
the rd costs of regular quantization and all zero coeffs, per
transformed block. It improves speed 1 compression performance:
derf: 0.245%
yt: 0.515%
For a large partition that consists of multiple transformed blocks,
this allows more flexibility to selectively force a portion of
them to be coded as all zero coeffs, as will be continued in the next
patches.
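A minimal sketch of the per-block decision, assuming a generic
Lagrangian cost J = D + lambda * R. The encoder's RDCOST macro uses its
own fixed-point scaling, which is not reproduced here, and the function
names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Generic Lagrangian cost; an assumption, not the encoder's RDCOST. */
static int64_t rd_cost(int lambda, int rate, int64_t dist) {
  assert(lambda >= 0 && rate >= 0);
  return dist + (int64_t)lambda * rate;
}

/* Returns 1 when coding this transformed block as all-zero coeffs is
 * cheaper than keeping the regularly quantized coeffs. */
static int force_zero_coeffs(int lambda, int rate_quant, int64_t dist_quant,
                             int rate_zero, int64_t dist_zero) {
  return rd_cost(lambda, rate_zero, dist_zero) <
         rd_cost(lambda, rate_quant, dist_quant);
}
```

Because the comparison is made per transformed block, a large partition
can mix zeroed and regularly quantized blocks.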
Change-Id: I211518be4179747b57375696f017d1160cc91851
The sub8x8 blocks have their own motion vector reference scheme. The
mv_pred is only used for blocks of sizes 8x8 and above, to find the
starting point for motion search.
This change does not change any coding behavior. It makes the
encoding process slightly faster. (0.5% speed-up for local test on
speed 1.)
Change-Id: I746ee6ef0eac19aa3621be014afa12be8d82cbb9
The fake token EOSB may cause an invalid memory read in pack token. This
commit reworks the loop to avoid such an invalid read.
Change-Id: I37fdfce869b44a7f90003f82a02f84c45472a457
This commit cleans up the second reference check in the
rate-distortion optimization loop of sub8x8 blocks.
Change-Id: Ife68feaa4cddbfad2878c9b44d3012788d634f97
The commit added reset of pred_mv at the beginning of each SB64x64
partition mv search, also limited the usage of pred_mv only when
search on the largest partition is already done. This is to fix
a crash at speed 1/2 encoder where an invalid mv is used in mv
search.
Change-Id: I39010177da76d054e3c90b7899a44feb2e3a5b1b
This commit enables adaptive constraint on motion search range for
smaller partitions, given the motion vectors of collocated larger
partition as a candidate initial search point.
It makes the speed 0 runtime of bus at CIF and 2000 kbps go from
167s down to 162s (3% speed-up), at a 0.01dB performance gain. In
the speed 1 settings, this makes the runtime go from 33687 ms
to 32142 ms (4.5% speed-up), at a 0.03dB performance gain.
Compression performance wise, it gains at speed 1:
derf 0.118%
yt 0.237%
hd 0.203%
stdhd 0.438%
Change-Id: Ic8b34c67810d9504a9579bef2825d3fa54b69454
Mis-merge of the following change managed to break mode order
and delete two mode options (new alt ref and near alt ref).
It also created a situation where we could test two undefined
modes off the end of the VP9_mode_order[] data structure.
"clang warnings : remove split and i4x4_pred fake modes"
"Change Id: I8ef3c*"
Initial testing on Akiyo at speed 2.
101.35 44.567 44.447 improves to
96.82 44.915 44.815
Approx 0.3-0.4db gain and 2.5% size reduction
Change-Id: Icff813e7c0778d140ad4f0eea18cf1ed203c4e34
Removes this speed feature since it is very slow and unlikely
to be used in practice. This cleanup removes a bunch of unnecessary
complications in the outer encode loop.
Change-Id: I3c66ef1ca924fbfad7dadff297c9e7f652d308a1
Propose some changes to the speed 2 settings to improve quality.
In particular, turns off the adjust_thresholds_by_speed feature
which improves results by 6%. Also removes the code for
adjust_thresholds_by_speed since it conflicts with the adaptive
rd thresh feature.
Overall, with this change speed 2 is -15.2% from speed 0 settings,
on derf, which is significantly better than -21.6% down before.
Change-Id: I6e90a563470979eb0c258ec32d6183ed7ce9a505
mode_info_context was stored as a grid of MODE_INFO structs.
The grid now consists of pointers to MODE_INFO structs. The
MODE_INFO structs are now stored as a stream (decoder only),
which eliminates unnecessary copies and is a little more cache
friendly.
Change-Id: I031d376284c6eb98a38ad5595b797f048a6cfc0d
The c code implementation of 32x32 quantization does the zbin check
of all coefficients prior to the quant/dequant loop, hence removing
the redundant zbin check inside the loop. This only affects the
c code version. SSSE3 version does not separate the zbin check out.
Change-Id: Ic197a7d61d0b25fcac3cc092987651378cb56e4e
If the current obtained distortion is very small, which happens
for static image case, we pick the current partition type without
further split checking.
This won't affect regular videos. For static videos, we got 10%~12%
encoding speed gain. PSNR was better for some clips, and worse for
others. Overall it was even.
Change-Id: If787a57bedf46fc595ca4f5ded2b0c0a69e9fdef
Thanks to Paul for the suggestions. While turning on static-thresh
for static-image videos, a big jump in bitrate was seen. In this
patch, we detect static frames in the video using first-pass
stats. Depending on the case, we disable encode breakout or reduce
the encode breakout threshold to limit the skipping.
More modifications need to be done to break the incorrect partition
picking pattern for static frames while skipping happens.
Change-Id: Ia25f47041af0f04e229c70a0185e12b0ffa6047f
A previous speed feature skipped modes not used in earlier
partitions, but this no longer worked as intended following
changes to the partition coding order and in conjunction
with some other speed features (especially speed 2 and above).
This modified mode skip feature sets a mask after the first X
modes have been tested in each partition depending on the
reference frame of the current best case.
This patch also makes some changes to the order modes are
tested to fit better with this skip functionality.
Initial testing suggests speed and rd hit count improvements
of up to 20% at speed 1. Quality results: derf -1.9%, std hd +0.23%.
Change-Id: Idd8efa656cbc0c28f06d09690984c1f18b1115e1
Sample app: vp9_spatial_scalable_encoder
vpx_codec_control extensions:
VP9E_SET_SVC
VP9E_SET_WIDTH, VP9E_SET_HEIGHT, VP9E_SET_LAYER
VP9E_SET_MIN_Q, VP9E_SET_MAX_Q
expanded buffer size for vp9_convolve
modified setting of initial width in vp9_onyx_if.c so that layer size
can be set prior to initial encode
Default number of layers set to 3 (VPX_SS_DEFAULT_LAYERS)
Number of layers set explicitly in vpx_codec_enc_cfg.ss_number_layers
Change-Id: I2c7a6fe6d665113671337032f7ad032430ac4197
The 16x16 transform unit test suggested that the peak coefficient
value can reach 32639. This could cause a potential overflow issue
in the SSSE3 implementation of 16x16 block quantization. This commit
fixes this issue by replacing addition with saturated addition.
Change-Id: I6d5bb7c5faad4a927be53292324bd2728690717e
Adds a new end-usage option for constant quality encoding in vpx. This
first version implemented for VP9, encodes all regular inter frames
using the quality specified in the --cq-level= option, while encoding
all key frames and golden/altref frames at a quality better than that.
The current performance on derfraw300 is +0.910% up from bitrate control,
but achieved without multiple recode loops per frame.
The decision for qp for each altref/golden/key frame will be improved
in subsequent patches based on better use of stats from the first pass.
Further, the qp for regular inter frames may also be varied around the
provided cq-level.
Change-Id: I6c4a2a68563679d60e0616ebcb11698578615fb3
mode_info_context was stored as a grid of MODE_INFO structs.
The grid now consists of a pointer to a MODE_INFO struct and
an "in the image" flag. The MODE_INFO structs are now stored
as a stream, which eliminates unnecessary copies and is a little
more cache friendly.
For the test clips used, the decoder performance improved
by ~4.3% (1080p) and ~9.7% (720p).
Patch Set 2: Re-encoded clips with latest. Now ~1.7% (1080p)
and 5.9% (720p).
Change-Id: I846f29e88610fce2523ca697a9a9ef2a182e9256
The 32x32 forward transform can potentially reach a peak coefficient
value close to 32700, while the rounding factor can go up to 610.
This could cause an overflow issue in the SSSE3 implementation of the
32x32 quantization process.
This commit resolves this issue by replacing the addition operations
with saturated addition operations in 32x32 block quantization.
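A scalar model of the difference, written as a sketch: a wrapping
16-bit add of 32700 + 610 would come out negative, while a saturating
add clamps at 32767. SSE2 provides this per 16-bit lane as paddsw
(_mm_adds_epi16); the helper name below is hypothetical and the exact
instructions used in the patch are not quoted here.

```c
#include <stdint.h>

/* Saturating signed 16-bit addition: the sum is computed in 32 bits and
 * clamped to [INT16_MIN, INT16_MAX] instead of wrapping around. */
static int16_t sat_add_s16(int16_t a, int16_t b) {
  const int32_t sum = (int32_t)a + (int32_t)b;
  if (sum > INT16_MAX) return INT16_MAX;
  if (sum < INT16_MIN) return INT16_MIN;
  return (int16_t)sum;
}
```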
Change-Id: Id6b98996458e16c5b6241338ca113c332bef6e70
Moves counting of mv branches to where we have a new mv, instead of after
the whole frame is summed.
Change-Id: I945d9f6d9199ba2443fe816c92d5849340d17bbd
Speed 4 fixed partition size. Use fixed size unless it does not
fit inside image, in which case use the largest size that does.
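One way to read the fallback rule, as a sketch: halve the fixed size
until it fits in the rows and columns left inside the image. Square
power-of-two sizes and a floor of 8 are assumptions made for this
example, and fit_block_size() is a hypothetical name:

```c
#include <assert.h>

/* Returns fixed_size if it fits, otherwise the largest halved size that
 * fits in the remaining rows/cols (never smaller than 8 in this sketch). */
static int fit_block_size(int fixed_size, int rows_left, int cols_left) {
  int size = fixed_size;
  assert(fixed_size >= 8);
  while (size > 8 && (size > rows_left || size > cols_left)) size >>= 1;
  return size;
}
```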
Change-Id: I250f7a80506750dd82ab355721624a1344247223
This commit fixed the potential overflow issue in the SSE2
implementation of 32x32 forward DCT. It resolved the corrupted
coded frames in the border of scenes.
Change-Id: If87eef2d46209269f74ef27e7295b6707fbf56f9
While static-thresh is on, we only need to transmit skip
flag if skip = 1. The cost of skip bit is added to the
total rate cost.
Change-Id: I64e73e482bc297eba22907026298a15fa8cc3920
Added some code to output normalized rd hit count stats.
In effect this approximates to the average number of rd
operations/tests per pixel for the sequence.
The results are not quite accurate and I have not bothered
to account for partial SB64s at frame edges and for key frames.
However they do give some idea of the number of modes /
prediction methods being tested for each pixel across the
different partition sizes. This indicates how much scope there
is for further gains either by reducing the number of partitions
examined or the modes per partition through heuristics.
Patch 3 moved the place where the count is incremented so partial rd
tests that are aborted with INT_MAX return are also counted.
Example numbers for first 50 frames of Akiyo.
Speed 0 ~84.4 rd operations / pixel
Speed 1 ~28.8
Speed 2 ~11.9
Change-Id: Ib956e787e12f7fa8b12d3a1a2f6cda19a65a6cb8
The 32x32 quantization process can potentially have the intermediate
stacks over 16-bit range, thereby causing enc/dec mismatch. This commit
fixes this overflow issue in the SSSE3 implementation, as well as the
prototype, of 32x32 quantization.
This fixes issue 607 from webm@googlecode.
Change-Id: I85635e6ca236b90c3dcfc40d449215c7b9caa806
The two arrays are typically initialized to INT64_MAX. If they are not
filled with valid values before the addition, the values can overflow
and lead to wrong results.
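A sketch of an addition that keeps the INT64_MAX sentinel intact,
assuming non-negative costs. The helper name rd_accumulate() is
hypothetical and this is not necessarily how the patch itself avoids
the overflow:

```c
#include <assert.h>
#include <stdint.h>

/* Adds two non-negative rd values; if either is the INT64_MAX sentinel
 * (or the sum would exceed it), the result stays at INT64_MAX. */
static int64_t rd_accumulate(int64_t total, int64_t delta) {
  assert(total >= 0 && delta >= 0);
  if (total > INT64_MAX - delta) return INT64_MAX;
  return total + delta;
}
```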
Change-Id: I515de22cf3e8f55af4b74bdb2c8eb821a02d3059
Switching from mi_{width, height}_log2 and b_{width, height}_log2 to
num_8x8_blocks_{wide, high} and num_4x4_blocks_{wide, high}. Removing
redundant code, adding const.
Change-Id: Iaab2207590fd24d0b76999071778d1395dc5cd5d
Incorporates a speed feature for fast forward updates of
coefficients. This feature takes 3 values:
0 - use standard 2-loop version
1 - use a 1-loop version
2 - use a 1-loop version with reduced updates
Results: derfraw300 +0.007% (on speed 0) at feature value = 1
-0.160% (on speed 0) at feature value = 2
There is substantial speed up at speeds 2 and above for low
resolution sequences where the entropy updates are a big part
of the overall computations.
Change-Id: Ie96fc50777088a5bd441288bca6111e43d03bcae
This commit resolved a mis-alignment issue in compound inter-inter
prediction of sub8x8. This patch follows solution from dkovalev@.
Change-Id: I3cc0cf7e55b84110e0c42ef4b2e6ca7ac3f8f932
In subpel_avg_variance functions, code similar to the following
punpckldq m2, [addr]
actually reads 8 bytes. For functions that are supposed to work on
buffers that have fewer than 8 bytes per line, this caused a valgrind
error of reading uninitialized memory.
Change-Id: I2a4c079dbdbc747829bd9e2ed85f0018ad2a3a34
Removing references to plane_block_width and plane_block_height (we are
going to delete the latter ones).
Change-Id: I7982da4d373aebb54d2209dc8886f6192df4d287
Previous change c4048dbd limits the mv search range assuming a max block
size of 64x64. This commit changes the search range to use the actual
block size instead.
Change-Id: Ibe07ab02b62bf64bd9f8675d2b997af20a2c7e11
Making code more compact, adding consts, removing redundant arguments,
adding do/while(0) for macros.
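For reference, the do/while(0) idiom makes a multi-statement macro
expand to a single statement that still takes its trailing semicolon,
so it is safe as the body of an unbraced if/else. The macro and
function below are made up for illustration and do not come from the
codebase:

```c
/* Hypothetical multi-statement macro wrapped in do/while(0). With plain
 * braces instead, the semicolon after the macro call would end the if
 * statement and orphan the else branch below. */
#define CLAMP_TO_RANGE(v, lo, hi) \
  do {                            \
    if ((v) < (lo)) (v) = (lo);   \
    if ((v) > (hi)) (v) = (hi);   \
  } while (0)

static int clamp_if_enabled(int value, int enabled) {
  if (enabled)
    CLAMP_TO_RANGE(value, 0, 255);
  else
    value = -1;
  return value;
}
```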
Change-Id: Ic9ec0bc58cee0910a5450b7fb8cfbf35fa9d0d16
To the source buffer to be encoded as an alt ref frame. This is to fix
the problem of using uninitialized memory in encoder.
See https://code.google.com/p/webm/issues/detail?id=605
Change-Id: I97618a2fc207e08abcf5301b734aa9e3ad695e2c
Values now carried over frame to frame.
Change to algorithm for decreasing threshold after
a hit and to max threshold (now based on speed)
Removed some old commented out code relating to
VP8 adaptive thresholds.
The impact of these changes tested on Akiyo (50 frames)
and measured in terms of unit rd hits is as follows:
Speed 0 84.36 -> 84.67
Speed 1 29.48 -> 22.22
Speed 2 11.76 -> 8.21
Speed 3 12.32 -> 7.21
Encode speed impact is broadly in line with these.
Change-Id: I5b886efee3077a11553fa950d796fd6d00c8cb19
Most of the focus so far has been on inter frames.
At high speed settings the key frame is now taking a high %
of the cycles.
This patch puts in some masking to reduce the number
of INTRA modes searched during key frame coding (as already
happens for inter frames) at higher speed settings
TODO: Develop this further with either adaptive rd thresholds
when choosing which intra modes to consider or some other
heuristic.
Impact.
At high speed settings on some clips the key frame was starting
to dominate. In a coding of the first 50 frames of AKIYO at speed
2 limiting the key frame intra modes to DC or TM_PRED resulted in
~30% overall speedup. For Bus the number was lower at ~4-5%.
Change-Id: I7bde68aee04995f9d9beb13a1902143112e341e2
Put the rectangular partition check flag change, which is made according
to the rd costs of NONE and SPLIT partition types, under the speed feature.
Change-Id: If681e1e078a8d43d86961ea4b748da5cd1b6c331
It is possible to have invalid scale factors and not access them
during decoding. Error is reported if we really try to use invalid scale
factors.
Change-Id: Ie532d3ea7325ee0c7a6ada08269f804350c80fdf
remove duplicate allocation from vp9_create_compressor, it was added to
vp9_alloc_frame_buffers in:
d5bec52 Added resizing & initialization of last frame segment map
Change-Id: I996723226a16a62aff8f9a52ac74e0b73cc98fdf
This commit changes the partition search order of superblocks from
{SPLIT, NONE, HORZ, VERT} to {NONE, SPLIT, HORZ, VERT} for
consistency with that of sub8x8 partition search. It enables the use
of early termination in partition search for all block sizes.
For ped_area_1080p 50 frames coded at 4000 kbps, it makes the runtime
go down from 844305ms -> 818003ms (3% speed-up) at speed 0.
This will further move towards making the in-search partition types
configurable, hence unifying various speed-up approaches.
Some speed 1 and 2 features are turned off during the refactoring
process, including:
disable_split_var_thresh
using_small_partition_info
Stricter constraints are applied to use_square_partition_only for
right/bottom boundary blocks. Will bring back/refine these features
subsequently. At this point, it makes derf set at speed 1 about
0.45% higher in compression performance, and 9% down in run-time.
Change-Id: I3db9f9d1d1a0d6cbe2e50e49bd9eda1cf705f37c
Adds a couple of minor fixes, which may be absorbed in Jingning's
patch. Thanks to Guillaume for pointing these out.
Also adjusts the thresholds for speed 1 and 2 to 16 and 32
respectively, to keep quality drops small.
Results:
--------
derfraw300: threshold = 16, psnr -0.082%, speedup 2-3%
threshold = 32, psnr -0.218%, speedup 5-6%
stdhdraw250: threshold = 16, psnr -0.031%, speedup 2-3%
threshold = 32, psnr -0.273%, speedup 5-6%
Change-Id: I4b11ae8296cca6c2a9f644be7e40de7c423b8330
It appears that the above/left mb_skip_coeff used during
the pick modes is left over from the previously
encoded frame. This patch initializes the flag to the default
value of zero.
Change-Id: Ida4684cc99611d6e3e82628db35ed717e28ce550
Adding set_contexts function and calling it instead of
set_contexts_on_border. Calling txfrm_block_to_raster_xy to get aoff and
loff.
Change-Id: I41897e344afd2cae1f923f4fdbe63daccf6fe80e
Check the minimum rate-distortion cost of regular quantization and
all zero coeffs cases in the sub8x8 inter prediction rd loop for
luma components. Use this as the cumulative rdcost sent to UV rd
estimation.
Change-Id: Ia4bc7700437d5e13d7cdad4cf9ae57ab036d3e97
Cleans up the switchable filter search logic. Also adds a
speed feature - a variance threshold - to disable filter search
if source variance is lower than this value.
Results: derfraw300
threshold = 16, psnr -0.238%, 4-5% speedup (tested on football)
threshold = 32, psnr -0.381%, 8-9% speedup (tested on football)
threshold = 64, psnr -0.611%, 12-13% speedup (tested on football)
threshold = 96, psnr -0.804%, 16-17% speedup (tested on football)
Based on these results, the threshold is chosen as 16 for speed 1,
32 for speed 2, 64 for speed 3 and 96 for speed 4.
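The per-speed choice above can be sketched as a lookup plus a
comparison. The function names are hypothetical, and returning 0
(feature off) for speeds not listed in the results is an assumption:

```c
/* Threshold per speed as chosen above; 0 for other speeds is an
 * assumption meaning "never skip the filter search". */
static unsigned int filter_search_var_thresh(int speed) {
  switch (speed) {
    case 1: return 16;
    case 2: return 32;
    case 3: return 64;
    case 4: return 96;
    default: return 0;
  }
}

/* Returns 1 when the switchable filter search should be skipped because
 * the source variance is lower than the threshold for this speed. */
static int skip_filter_search(unsigned int source_variance, int speed) {
  return source_variance < filter_search_var_thresh(speed);
}
```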
Change-Id: Ib630d39192773b1983d3d349b97973768e170c04
Changes to code to auto select a partition size range
based on data from spatial neighbors.
Now looks at the sb_type in each 8x8 block of above
and left SB64.
The effect on speed 1 is now weaker giving better
quality but less speed gain. Now also used in speed 2.
Change-Id: Iace33a97d5c3498dd2a9a8a4067351941abcbabc
As the pixel values beyond the image border are duplicates of pixels
on the edge, the change limits the mv search range: any mv beyond
the limits no longer produces new/different prediction values,
as the entire block with the pixels used for subpel interpolation is
outside the image border.
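A sketch of how such limits could be derived, in full-pixel units. The
struct, the function name and the interpolation margin of 4 pixels are
assumptions for this example, not the values used in the encoder:

```c
#include <assert.h>

#define INTERP_EXTEND 4 /* assumed extra pixels read by subpel filters */

typedef struct { int col_min, col_max, row_min, row_max; } MvLimits;

/* For a bsize x bsize block at (x, y) in a w x h image, the furthest
 * useful mv puts the block plus its interpolation margin just outside
 * the image; anything further repeats the same border pixels. */
static MvLimits mv_limits(int x, int y, int bsize, int w, int h) {
  MvLimits l;
  assert(bsize > 0 && w > 0 && h > 0);
  l.col_min = -(x + bsize + INTERP_EXTEND);
  l.col_max = (w - x) + INTERP_EXTEND;
  l.row_min = -(y + bsize + INTERP_EXTEND);
  l.row_max = (h - y) + INTERP_EXTEND;
  return l;
}
```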
Change-Id: I4c6fdf06e33c1cef1489f5470ce0fb4e5e01fb79
For certain partition sizes, the function pointer may not be initialized
at all. The patch prevents the call if the pointer is not set.
Change-Id: I78b8c3992b639e8799a16b3c74f0973d07b8b9ac
This commit enables early termination in the rate-distortion
optimization search loop for chroma components. When the cumulative
rd cost is above the current best value, skip the remaining per-block
transform/quantization/coeff_cost steps and continue to the next
prediction mode.
For bus_cif at 2000 kbps, the average run-time goes down from
168546ms -> 164678ms, (2% speed-up) at speed 0
36197ms -> 34465ms, (4% speed-up) at speed 1
Change-Id: I9d3043864126e62bd0166250d66b3170d520b3c0
Updating all foreach_transformed_block_visitor functions to work with
plane block size instead of general block. Removing a lot of duplicated
code.
Change-Id: I6a9069e27528c611f5a648e1da0c5a5fd17f1bb4
This change set is intermediate. The next one will remove all repetitive
plane_bsize calculations, because it will be passed as argument to
foreach_transformed_block_visitor.
Change-Id: Ifc12e0b330e017c6851a28746b3a5460b9bf7f0b
Return the distortion value in vp9_rd_pick_intra_mode_sb as sum of
dist_y and dist_uv. Remove the right shift operation on dist_uv,
and make it consistent with that of vp9_rd_pick_inter_mode_sb.
Change-Id: I9d564e242d9add38e32595d33b0e0dddb1d55e5b
This commit makes the rate-distortion optimization search of chroma
components consistent across all block sizes. It removes redundant
codes.
Change-Id: I7e76f54d045e8efdd41d84a164c71f55b484471b
VP9_COMMON is the right place for the segmentation struct because it has
global segmentation parameters, not something specific to macroblock
processing.
Change-Id: Ib9ada0c06c253996eb3b5f6cccf6a323fbbba708
Adds a speed feature to disable split partition search based on a
given threshold on the source variance. A tighter threshold derived
from the threshold provided is used to also disable horizontal and
vertical partitions.
Results on derfraw300:
threshold = 16, psnr = -0.057%, speedup ~1% (football)
threshold = 32, psnr = -0.150%, speedup ~4-5% (football)
threshold = 64, psnr = -0.570%, speedup ~10-12% (football)
Results on stdhdraw250:
threshold = 32, psnr = -0.18%, speedup is somewhat more than derf
because of a larger number of smoother blocks at higher resolution.
Based on these results, a threshold of 32 is chosen for speed 1,
and a threshold of 64 is chosen for speeds 2 and above.
Change-Id: If08912fb6c67fd4242d12a0d094783a99f52f6c6
This commit unifies the rate-distortion cost calculation process of
luma and chroma components. It allows early termination to be enabled
later in the rd search loop of chroma components, consistent with
luma pixels.
Change-Id: I2e52a7c6496176bf2a5e3ef338d34ceb8aad9b3d
Making foreach_transformed_block_in_plane more clear (it's not finished
yet). Using explicit tx_size variable consistently instead of
(ss_txfrm_size / 2) or (ss_txfrm_size >> 1) expression.
Change-Id: I1b9bba2c0a9f817fca72c88324bbe6004766fb7d
The macro block mode info context originally contained an
entry for each 16x16 macroblock. In VP9 each entry refers
to an 8x8 region not a macro block, so the naming is misleading.
This first stage clean up changes the names of 3 entries in the
structure to remove the mb_ prefix.
TODO clean up the nomenclature more widely in respect of
mbmi and bmi.
Change-Id: Ia7305c6d0cb805dfe8cdc98dad21338f502e49c6
Don't do vertical or horizontal splits if subsize < min_partition_size,
except for edge blocks where it makes sense.
Change-Id: I479aa66ba1838d227b5de8312d46be184a8d6401
Enable SSE2 implementation of high precision 32x32 forward DCT. The
intermediate stacks are of 32-bits. The run-time goes down from
32126 cycles to 13442 cycles.
Change-Id: Ib5ccafe3176c65bd6f2dbdef790bd47bbc880e56
'skippable' can remain unset and negatively affect later decisions.
This addresses one aspect of issue #599.
Change-Id: Iffdf0ac2e49ac481c27dc27c87fa546d4167bb28
Loop filter configuration doesn't belong to macroblock, so moving it from
MACROBLOCKD to VP9_COMMON. Also moving the declaration of loopfilter struct
from vp9_blockd.h to vp9_loopfilter.h.
Change-Id: I4b3e34be9623b47cda35f9b1f9951f8c5b1d5d28
Adds a new subpel motion estimation function that uses a 2-level
tree-structured decision tree to eliminate redundant computations.
It searches fewer points than iterative search (which can search
the same point multiple times) but has the same quality roughly.
This is made the default setting at speeds 0 and 1, while at
speed 2 and above only a 1-level search is used.
Also includes various cleanups for consistency and redundancy removal.
Results:
derf: +0.012% psnr
stdhd: +0.09% psnr
Speedup of about 2-3%
Change-Id: Iedde4866f5475586dea0f0ba4cb7428fba24eee9
Different partitionings were not being evaluated against
best_rd and there were unnecessary calls to RDCOST. This
could have resulted in a non-optimal partitioning being
selected.
I simplified the variables used to track the rate,
distortion and RD values throughout the function.
Change-Id: Ifa7085ee80d824e86791432a5bc6d8fea5a3e313
The low precision 32x32 fdct has all the intermediate steps within
16-bit depth, hence allowing faster SSE2 implementation, at the
expense of larger round-trip error. It was used in the rate-distortion
optimization search loop only.
Using the low precision version, in place of the high precision one,
affects the compression performance by about 0.7% (derf, stdhd) at
speed 0. For speed 1, it makes the derf set go down by only 0.017%.
Change-Id: I4e7d18fac5bea5317b91c8e7dabae143bc6b5c8b
Removing the old one bsize_from_dim_lookup. Now we have a way to determine
block size for plane using its subsampling values (ss_size_lookup). And
then we can find the number of pixels in the block (num_pels_log2_lookup).
Change-Id: I6fc981da2ae093de81741d3d78eaefed11015db9
Removes some unused code and speed features, and organizes the
interfaces for fractional mv step functions for use in new speed
features to come.
In the process a new speed feature - number of iterations per
step during the subpel search - is exposed.
No change when this parameter is set as the original value of 3.
Results:
subpel_iters_per_step = 3: baseline
subpel_iters_per_step = 2: psnr -0.067%, 1% speedup
subpel_iters_per_step = 1: psnr -0.331%, 3-4% speedup
Change-Id: I2eba8a21f6461be8caf56af04a5337257a5693a8
Adds a few pattern searches to achieve various tradeoffs
between motion estimation complexity and performance.
The search framework is unified across these searches so that a
common pattern search function is used for all. Besides it will
be easier to experiment with various patterns or combinations
thereof at different scales in the future.
The new pattern search is multi-scale and is capable of using
different patterns at different scales.
The new hex search uses 8 points at the smallest scale
and 6 points at other scales.
Two other pattern searches - big-diamond and square are
also added. Big diamond uses 4 points at the smallest scale and
8 points in diamond shape at the larger scales.
Square is very similar conceptually to the default n-step search
but is somewhat faster since it keeps only one survivor across
all scales.
Psnr/speed-up results on derf300:
hex: -1.6% psnr, 6-8% speed-up
big-diamond: -0.96% psnr, 4-5% speedup
square: -0.93% psnr, 4-5% speedup
Change-Id: I02a7ef5193f762601e0994e2c99399a3535a43d2
There was no benefit having this function. For example, inside
read_switchable_filter_type switchable filter context was calculated twice.
Change-Id: I79cd5bf95cbc0f6d8bf91a2e32289e01b18dcff1
This is in preparation for the SSE2 version of the high-precision
32x32 forward DCT which will share a lot of code with the existing
low precision version used for rate-distortion search.
Change-Id: I7084b6bdfb480b1fabb8493fb14e3f7fcc7888c0
Adds a speed feature to skip all intra modes other than
DC_PRED if the source variance is small. This feature is
made part of speed 1 and up.
Results on derf300: psnr -0.07%, speedup about 1-2%
Also uses the source variance to fine-tune the early
termination criteria when FLAG_EARLY_TERMINATE is on.
This feature is made part of speed 2 and up.
Results on derf300: psnr -0.52%, speedup about 5-7%
Change-Id: I59e38aa836557cfa5405ae706fc64815cbfe4232
This changeset allows removing the vp9_switchable_interp and
vp9_switchable_interp_map arrays and makes the code much clearer. We
still have to use these mappings, but only inside the read_interp_filter_type
and write_interp_filter_type functions.
Change-Id: I4026c6f8c4acefba6c81421b7bacbaa52cc45f50
Removing assign_and_clamp_mv function, making implementation of clamp_mv
and clamp_mv2 more clear and consistent.
Change-Id: Iecd08e1c1bf0379f8314ebe01811f8253f4ade58
Adds a function to compute source variance for various
sb_types to be used for pruning mode and partition searches.
[The existing activity measure function is currently specialized
for only 16x16 MBs and needs to be updated].
Change-Id: I22a41e6f1430184201487326fdbebb9b47e6fc24
If the partition is out of partition size range, we don't
need to process small partition information.
Change-Id: Ice9bfbbdebe1f2ef79271a3aee17de0ed4608376
use_min_partition_size and use_max_partition_size are not used
currently, and could be added back if needed later.
Change-Id: Ib22a9c06b064567a7c1d6d5445567ed77e0d3acc
The function name rd_pick_intra4x4mby_modes is confusing, so
I changed it to rd_pick_intra_sub_8x8_y_modes to better
reflect what the function does. Also added const qualifiers
to some of the input parameters and removed camel-case.
Change-Id: I23d53d4c7af5d79ed8a471acd59a09bbb47add39
This commit removes redundant arguments passing in the function of
rd_pick_reference_frame. This resolves the clang warnings about
potential use of uninitialized values.
Change-Id: Ic68f949a9f8fcd0a583786b0c75321104ea44739
Refactor the frame buffer referencing in choose_partition and make
it consistent with other places. This is meant to prevent potential
issues when we extend the reference frame buffer.
Change-Id: I5ff33ed5f671e1f4cc7049622212769a9b4578d9
The tokenize_b function is only called when output flag is on. Hence
removing the conditional branch on it therein.
Change-Id: Ib709f47f23f39ca05a695faf86fa3377f11f2dd0
This commit optimizes the tokenization and detokenization operational
flow for speed-up. It makes the coding process about 0.3% faster at
speed 0.
Change-Id: I28008df7482874e4b5f237f2d418ff82a249dd56
This commit makes the encoder skip the redundant tokenization process
in the rate-distortion optimization search loop, while updating the
entropy contexts accordingly. It makes the speed 0 encoding process
about 0.5% faster at no performance change.
Change-Id: I34a4155a0b5332afeb45c93a51c7f35a294d685c
This commit provides special handling of the 16x16 inverse 2D-DCT, where
only the DC coefficient is quantized to a non-zero value.
Change-Id: I7bf71be7fa13384fab453dc8742b5b50e77a277c
This allows us to increment the position at the band-level only as
we go from one band to the next; more importantly, that allows us to
use an add instead of multiply instruction, and omit the instruction
altogether if the band doesn't change from one coef to the next, thus
being slightly faster (probably more noticeable on systems where a
multiply is expensive, like arm).
Change-Id: I4343fe35b9f9a47fa00b217bdcbf5f91ff96c381
This commit brought back the shortcut implementation of 8x8/16x16
inverse 2D-DCT. When the eob <= 10, it skips the inverse transform
operations on row 4:7/4:15 in the first round. For bus_cif at 1000
kbps, this provides about 2% speed-up at speed 0.
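The shortcut selection can be pictured as a dispatch on eob. This is
only a sketch: treating eob == 1 as the "only DC is non-zero" case and
eob == 0 as "nothing to do" are assumptions, and the enum and function
names are hypothetical.

```c
typedef enum { IDCT_NONE, IDCT_DC_ONLY, IDCT_PARTIAL, IDCT_FULL } IdctPath;

/* Picks the cheapest inverse transform path that still covers every
 * non-zero coefficient, based on the end-of-block position. */
static IdctPath select_idct_path(int eob) {
  if (eob <= 0) return IDCT_NONE;     /* assumed: no coefficients */
  if (eob == 1) return IDCT_DC_ONLY;  /* assumed: only DC is non-zero */
  if (eob <= 10) return IDCT_PARTIAL; /* skip the lower rows in pass 1 */
  return IDCT_FULL;
}
```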
Change-Id: I453e2d72956467d75be4ad8c04b4482ab889d572
This commit enables special handling of the 8x8 inverse 2D-DCT,
where only the DC coefficient is quantized to a non-zero value. For bus_cif
at 2000 kbps, it provides about 1% speed-up at speed 0.
Change-Id: I2523222359eec26b144cf8fd4c63a4ad63b1b011
Speed feature experiment to set an upper and lower
partition size limit based on what has been seen
in spatial neighbors.
This seems to give quite reasonable speed gains in local tests
(10-15%) and when used with speed 0 the losses are small
(0.25% derf, 0.35% stdhd). However, for now I am only
enabling it on speed 1 as there may be clashes with the existing
temporal partition selection in speed 2.
Using a tighter min / max around the range derived from the
neighbors increases speed further but at the cost of a
bigger quality loss. However, I think this spatial method could
be combined with data from either the last frame or a variance
method (or both) to refine the range of minimum and maximum
partition size. I.e. consider the min and max from spatial and
temporal neighbors and the variance recommendation.
Change-Id: I1b96bf8b84368d6aad0c7aa600fe141b4f07435f
Used 3 * standard_deviation in the internal threshold calculation
instead of a fitted curve. This approximates the algorithm more
closely.
For comparison, similar tests were done. The overall psnr loss is
less than before:
1. derf set:
when static-thresh = 1, psnr loss is 0.329%;
when static-thresh = 500, psnr loss is 0.970%;
2. stdhd set:
when static-thresh = 1, psnr loss is 0.922%;
when static-thresh = 500, psnr loss is 1.307%;
Similar speedup is achieved. For example,
clip              bitrate  static-thresh  psnr    time
akiyo(cif)        500      0              48.952  5.077s(50f)
akiyo             500      500            48.866  4.169s(50f)
parkjoy(1080p)    4000     0              30.388  78.20s(30f)
parkjoy           4000     500            30.367  70.85s(30f)
sunflower(1080p)  4000     0              44.402  74.55s(30f)
sunflower         4000     500            44.414  68.69s(30f)
Change-Id: Ic78833642ce1911dbbd1cb6c899a2d7e2dfcc1f3
This option exists in VP8, and it was rewritten in VP9 to support
skipping on different partition levels. After prediction is done,
we can check if the residuals in the partition block will be all
quantized to 0. If this is true, the skip flag is set, and only
prediction data are needed in reconstruction. Based on DCT's energy
conservation property, the skipping check can be estimated in
spatial domain.
The prediction error is calculated and compared to a threshold.
The threshold is determined by the dequant values, and also
adjusted by partition sizes. To be precise, the DC and AC parts
for Y, U, and V planes are checked to decide skipping or not.
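The spatial-domain check for one plane can be sketched as below. This
is an illustration, not the code in this commit. For an orthonormal
transform the sum of squared coefficients equals the sum of squared
residuals, so if the total AC energy is below (q/2)^2, every AC
coefficient is below q/2 and rounds to zero under a plain rounding
quantizer. The thresholds, the size_scale parameter, and all names are
assumptions.

```c
#include <stdint.h>

/* Returns 1 when the residual (src - pred) of a w x h block is small
 * enough that it is assumed to quantize to all zeros. */
static int sketch_can_skip(const uint8_t *src, const uint8_t *pred, int stride,
                           int w, int h, int dc_dequant, int ac_dequant,
                           int size_scale) {
  int64_t sum = 0;
  int64_t sse = 0;
  for (int r = 0; r < h; ++r) {
    for (int c = 0; c < w; ++c) {
      const int d = src[r * stride + c] - pred[r * stride + c];
      sum += d;
      sse += d * d;
    }
  }
  const int64_t n = (int64_t)w * h;
  /* Energy of the block mean (DC part) and of everything else (AC part). */
  const int64_t dc_energy = sum * sum / n;
  const int64_t ac_energy = sse - dc_energy;
  /* Assumed thresholds: (dequant / 2)^2, scaled by a per-size factor. */
  const int64_t thresh_dc = (int64_t)dc_dequant * dc_dequant / 4 * size_scale;
  const int64_t thresh_ac = (int64_t)ac_dequant * ac_dequant / 4 * size_scale;
  return dc_energy < thresh_dc && ac_energy < thresh_ac;
}
```

A caller would run this for the Y, U, and V planes and set the skip
flag only if all three pass.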
Tests showed that:
1. derf set:
when static-thresh = 1, psnr loss is 0.666%;
when static-thresh = 500, psnr loss is 1.162%;
2. stdhd set:
when static-thresh = 1, psnr loss is 1.249%;
when static-thresh = 500, psnr loss is 1.668%;
Across different clips, the encoding speedup ranges from a few
percent to 20+% when static-thresh <= 500. For example,
clip              bitrate  static-thresh  psnr    time
akiyo(cif)        500      0              48.923  5.635s(50f)
akiyo             500      500            48.863  4.402s(50f)
parkjoy(1080p)    4000     0              30.380  77.54s(30f)
parkjoy           4000     500            30.384  69.59s(30f)
sunflower(1080p)  4000     0              44.461  85.2s(30f)
sunflower         4000     500            44.418  78.1s(30f)
Higher static-thresh values give larger speedup with larger
quality loss.
Change-Id: I857031ceb466ff314ab580ac5ec5d18542203c53
Removing unused constants, macros, and function declarations. Using
ROUND_POWER_OF_TWO macro, vp9_zero, vp9_copy where possible. Moving
#include from *.h to *.c. Merging for loops for motion vectors.
Change-Id: Ic3bf841764a2bb177128bb3a6d7aa8f68229cd13
Simplified the code that extracts and uses the motion
vectors for the 4 sub-partitions in rd_pick_partition.
Change-Id: Iaf698ef7ee3aef9edd59015e1ae065dd359b17d9
This commit makes the initialization of trellis coeff optimization
a per-plane operation, thereby eliminating the redundant steps in
encode_sby and encode_sbuv. It makes the encoder at speed 0 slightly
faster.
Change-Id: Iffe9faca6a109dafc0dd69dc7273cbdec19b17cd
The feature that uses small partition results as a measure to skip
mode evaluation at larger partition requires the flags to be reset.
The reset was missing in the code path that calls rd_use_partition().
Change-Id: Ia0a3a0aee1a862b6e2333d596808db7c48033d50
Adding the plane type check inside get_tx_type_{4x4, 8x8, 16x16}
because callers always performed it outside these functions.
Change-Id: I02f0bbfee8063474865bd903eb25b54d26e07230
Although local copies of the mode member variables
(mode, ref_frame) were made, they were not used in
all places. Also, made a local copy of the
second_ref_frame member.
Change-Id: I84d8c822e5cb3d8a02fc3de8a4037ca3fea8bfad
Prevents doing duplicate IDCTs; encoding of first 50 frames of bus
(speed 0) @ 1500kbps goes from 1min4.0 to 1min3.5, i.e. 0.87% faster
overall.
Change-Id: I2df39e29ed9d5ea5e7d2704a34940ba622832ddd
Encoding time of first 50 frames of bus (speed 0) @ 1500kbps goes from
1min5.4 to 1min4.0, i.e. 2.2% faster overall.
Change-Id: I8c32f2aff9a649ce7dd49d910dc5ba16b99c3bc6
Counts are separate from frame context. We have several frame contexts but
need only one copy of all counts.
Change-Id: I5279b0321cb450bbea7049adaa9275306a7cef7d
The struct optimize_block_args is defined identically to encode_b_args.
Remove this redundant definition, and use encode_b_args consistently.
Change-Id: I1703aeeb3bacf92e98a34f4355202712110173d9
All filters are interpolating now, so we don't need this array: every
value in it evaluates to true.
Change-Id: I9af6d8219ae0eb984063cd15e4e2296374ae4961
The xform_quant() module is only used by inter modes, hence removing
the redundant switches therein conditioned on tx_type.
Change-Id: Ib87ce5b2f2e4cbf3ceb133a1108afa173c933a3f
When all the transform coefficients were quantized to zero, skip
the inverse transform operation. For bus_cif at 1000 kbps, the
runtime goes from 154967ms -> 149842ms, i.e., about 3% speed-up,
at speed 0.
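A small sketch of how this check could sit in front of the inverse
transform, together with the DC-only and eob <= 10 shortcuts described
in other commits in this log. The enum and function names are
hypothetical, and eob is assumed to be the position just past the last
non-zero coefficient in scan order.

```c
typedef enum {
  SK_IDCT_NONE,    /* all coefficients are zero: nothing to add */
  SK_IDCT_DC_ONLY, /* only the DC coefficient is non-zero */
  SK_IDCT_PARTIAL, /* few coefficients: reduced first pass */
  SK_IDCT_FULL
} SketchIdctPath;

/* Chooses the cheapest inverse transform path for a given eob. */
static SketchIdctPath sketch_select_idct(int eob) {
  if (eob == 0) return SK_IDCT_NONE;
  if (eob == 1) return SK_IDCT_DC_ONLY;
  if (eob <= 10) return SK_IDCT_PARTIAL;
  return SK_IDCT_FULL;
}
```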
Change-Id: Ic0a813fff5e28972d4888ee42d8747846a6c3cc6
Many structures use bw and bh, and they have different meanings. This CL
starts cleaning this up and removes unnecessary two-step operations that
look up a log value and then shift...
Also removed the partition type multiple operation code in bitstream.c.
Change-Id: I7e03e552bdfc0939738e430862e3073d30fdd5db
Renamed:
MAX_MB_SEGMENTS to MAX_SEGMENTS
MB_SEG_TREE_PROBS to SEG_TREE_PROBS
The minimum unit for segmentation in the segment map
is now 8x8, so the MB_ prefix is misleading: macro-block
traditionally refers to a 16x16 region.
Change-Id: I0b55a6f0426bb46dd13435fcfa5bae0a30a7fa22
Moving common encoder/decoder code to update_tx_counts. Also renaming
vp9_get_pred_probs_tx_size to get_tx_probs2 and adding get_tx_probs to
call vp9_get_pred_context_tx_size inside read_selected_tx_size only once
(twice before).
Change-Id: Ia50247f3893de88ef8e9041b0d44be44a40aaa4d
Optional change in diamond search to continue in the best move
direction until that move turns worse.
This is still WIP since the exact way the new method is to be used is
under investigation. One option is to make it an option in diamond
search and use it only when motion is large.
Overall slightly positive on derfraw300 (+0.02%) and stdhdraw (+0.13%),
but works a lot better for high-motion sequences (e.g. football: +1%).
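The "continue in the best direction" idea can be sketched like this.
It is a simplified illustration with four neighbors and a caller
supplied cost function, not the search code from this change; all names
and the max_moves safety limit are hypothetical.

```c
typedef struct { int row, col; } SketchMv;
typedef long (*SketchCostFn)(SketchMv mv);

/* Example cost: squared distance to a hypothetical best match at (0, 12). */
static long sketch_example_cost(SketchMv mv) {
  const long dr = mv.row;
  const long dc = mv.col - 12;
  return dr * dr + dc * dc;
}

/* Tests the four neighbors at distance `step`, then keeps moving in the
 * best direction until the cost stops improving. */
static SketchMv sketch_directional_refine(SketchMv start, int step,
                                          SketchCostFn cost, int max_moves) {
  static const SketchMv dirs[4] = { { -1, 0 }, { 1, 0 }, { 0, -1 }, { 0, 1 } };
  SketchMv best = start;
  long best_cost = cost(start);
  int best_dir = -1;
  for (int d = 0; d < 4; ++d) {
    SketchMv cand = { start.row + step * dirs[d].row,
                      start.col + step * dirs[d].col };
    const long c = cost(cand);
    if (c < best_cost) {
      best_cost = c;
      best = cand;
      best_dir = d;
    }
  }
  for (int moves = 0; best_dir >= 0 && moves < max_moves; ++moves) {
    SketchMv cand = { best.row + step * dirs[best_dir].row,
                      best.col + step * dirs[best_dir].col };
    const long c = cost(cand);
    if (c >= best_cost) break; /* the move turned worse: stop */
    best_cost = c;
    best = cand;
  }
  return best;
}
```

With large motion, the inner loop replaces several full diamond
iterations with single-candidate checks along one direction.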
Change-Id: If88e01a6021daa0cda934680cdc70be1ee04f798
Stack the rate-distortion statistics in the sub8x8 rd loop. This allows
the encoder to skip the forward transform, quantization, and coeff cost
estimation, in the sub8x8 rd optimization search, if the motion
vector(s) are of integer pixel value, and have been tested in the
previous prediction filter type rd loops of the same block.
This gives about 2% speed-up for bus_cif at 2000 kbps, for speed 0.
Its efficacy depends on how frequently the motion search will select an
integer motion vector.
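A minimal sketch of the reuse, assuming motion vectors in 1/8-pel units
and assuming that an integer-pel vector yields the same prediction for
every interpolation filter. The cache struct, the stand-in cost
function, and the call counter are hypothetical.

```c
#include <stdint.h>

typedef struct {
  int valid;
  int mv_row, mv_col;
  int rate;
  int64_t dist;
} SketchRdCache;

static int sketch_transform_calls = 0;

/* Stand-in for forward transform + quantization + coeff cost estimation. */
static void sketch_expensive_rd(int mv_row, int mv_col, int *rate,
                                int64_t *dist) {
  ++sketch_transform_calls;
  *rate = 100 + (mv_row & 7) + (mv_col & 7);
  *dist = 1000;
}

/* Assumes 1/8-pel units: the low three bits hold the fractional part. */
static int sketch_is_integer_mv(int mv_row, int mv_col) {
  return ((mv_row | mv_col) & 7) == 0;
}

/* Called once per filter type; reuses stored stats for integer vectors. */
static void sketch_rd_with_cache(SketchRdCache *cache, int mv_row, int mv_col,
                                 int *rate, int64_t *dist) {
  const int integer_mv = sketch_is_integer_mv(mv_row, mv_col);
  if (integer_mv && cache->valid && cache->mv_row == mv_row &&
      cache->mv_col == mv_col) {
    *rate = cache->rate;
    *dist = cache->dist;
    return;
  }
  sketch_expensive_rd(mv_row, mv_col, rate, dist);
  if (integer_mv) {
    cache->valid = 1;
    cache->mv_row = mv_row;
    cache->mv_col = mv_col;
    cache->rate = *rate;
    cache->dist = *dist;
  }
}
```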
Change-Id: Iee15d4283ad4adea05522c1d40b198b127e6dd97
Mode search order in rd loop changed to better reflect
observed hit counts.
Also some adjustment of the baseline mode rd thresholds
to reflect the order change and observed frequencies.
Change-Id: I47a131cc83e11551df8add6d6d8d413d78d3a63c
When CONFIG_POSTPROC is set, there was a now-invalid
reference to cm->filter_level.
Changed it to cpi->mb.e_mbd.lf.filter_level, in line with
change Iaf5fb71c33719cdfa1b991f671caf071be9ea035
Change-Id: If746e60044903f7ba8d0d346225b3d015226c7d0
This commit allows the encoder to skip a few buffer update steps in
rd_pick_best_mbsegmentation, when early breakout has been triggered
in the rd_check_segment_txsize. It provides about 1% speed-up for
bus_cif at 2000 kbps, in the settings of speed 0.
Change-Id: Ica034f10a24dec572b397d8389a2b81020ebc0b9
At speed 2, due to the threshold scheme used, the rate and distortion
may be assigned the value INT_MAX. This patch adds checks to prevent
INT_MAX from being used in further calculation of RD scores. It also
changes the assertion in rd_use_partition() to mirror the similar
assertion in rd_pick_partition().
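The kind of guard described here can be sketched as follows. The cost
formula, its parameters, and the function name are assumptions for
illustration; the point is only that a sentinel rate or distortion
must map to an "invalid" score instead of entering the arithmetic.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical RD score: rate weighted by rd_mult (8-bit fraction) plus
 * distortion weighted by rd_div. Returns INT64_MAX for sentinel inputs. */
static int64_t sketch_rd_cost(int rd_mult, int rd_div, int rate,
                              int64_t dist) {
  if (rate == INT_MAX || dist == INT_MAX) return INT64_MAX;
  return ((128 + (int64_t)rate * rd_mult) >> 8) + dist * rd_div;
}
```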
Change-Id: Idb52c543cc1e10abdf6e6a5d6e9cb535a42214dc
Adding loopfilter struct with fields from MACROBLOCKD and VP9Common.
Eventually it will be moved to vp9_loopfilter.h for better code structure.
Change-Id: Iaf5fb71c33719cdfa1b991f671caf071be9ea035
This patch modifies the auto_mv_step_size speed feature to
use a combination of the maximum magnitude mv from the last
inter frame, and the maximum magnitude mv for the two reference
mvs with the same reference. For arf frames, the max mv step
for the resolution is used.
The bounds therefore are slightly tighter. The feature is made
a speed 1 feature.
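A sketch of how such a bound might be combined. The message does not
spell out the exact combination, so taking the maximum of the three
magnitudes is an assumption, as are all names and the choice of the
larger absolute component as the magnitude.

```c
#include <stdlib.h>

typedef struct { int row, col; } SketchMv;

/* Magnitude taken as the larger absolute component (an assumption). */
static int sketch_mv_magnitude(SketchMv mv) {
  const int r = abs(mv.row);
  const int c = abs(mv.col);
  return r > c ? r : c;
}

/* Returns the search bound for the current block. */
static int sketch_search_bound(int is_arf, int max_for_resolution,
                               int last_frame_max_mag, SketchMv ref0,
                               SketchMv ref1) {
  if (is_arf) return max_for_resolution; /* arf: use the resolution max */
  int bound = last_frame_max_mag;
  const int m0 = sketch_mv_magnitude(ref0);
  const int m1 = sketch_mv_magnitude(ref1);
  if (m0 > bound) bound = m0;
  if (m1 > bound) bound = m1;
  return bound < max_for_resolution ? bound : max_for_resolution;
}
```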
Rebased.
Results (when this feature is turned on over speed 0):
derfraw300: -0.046% psnr, about 5+% speedup
(tested on football: goes from 4m30.760s to 4m17.410s).
Change-Id: If492797a61b0b4b3e58c0b8f86afb880165fc9f6