this was never fleshed out in the context of VP8, for which it was
added. for VP9 it has no meaning.
Change-Id: Iba2ecc026d9e947067b96690245d337e51e26eff
Implements some of the helper functions more efficiently with
lookups rathers than branches. Modeling function is consolidated
to reduce some computations.
Also merged the two enums BLOCK_SIZE_TYPES and BlockSize into
one because there is no need to keep them separate (even though
the semantics are a little different).
No bitstream or output change.
About 0.5% speedup
Change-Id: I7d71a66e8031ddb340744dc493f22976052b8f9f
For some reason iOS builds take a really long time to sort this
function out.
It's not used anywhere so remove it.
Change-Id: Ia5c8513a0d9c7eb32641cca58ca1c1113e2dd9f4
Adding segmentation struct to vp9_seg_common.h. Struct members are from
macroblockd and VP9Common structs. Moving segmentation related constants
and enums to vp9_seg_common.h.
Change-Id: I23fabc33f11a359249f5f80d161daf569d02ec03
Independent horizontal and vertical implementations.
Requires that blocks be built from 4x4 and [xy]_step_q4 == 16
6-10% improvement. CIF improved the least.
Change-Id: I137f5ceae4440adc0960bf88e4453e55a618bcda
The function encode_block is called only by inter-prediction modes,
hence removing the transform type branching there.
Change-Id: I34a3172e28ce2388835efd0f8781922211bff857
With sf->auto_mv_step_size on it is questionable
whether sf->reduce_first_step_size is worthwhile.
At speed 2 it was not having a big impact.
Even at speed 2 sf->optimize_coefficients = 0 is not
having a big speed imapct so for now I have moved it
down into a higher speed setting.
Change-Id: I8a54de76d486ad37aabce76474889da2768b14c1
Enable SSE2 4x4 inverse ADST/DCT transform. The runtime goes from
292 cycles down to 89 cycles. Running bus_cif at 2000 kbps, the
overall runtime of speed 0 goes from 301s to 295s (2% speed-up).
Change-Id: I24098136e7fee7ab2fbf1c11755bdf2ca37f3628
Where possible, do the 16 pixel wide filter while doing the horizontal
filtering pass. The same approach can be taken for the mbloop_filter
when that's implemented. Doing so on the vertical pass is a little more
involved, but possible.
Change-Id: I010cb505e623464247ae8f67fa25a0cdac091320
This commit fixed the mis-use of the tx_type for inverse transform
in intra4x4 rate-distortion optimization loop. It improves the
overall coding performance.
Change-Id: I7fe9953175b74890357dbcee33c138573766e980
Adds a speed feature to eliminate full-rd computation if the modeled
rd or rd based on a different parameter in the same mode is already
a lot larger than the best rd yet.
Specifically, only search the sharp and smooth filters if the modeled
rd cost based on the regular filter is within a certain factor of the
best rd cost so far. Also, skip full-rd computation of non splitmv
inter modes if the modeled rd cost based on pred error is within the
same factor of the best rd cost so far.
Also adds some enhancements in the rd search for splitmv mode to
speed things up by early breakouts. Negligible impact on performance.
Resuts on derfraw300:
psnr: -0.013% with the splitmv enhancements, -0.24% with the rd
breakout feature on.
speedup: 6% with splitmv enhancements, 20% with also residual breakout
(tested on football sequence at 600 Kbps)
Change-Id: I37abc308ea9f110c1679ce649b6a7e73ab1ad5fc
This commit enables 16x16 ADST/DCT forward hybrid transform using SSE2
operations. It reduces the runtime from 5433 cycles to 1621 cycles, at
no compression performance loss.
Change-Id: I75fd7f1984e9e28846af459f810ff0d6ae125230
Encode time of first 50 frames of bus (speed 0) @ 1500kbps goes from
2min4.9 to 2min3.1, i.e. a 1.4% speedup overall.
Change-Id: I9b25e87974430cb942caa276410bb2eda815bd83
Replace case statement with lookup.
Small speed gain at low speed settings but at speed 2+ where the
number of motion searches etc. falls the impact rises to ~3-4%.
Change-Id: Idff639b7b302ee65e042b7bf836943ac0a06fad8
Change-Id: I5940719a4a161f8c26ac9a6753f1678494cec644
- The vp9 mbfilter C code will branch on flat and mask. This CL
will perform both branches and combine the data. A later CL will
perform a check to see if all patch will take one branch.
- These functions are about 1.75 times faster than the C code on
Nexus 7.
PS #3
- Changed all functions to dub limit, blimit, and thresh from
vld {dx[]}, freeing up r4-r6.
- Changed code to use vbif to reduce one instruction and free
up a d register.
Change-Id: I028dae0e434dc9891c3677bdb182e201ffb04777
This probably has a mildly negative impact on performance, but will
(in future commits - or possibly merged with this one) allow SIMD
implementations of individual intra prediction functions. We may
perhaps want to consider having separate functions per txfm-size
also (i.e. 4x4, 8x8, 16x16 and 32x32 intra prediction functions for
each intra prediction mode), but I haven't played much with that
yet.
Change-Id: Ie739985eee0a3fcbb7aed29ee6910fdb653ea269
In the rare case were 4x4 interior filtering was called for but no
8x8 or larger filtering takes place, the previous code was skipping
the filtering. This patch fixes the issue by including the interior
mask in the overall mask for the filter application loops.
Change-Id: I4a0b65056c64f97478827c2ff41e0914fc7779d0
The resulting reconstruction is never used, thus it just wastes CPU
cycles. Reduces encode time of first 50 frames of bus (speed 0) @
1500kbps from 2min2.0 to 2min1.2, i.e. a 0.65% overall speedup.
Change-Id: I74755ca3aadc21e2be220f486259060bd4088c45
Encode time for first 50 frames of bus (speed 0) @ 1500kbps goes from
2min10.9 to 2min10.5, i.e. 0.3% faster overall, basically because we
prevent the call overhead.
Change-Id: I1eab1a95dd3eae282f9b866f1f0b3dcadff073d5
Changes cost_mv_ref() into doing a LUT into pre-calculated cost
arrays instead. Encode time of first 50 frames of bus (speed 0)
@ 1500kbps goes from 2min11.6 to 2min10.9, i.e. 0.5% faster overall.
Change-Id: If186e92c34c201b29cbbc058785a15c9c09e433a
First 50 frames of bus @ 1500kbps (speed 0) goes from 2min12.6 to
2min11.6, i.e. 0.75% overall speedup.
Change-Id: I67054f8146e82a02b6457c51a1c8627a937e5e1e
Encode time of first 50 frames of bus (speed 0) @ 1500kbps goes from
2min4.9 to 2min3.1, i.e. a 1.4% speedup overall.
Change-Id: Ibe8b08d159797504c5d0c5122de1b6da3b6595e0
Overall, on all test sets, this gains about +0.2% on all metrics.
City is a clip where this really hurts (-1.0% on all metrics), I'm
not quite sure why yet. Maybe interesting to look into in the future.
Change-Id: I6f0eecb20e72f0194633270d30bf00d76d9eae78
Skips mode searches for intra and compound inter modes depending
on the best mode so far and the reference frames. The various
heuristics to be used are selected by bits from a flag. The
previous direction based intra mode search pruning is also absorbed
in this framework.
Specifically the flags and their impact are:
1) FLAG_SKIP_INTRA_BESTINTER (skip intra mode search for oblique
directional modes and TM_PRED if the best so far is
an inter mode)
derfraw300: -0.15%, 10% speedup
2) FLAG_SKIP_INTRA_DIRMISMATCH (skip D27, D63, D117 and D153
mode search if the best so far is not one of the closest
hor/vert/diagonal directions.
derfraw300: -0.05%, about 9% speedup
3) FLAG_SKIP_COMP_BESTINTRA (skip compound prediction mode
search if the best so far is an intra mode)
derfraw300: -0.06%, about 7-8% speedup
4) FLAG_SKIP_COMP_REFMISMATCH (skip compound prediction search
if the best single ref inter mode does not have the same ref
as one of the two references being tested in the compound mode)
derfraw300: -0.56%, about 10% speedup
Change-Id: I1a736cd29b36325489e7af9f32698d6394b2c495
intermediate_height for horizontal filtering must be at least 8
pixels to be able to do vertical filtering correctly. Currently
it can be less for small block and y_step_q4 sizes.
Change-Id: I2ee28b0591b2041c2fa9844d0ae2ff8a1a59cc21
This commit allows encoder to detect the cumulative rate-distortion
cost per transformed block inside a partition. If the cumulative
rd cost is already above the best rd value, it terminates the rest
operations and continue to next prediction mode test.
It reduces the runtime of bus at target bit-rate 2000 from 308 second
to 266 second, i.e., about 13% speed-up at no performance penalty.
Change-Id: I5f15a3d8955d97031d5653006027866a00654e7a
sf->unused_mode_skip_lvl. Tests modes as normal for all
sizes at or below the given level. At larger sizes it skips
all modes that were not chosen at any smaller size.
Hence setting BLOCK_SIZE_SB64X64 is in effect off.
Setting BLOCK_SIZE_AB4X4 will only consider modes that
were chosen for one or more 4x4 blocks at larger sizes.
sf->reference_masking.
Do a test encode of the NONE partition at one size and create
a reference frame mask based on the best rd choice. In the
full search only allow this reference frame.
Currently it is testing 64x64 and repeats this in the full search.
This does not work well with Jim's Partition code just now and
is disabled by default.
Change-Id: I8f8c52d2ef4a0c08100150b0ea4155d1aaab93dd
This patch allows the frame_parallel_decoding_mode flag
to be set instead of returning a codec error.
Change-Id: I4a1631c625723ac8873290d0fd0211074a87d112
This commit adds a speed feature where only squared partition are
evaluated in partition picking. Enable this feature in cpu-used 2
reduces encoding time by ~30%.
loss of compression:
-0.9% on cif set
-1.23% on stdhd
Change-Id: Ia6fad11210f0b78365abb889f9245604513be5b9
If none of the 16 coefficients that we quantize per loop iteration
are larger than the zbin, directly skip to the next round of coeffs,
rather than doing a full quantize loop that will eventually result
in 16 zeroes. This incurs a jump cost, but saves a lot of other work.
32x32 quant goes from 1349 -> 1184 cycles. The same approach yielded
no significantly positive results for smaller transforms, so is not
used there (8x8: 103 -> 101 cycles; 16x16: 302 -> 306 cycles).
Change-Id: I8fca17dc2543fc8eed1dbcd5100145e3c3a9b647
This speed feature will skip searching the directional intra prediction
modes D63, D117, D27, D153 if the best intra mode so far is not one of
the diagonal, horizontal or vertical directions closest to the respective
directions being tested. In other words, this implements a sort of
binary search in the angular domain.
Speedup: about 9-10%
Results: -0.05% only on derfraw300.
Change-Id: I413584c41f2a3e8dabfbdeb40718c8fc4b1d63a2
(1) Refines the modeling function and uses that to add some speed
features. Specifically, intead of using a flag use_largest_txfm as
a speed feature, an enum tx_size_search_method is used, of which
two of the types are USE_FULL_RD and USE_LARGESTALL. Two other
new types are added:
USE_LARGESTINTRA (use largest only for intra)
USE_LARGESTINTRA_MODELINTER (use largest for intra, and model for
inter)
(2) Another change is that the framework for deciding transform type
is simplified to use a heuristic count based method rather than
an rd based method using txfm_cache. In practice the new method
is found to work just as well - with derf only -0.01 down.
The new method is more compatible with the new framework where
certain rd costs are based on full rd and certain others are
based on modeled rd or are not computed. In this patch the existing
rd based method is still kept for use in the USE_FULL_RD mode.
In the other modes, the count based method is used.
However the recommendation is to remove it eventually since the
benefit is limited, and will remove a lot of complications in
the code
(3) Finally a bug is fixed with the existing use_largest_txfm speed feature
that causes mismatches when the lossless mode and 4x4 WH transform is
forced.
Results on derf:
USE_FULL_RD: +0.03% (due to change in the tables), 0% encode time reduction
USE_LARGESTINTRA: -0.21%, 15% encode time reduction (this one is a
pretty good compromise)
USE_LARGESTINTRA_MODELINTER: -0.98%, 22% encode time reduction
(currently the benefit of modeling is limited for txfm size selection,
but keeping this enum as a placeholder) .
USE_LARGESTALL: -1.05%, 27% encode-time reduction (same as existing
use_largest_txfm speed feature).
Change-Id: I4d60a5f9ce78fbc90cddf2f97ed91d8bc0d4f936
Uses mapping tables instead of complicated modulo/division
operations for prob mapping for forward updates.
No bit-stream or output change.
Change-Id: Ifd9ce8ac1437835c305c94f64c18273c7a68f546
Added a speed feature in speed 1 to disable splitmv for HD (>=720)
clips. Test result on stdhd set: 0.3% psnr loss and 0.07% ssim
loss. Encoding speedup is 36%.
(For reference: The test result on derf set showed 2% psnr loss
and 1.6% ssim loss. Encoding speedup is 34%. SPLITMV should be
enabled for small resolution videos.)
Change-Id: I54f72b94f506c6d404b47c42e71acaa5374d6ee6
Compute the rate-distortion cost per transformed block, and cumulate
the cost through all blocks inside a partition. This allows encoder
to detect if the cumulative rd cost is already above the best rd cost,
thereby enabling early termination in the rate-distortion optimization
search.
Change-Id: I0a856367a9a7b6dd0b466e7b767f54d5018d09ac
Remove the use of sf->comp_inter_joint_search_thresh
from the baseline speed 0. Approx +0.4% on derf.
Change-Id: Icc14db98909830f40e5ac66130d40e78d2e55c71