Adds a speed feature to eliminate full-rd computation if the modeled
rd or rd based on a different parameter in the same mode is already
a lot larger than the best rd yet.
Specifically, only search the sharp and smooth filters if the modeled
rd cost based on the regular filter is within a certain factor of the
best rd cost so far. Also, skip full-rd computation of non splitmv
inter modes if the modeled rd cost based on pred error is within the
same factor of the best rd cost so far.
Also adds some enhancements in the rd search for splitmv mode to
speed things up by early breakouts. Negligible impact on performance.
Resuts on derfraw300:
psnr: -0.013% with the splitmv enhancements, -0.24% with the rd
breakout feature on.
speedup: 6% with splitmv enhancements, 20% with also residual breakout
(tested on football sequence at 600 Kbps)
Change-Id: I37abc308ea9f110c1679ce649b6a7e73ab1ad5fc
This commit enables 16x16 ADST/DCT forward hybrid transform using SSE2
operations. It reduces the runtime from 5433 cycles to 1621 cycles, at
no compression performance loss.
Change-Id: I75fd7f1984e9e28846af459f810ff0d6ae125230
Replace case statement with lookup.
Small speed gain at low speed settings but at speed 2+ where the
number of motion searches etc. falls the impact rises to ~3-4%.
Change-Id: Idff639b7b302ee65e042b7bf836943ac0a06fad8
Change-Id: I5940719a4a161f8c26ac9a6753f1678494cec644
- The vp9 mbfilter C code will branch on flat and mask. This CL
will perform both branches and combine the data. A later CL will
perform a check to see if all patch will take one branch.
- These functions are about 1.75 times faster than the C code on
Nexus 7.
PS #3
- Changed all functions to dub limit, blimit, and thresh from
vld {dx[]}, freeing up r4-r6.
- Changed code to use vbif to reduce one instruction and free
up a d register.
Change-Id: I028dae0e434dc9891c3677bdb182e201ffb04777
This probably has a mildly negative impact on performance, but will
(in future commits - or possibly merged with this one) allow SIMD
implementations of individual intra prediction functions. We may
perhaps want to consider having separate functions per txfm-size
also (i.e. 4x4, 8x8, 16x16 and 32x32 intra prediction functions for
each intra prediction mode), but I haven't played much with that
yet.
Change-Id: Ie739985eee0a3fcbb7aed29ee6910fdb653ea269
In the rare case were 4x4 interior filtering was called for but no
8x8 or larger filtering takes place, the previous code was skipping
the filtering. This patch fixes the issue by including the interior
mask in the overall mask for the filter application loops.
Change-Id: I4a0b65056c64f97478827c2ff41e0914fc7779d0
Encode time for first 50 frames of bus (speed 0) @ 1500kbps goes from
2min10.9 to 2min10.5, i.e. 0.3% faster overall, basically because we
prevent the call overhead.
Change-Id: I1eab1a95dd3eae282f9b866f1f0b3dcadff073d5
intermediate_height for horizontal filtering must be at least 8
pixels to be able to do vertical filtering correctly. Currently
it can be less for small block and y_step_q4 sizes.
Change-Id: I2ee28b0591b2041c2fa9844d0ae2ff8a1a59cc21
(1) Refines the modeling function and uses that to add some speed
features. Specifically, intead of using a flag use_largest_txfm as
a speed feature, an enum tx_size_search_method is used, of which
two of the types are USE_FULL_RD and USE_LARGESTALL. Two other
new types are added:
USE_LARGESTINTRA (use largest only for intra)
USE_LARGESTINTRA_MODELINTER (use largest for intra, and model for
inter)
(2) Another change is that the framework for deciding transform type
is simplified to use a heuristic count based method rather than
an rd based method using txfm_cache. In practice the new method
is found to work just as well - with derf only -0.01 down.
The new method is more compatible with the new framework where
certain rd costs are based on full rd and certain others are
based on modeled rd or are not computed. In this patch the existing
rd based method is still kept for use in the USE_FULL_RD mode.
In the other modes, the count based method is used.
However the recommendation is to remove it eventually since the
benefit is limited, and will remove a lot of complications in
the code
(3) Finally a bug is fixed with the existing use_largest_txfm speed feature
that causes mismatches when the lossless mode and 4x4 WH transform is
forced.
Results on derf:
USE_FULL_RD: +0.03% (due to change in the tables), 0% encode time reduction
USE_LARGESTINTRA: -0.21%, 15% encode time reduction (this one is a
pretty good compromise)
USE_LARGESTINTRA_MODELINTER: -0.98%, 22% encode time reduction
(currently the benefit of modeling is limited for txfm size selection,
but keeping this enum as a placeholder) .
USE_LARGESTALL: -1.05%, 27% encode-time reduction (same as existing
use_largest_txfm speed feature).
Change-Id: I4d60a5f9ce78fbc90cddf2f97ed91d8bc0d4f936
This should significantly speedup cost_coeffs(). Basically what the
patch does is to make the neighbour arrays padded by one item to
prevent an eob check in get_coef_context(), then it populates each
col/row scan and left/top edge coefficient with two times the same
neighbour - this prevents a single/double context branch in
get_coef_context(). Lastly, it populates neighbour arrays in pixel
order (rather than scan order), so we don't have to dereference the
scantable to get the correct neighbours.
Total encoding time of first 50 frames of bus (speed 0) at 1500kbps
goes from 2min10.1 to 2min5.3, i.e. a 2.6% overall speed increase.
Change-Id: I42bcd2210fd7bec03767ef0e2945a665b851df56