This commit enables forcing all coefficients zero per transformed
block, when its rate-distortion cost is lower than regular coeff
quantization.
The overall performance improvement (including its parent patch on
calculating rd cost per transformed block) at speed 1:
derf: 0.298%
yt: 0.452%
hd: 0.741%
stdhd: 0.006%
Change-Id: I66005fe0fd7af192c3eba32e02fd6d77952accb5
mode_info_context was stored as a grid of MODE_INFO structs.
The grid now constists of pointers to MODE_INFO structs. The
MODE_INFO structs are now stored as a stream (decoder only),
eliminating unnecessary copies and is a little more cache
friendly.
Change-Id: I031d376284c6eb98a38ad5595b797f048a6cfc0d
The 16x16 transform unit test suggested that the peak coefficient
value can reach 32639. This could cause potential overflow issue
in the SSSE3 implmentation of 16x16 block quantization. This commit
fixes this issue by replacing addition with saturated addition.
Change-Id: I6d5bb7c5faad4a927be53292324bd2728690717e
mode_info_context was stored as a grid of MODE_INFO structs.
The grid now constists of a pointer to a MODE_INFO struct and
a "in the image" flag. The MODE_INFO structs are now stored
as a stream, eliminating unnecessary copies and is a little
more cache friendly.
For the test clips used, the decoder performance improved
by ~4.3% (1080p) and ~9.7% (720p).
Patch Set 2: Re-encoded clips with latest. Now ~1.7% (1080p)
and 5.9% (720p).
Change-Id: I846f29e88610fce2523ca697a9a9ef2a182e9256
Updating all foreach_transformed_block_visitor functions to work with
plane block size instead of general block. Removing a lot of duplicated
code.
Change-Id: I6a9069e27528c611f5a648e1da0c5a5fd17f1bb4
This change set is intermediate. The next one will remove all repetitive
plane_bsize calculations, because it will be passed as argument to
foreach_transformed_block_visitor.
Change-Id: Ifc12e0b330e017c6851a28746b3a5460b9bf7f0b
Making foreach_transformed_block_in_plane more clear (it's not finished
yet). Using explicit tx_size variable consistently instead of
(ss_txfrm_size / 2) or (ss_txfrm_size >> 1) expression.
Change-Id: I1b9bba2c0a9f817fca72c88324bbe6004766fb7d
The low precision 32x32 fdct has all the intermediate steps within
16-bit depth, hence allowing faster SSE2 implementation, at the
expense of larger round-trip error. It was used in the rate-distortion
optimization search loop only.
Using the low precision version, in replace of the high precision one,
affects the compression performance by about 0.7% (derf, stdhd) at
speed 0. For speed 1, it makes derf set down by only 0.017%.
Change-Id: I4e7d18fac5bea5317b91c8e7dabae143bc6b5c8b
This commit provides special handle on 16x16 inverse 2D-DCT, where
only DC coefficient is quantized to be non-zero value.
Change-Id: I7bf71be7fa13384fab453dc8742b5b50e77a277c
This allows us to increment the position at the band-level only as
we go from one band to the next; more importantly, that allows us to
use an add instead of multiply instruction, and omit the instruction
altogether if the band doesn't change from one coef to the next, thus
being slightly faster (probably more noticeable on systems where a
multiply is expensive, like arm).
Change-Id: I4343fe35b9f9a47fa00b217bdcbf5f91ff96c381
This commit brought back the shortcut implementation of 8x8/16x16
inverse 2D-DCT. When the eob <= 10, it skips the inverse transform
operations on row 4:7/4:15 in the first round. For bus_cif at 1000
kbps, this provides about 2% speed-up at speed 0.
Change-Id: I453e2d72956467d75be4ad8c04b4482ab889d572
This commit enables a special handle for the 8x8 inverse 2D-DCT,
where only DC coefficient is quantized to be non-zero. For bus_cif
at 2000 kbps, it provides about 1% speed-up at speed 0.
Change-Id: I2523222359eec26b144cf8fd4c63a4ad63b1b011
This commit makes the initialization of trellis coeff optimization
a per-plane operation, thereby eliminating the redundant steps in
encode_sby and encode_sbuv. It makes the encoder at speed 0 slightly
faster.
Change-Id: Iffe9faca6a109dafc0dd69dc7273cbdec19b17cd
Adding plane type check condition because it was always used outside of
get_tx_type_{4x4, 8x8, 16x16}.
Change-Id: I02f0bbfee8063474865bd903eb25b54d26e07230
The struct optimize_block_args is defined same as encode_b_args.
Remove this redundant definition, and use encode_b_args consistently.
Change-Id: I1703aeeb3bacf92e98a34f4355202712110173d9
The xform_quant() module is only used by inter modes, hence removing
the redundant switches therein conditioned on tx_type.
Change-Id: Ib87ce5b2f2e4cbf3ceb133a1108afa173c933a3f
When all the transform coefficients were quantized to zero, skip
the inverse transform operation. For bus_cif at 1000 kbps, the
runtime goes from 154967ms -> 149842ms, i.e., about 3% speed-up,
at speed 0.
Change-Id: Ic0a813fff5e28972d4888ee42d8747846a6c3cc6
many structures use bw and bh and they have different meanings. This cl attempts
to start this clean up and remove unneccessary 2 step look up log and then
shift operations...
also removed partition type multiple operation code in bitstream.c.
Change-Id: I7e03e552bdfc0939738e430862e3073d30fdd5db
Cycle times:
4x4: 151 to 131 cycles (15% faster)
8x8: 334 to 306 cycles (9% faster)
16x16: 1401 to 1368 cycles (2.5% faster)
32x32: 7403 to 7367 cycles (0.5% faster)
Total encode time of first 50 frames of bus @ 1500kbps (speed 0)
goes from 1min39.2 to 1min38.6, i.e. a 0.67% overall speedup.
Change-Id: I799a49460e5e3fcab01725564dd49c629bfe935f
Also inline some of the block calculations to assist the compiler to
not do silly things like calculating the same offset (or converting
between raster/transform block offset or block, mi and pixel unit)
many, many, many times.
Cycle times:
4x4: 584 -> 505 cycles (16% faster)
8x8: 1651 -> 1560 cycles (6% faster)
16x16: 7897 -> 7704 cycles (2.5% faster)
32x32: 16096 -> 15852 cycles (1.5% faster)
Overall, this saves about 0.5 seconds (1min49.8 -> 1min49.3) on the
first 50 frames of bus (speed 0) @ 1500kbps, i.e. 0.5% overall.
Change-Id: If3dd62453f8e2ab9d4ee616bc4ea956fb8874b80
Skip the inverse transform and reconstruction of inter-mode coded
blocks in the rate-distortion optimization loop, when skip_encode_sb
feature is turned on. This provides about 1% speed-up at speed 0,
and 1.5% speed-up at speed 1. No performance change in both settings.
Change-Id: I2932718bf4d007163702b61b16b6ff100cf9d007
This speed feature allows the encoder to largely remove the spatial
dependency between blocks inside a 64x64 superblock, thereby removing
the need to repeatedly encode superblocks per partition type in the
rate-distortion optimization loop.
A major challenge lies in the intra modes tested in the rate-distortion
optimization loop. The subsequent blocks do not have access to the
reconstructed boundary pixels without the intermediate coding steps.
This was resolved by using the original pixels for intra prediction
in the rd loop, followed by an appropriately designed distortion
modeling on the quantization parameters. Experiments also suggested
that the performance impact is more discernible at lower bit-rate/psnr
settings. Hence a quantizer dependent threshold is applied to deactivate
skip of block coding.
For bus_cif at 2000 kbps,
speed 0: runtime 269854ms -> 237774ms (12% speed-up) at 0.05dB
performance loss.
speed 1: runtime 65312ms -> 61536ms, (7% speed-up) at 0.04dB
performance loss.
This operation is currently turned on in settings of speed 1.
Change-Id: Ib689741dfff8dd38365d8c1b92860a3e176f56ec
The function encode_block is called only by inter-prediction modes,
hence removing the transform type branching there.
Change-Id: I34a3172e28ce2388835efd0f8781922211bff857
First 50 frames of bus @ 1500kbps (speed 0) goes from 2min12.6 to
2min11.6, i.e. 0.75% overall speedup.
Change-Id: I67054f8146e82a02b6457c51a1c8627a937e5e1e
Compute the rate-distortion cost per transformed block, and cumulate
the cost through all blocks inside a partition. This allows encoder
to detect if the cumulative rd cost is already above the best rd cost,
thereby enabling early termination in the rate-distortion optimization
search.
Change-Id: I0a856367a9a7b6dd0b466e7b767f54d5018d09ac
This should significantly speedup cost_coeffs(). Basically what the
patch does is to make the neighbour arrays padded by one item to
prevent an eob check in get_coef_context(), then it populates each
col/row scan and left/top edge coefficient with two times the same
neighbour - this prevents a single/double context branch in
get_coef_context(). Lastly, it populates neighbour arrays in pixel
order (rather than scan order), so we don't have to dereference the
scantable to get the correct neighbours.
Total encoding time of first 50 frames of bus (speed 0) at 1500kbps
goes from 2min10.1 to 2min5.3, i.e. a 2.6% overall speed increase.
Change-Id: I42bcd2210fd7bec03767ef0e2945a665b851df56
Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps
goes 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is
x86-64 only, it needs some minor modifications to be 32bit compatible,
because it uses 15 xmm registers, whereas 32bit only has 8.
Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904
Makes cost_coeffs() a lot faster:
4x4: 236 -> 181 cycles
8x8: 888 -> 588 cycles
16x16: 3550 -> 2483 cycles
32x32: 17392 -> 12010 cycles
Total encode time of first 50 frames of bus (speed 0) @ 1500kbps goes
from 2min51.6 to 2min43.9, i.e. 4.7% overall speedup.
Change-Id: I16b8d595946393c8dc661599550b3f37f5718896
Cycle timings for first 3 frames of bus (speed 0) at 1500kbps:
4x4: 298 -> 234 cycles
8x8: 1227 -> 878 cycles
16x16: 23426 -> 18134 cycles
32x32: 4906 -> 3664 cycles
Total encode time of first 50 frames of bus @ 1500kbps (speed 0) goes
from 3min0.7 to 2min51.6 seconds, i.e. 5.3% faster.
Change-Id: I68a0e1b530b0563b84a67342cca4b45146077e95
This commit enables configurable reference buffer pointer for intra
predictor. This allows later removal of spatial dependency between
blocks inside a 64x64 superblock in the rate-distortion optimization
loop.
Change-Id: I02418c2077efe19adc86e046a6b49364a980f5b1
This commit makes use of dual fdct32x32 versions for rate-distortion
optimization loop and encoding process, respectively. The one for
rd loop requires only 16 bits precision for intermediate steps.
The original fdct32x32 that allows higher intermediate precision (18
bits) was retained for the encoding process only.
This allows speed-up for fdct32x32 in the rd loop. No performance
loss observed.
Change-Id: I3237770e39a8f87ed17ae5513c87228533397cc3
Change the argument of get_uv_tx_size() to be an MBMI pointer, so that the
correct column's MBMI can be passed to the function.
Change-Id: Ied6b8ec33b77cdd353119e8fd2d157811815fc98
Code intra/inter, then comp/single, then the ref frame selection.
Use contextualization for all steps. Don't code two past frames
in comp pred mode.
Change-Id: I4639a78cd5cccb283023265dbcc07898c3e7cf95
This avoids encoding tokens for blocks that are entirely
in the UMV border. This changes the bitstream.
Change-Id: I32b4df46ac8a990d0c37cee92fd34f8ddd4fb6c9
Migrates costing changes/fixes from the rebalance expt to the head
without the expt on.
Rebased.
Change-Id: I51677d62f77ed08aca8d21a4c9a13103eb8de93f
Results:
derfraw300: +0.126%
This patch changes the coefficient tree to move the EOB to below
the ZERO node in order to save number of bool decodes.
The advantages of moving EOB one step down as opposed to two steps down
in the other parallel patch are: 1. The coef modeling based on
the One-node becomes independent of the tree structure above it, and
2. Fewer conext/counter increases are needed.
The drawback is that the potential savings in bool decodes will be
less, but assuming that 0s are much more predominant than 1's the
potential savings is still likely to be substantial.
Results on derf300: -0.237%
Change-Id: Ie784be13dc98291306b338e8228703a4c2ea2242
Proposal for tuning the residual coding by changing how the context
from previous tokens is calculated. Storing the energy class of previous
tokens instead of the token itself eases the critical path of
HW implementations.
Change-Id: I6d71d856b84518f6c88de771ddd818436f794bab
Removed one 4x4 prediction step that was unnessary in the rd loop.
Removed a unused modecosts estimate from encoder side.
Change-Id: I65221a52719d6876492996955ef04142d2752d86
1. remove prediction mode conversion
2. unified bmode, same for key and non-key frame
3. set I4X4_PRED count for pdf to 0, as I4X4_PRED is no longer
coded ever. It is determined by ref_frame and block partition
Change-Id: If5b282957c24339b241acdb9f2afef85658fe47d
This commit changed the encoding and decoding of intra blocks to be
based on transform block. In each prediction block, the intra coding
iterates thorough each transform block based on raster scan order.
This commit also fixed a bug in D135 prediction code.
TODO next:
The RD mode/txfm_size selection should take this into account when
computing RD values.
Change-Id: I6d1be2faa4c4948a52e830b6a9a84a6b2b6850f6
This patch eliminates the intermediate diff buffer usage by
combining the short idct and the add residual into one function.
The encoder can use the same code as well.
Change-Id: I296604bf73579c45105de0dd1adbcc91bcc53c22
This is a mostly-working implementation of an extra channel in the
bitstream. Configure with --enable-alpha to test. Notable TODOs:
- Add extra channel to all mismatch tests, PSNR, SSIM, etc
- Configurable subsampling
- Variable number of planes (currently always uses all 4)
- Loop filtering
- Per-plane lossless quantizer
- ARNR support
This implementation just uses the same contents as the Y channel
for the A channel, due to lack of content and general pain in
playing back 4 channel content. A later patch will use the actual
alpha channel passed in from outside the codec.
Change-Id: Ibf81f023b1c570bd84b3064e9b4b8ae52e087592
This patch eliminates the intermediate diff buffer usage by
combining the short idct and the add residual into one function.
The encoder can use the same code as well.
Change-Id: Iacfd57324fbe2b7beca5d7f3dcae25c976e67f45
This patch eliminates the intermediate diff buffer usage by
combining the short idct and the add residual into one function.
The encoder can use the same code as well.
Change-Id: Iea7976b22b1927d24b8004d2a3fddae7ecca3ba1
This patch eliminates the intermediate diff buffer usage by
combining the short idct and the add residual into one function.
The encoder can use the same code as well.
Change-Id: I4ea09df0e162591e420d869b7431c2e7f89a8c1a
Change band calculation back to simpler model based
on the order in which coefficients are coded in scan order
not the absolute coefficient positions.
With the scatter scan experiment enabled the results were
appear broadly neutral on derf (-0.028) but up a little on std-hd +0.134).
Without the scatterscan experiment on the results were up derf as well.
Change-Id: Ie9ef03ce42a6b24b849a4bebe950d4a5dffa6791
This allows removing a large number of transform size specific functions,
as well as supporting 444/alpha by routing all code through the
subsampling-aware path.
Change-Id: Ieb085cebe9f37f24fc24de179898b22abfda08a4
Creates a common encode (subtract, transform, quantize, optimize,
inverse transform, reconstruct) function for all sb sizes, including
the old 16x16 path.
Change-Id: I964dff1ea7a0a5c378046a069ad83495f54df007
Work-in-progress, not yet ready for review. TODO items:
- bitstream writing (encoder) and reading (decoder)
- decoder reconstruction
Change-Id: I5afb7284e7e0480847b47cd0097cb469433c9081
Output changes slightly because of a minor bug in (at least) the sb32x16
block2above tx16x16 tables that previously existed in vp9_blockd.c.
Change-Id: I624af28ac200a8322d64454cf05c79e9502968cc
Basic assumption: when talking about transform units, use b_; when
talking about macroblock indices, use mb_.
Change-Id: Ifd163f595d4924ff892de4eb0401ccd56dc81884
All members can be referenced from their per-plane counterparts, and
removes assumptions about 24 blocks per macroblock.
Change-Id: I593fb0715e74cd84b48facd1c9b18c3ae1185d4b
This commit moves the coeff storage from the MACROBLOCK struct to its
per-plane part. The next commit will remove the coeff member from the
BLOCK structure so that it is consistently accessed per-plane.
Also refactors vp9_sb_block_error_c and vp9_sb_uv_block_error_c to be
variable subsampling aware.
Change-Id: I18c30f87f27c3a012119b6c1970d5fa499804455
First in a series of commits making certain MACROBLOCK members
addressable per-plane. This commit also refactors the block subtraction
functions vp9_subtract_b, vp9_subtract_sby_c, etc to be
loops-over-planes and variable subsampling aware.
Change-Id: I371d092b914ae0a495dfd852ea1a3d2467be6ec3
Adds an experiment that codes an end-of-orientation symbol
for every eligible zero encountered in scan order.
This cleans out various other sub-experiments that were part
of the origiinal patch, which will be later included if found
useful.
Results are slightly positive on all sets (0.1 - 0.2% range).
Change-Id: I57765c605fefc7fb9d1b57f1b356843602abefaf
Removes the redundant dst pointers from vp9_build_inter_predictors_sb{y,uv}
and the remaining mb specific functions.
Change-Id: I7b6bf439d9394b85ea79b4fe61a3ffc1025720da