The low precision 32x32 fdct has all the intermediate steps within
16-bit depth, hence allowing faster SSE2 implementation, at the
expense of larger round-trip error. It was used in the rate-distortion
optimization search loop only.
Using the low precision version, in replace of the high precision one,
affects the compression performance by about 0.7% (derf, stdhd) at
speed 0. For speed 1, it makes derf set down by only 0.017%.
Change-Id: I4e7d18fac5bea5317b91c8e7dabae143bc6b5c8b
This commit provides special handle on 16x16 inverse 2D-DCT, where
only DC coefficient is quantized to be non-zero value.
Change-Id: I7bf71be7fa13384fab453dc8742b5b50e77a277c
This allows us to increment the position at the band-level only as
we go from one band to the next; more importantly, that allows us to
use an add instead of multiply instruction, and omit the instruction
altogether if the band doesn't change from one coef to the next, thus
being slightly faster (probably more noticeable on systems where a
multiply is expensive, like arm).
Change-Id: I4343fe35b9f9a47fa00b217bdcbf5f91ff96c381
This commit brought back the shortcut implementation of 8x8/16x16
inverse 2D-DCT. When the eob <= 10, it skips the inverse transform
operations on row 4:7/4:15 in the first round. For bus_cif at 1000
kbps, this provides about 2% speed-up at speed 0.
Change-Id: I453e2d72956467d75be4ad8c04b4482ab889d572
This commit enables a special handle for the 8x8 inverse 2D-DCT,
where only DC coefficient is quantized to be non-zero. For bus_cif
at 2000 kbps, it provides about 1% speed-up at speed 0.
Change-Id: I2523222359eec26b144cf8fd4c63a4ad63b1b011
This commit makes the initialization of trellis coeff optimization
a per-plane operation, thereby eliminating the redundant steps in
encode_sby and encode_sbuv. It makes the encoder at speed 0 slightly
faster.
Change-Id: Iffe9faca6a109dafc0dd69dc7273cbdec19b17cd
Adding plane type check condition because it was always used outside of
get_tx_type_{4x4, 8x8, 16x16}.
Change-Id: I02f0bbfee8063474865bd903eb25b54d26e07230
The struct optimize_block_args is defined same as encode_b_args.
Remove this redundant definition, and use encode_b_args consistently.
Change-Id: I1703aeeb3bacf92e98a34f4355202712110173d9
The xform_quant() module is only used by inter modes, hence removing
the redundant switches therein conditioned on tx_type.
Change-Id: Ib87ce5b2f2e4cbf3ceb133a1108afa173c933a3f
When all the transform coefficients were quantized to zero, skip
the inverse transform operation. For bus_cif at 1000 kbps, the
runtime goes from 154967ms -> 149842ms, i.e., about 3% speed-up,
at speed 0.
Change-Id: Ic0a813fff5e28972d4888ee42d8747846a6c3cc6
many structures use bw and bh and they have different meanings. This cl attempts
to start this clean up and remove unneccessary 2 step look up log and then
shift operations...
also removed partition type multiple operation code in bitstream.c.
Change-Id: I7e03e552bdfc0939738e430862e3073d30fdd5db
Cycle times:
4x4: 151 to 131 cycles (15% faster)
8x8: 334 to 306 cycles (9% faster)
16x16: 1401 to 1368 cycles (2.5% faster)
32x32: 7403 to 7367 cycles (0.5% faster)
Total encode time of first 50 frames of bus @ 1500kbps (speed 0)
goes from 1min39.2 to 1min38.6, i.e. a 0.67% overall speedup.
Change-Id: I799a49460e5e3fcab01725564dd49c629bfe935f
Also inline some of the block calculations to assist the compiler to
not do silly things like calculating the same offset (or converting
between raster/transform block offset or block, mi and pixel unit)
many, many, many times.
Cycle times:
4x4: 584 -> 505 cycles (16% faster)
8x8: 1651 -> 1560 cycles (6% faster)
16x16: 7897 -> 7704 cycles (2.5% faster)
32x32: 16096 -> 15852 cycles (1.5% faster)
Overall, this saves about 0.5 seconds (1min49.8 -> 1min49.3) on the
first 50 frames of bus (speed 0) @ 1500kbps, i.e. 0.5% overall.
Change-Id: If3dd62453f8e2ab9d4ee616bc4ea956fb8874b80
Skip the inverse transform and reconstruction of inter-mode coded
blocks in the rate-distortion optimization loop, when skip_encode_sb
feature is turned on. This provides about 1% speed-up at speed 0,
and 1.5% speed-up at speed 1. No performance change in both settings.
Change-Id: I2932718bf4d007163702b61b16b6ff100cf9d007
This speed feature allows the encoder to largely remove the spatial
dependency between blocks inside a 64x64 superblock, thereby removing
the need to repeatedly encode superblocks per partition type in the
rate-distortion optimization loop.
A major challenge lies in the intra modes tested in the rate-distortion
optimization loop. The subsequent blocks do not have access to the
reconstructed boundary pixels without the intermediate coding steps.
This was resolved by using the original pixels for intra prediction
in the rd loop, followed by an appropriately designed distortion
modeling on the quantization parameters. Experiments also suggested
that the performance impact is more discernible at lower bit-rate/psnr
settings. Hence a quantizer dependent threshold is applied to deactivate
skip of block coding.
For bus_cif at 2000 kbps,
speed 0: runtime 269854ms -> 237774ms (12% speed-up) at 0.05dB
performance loss.
speed 1: runtime 65312ms -> 61536ms, (7% speed-up) at 0.04dB
performance loss.
This operation is currently turned on in settings of speed 1.
Change-Id: Ib689741dfff8dd38365d8c1b92860a3e176f56ec
The function encode_block is called only by inter-prediction modes,
hence removing the transform type branching there.
Change-Id: I34a3172e28ce2388835efd0f8781922211bff857
First 50 frames of bus @ 1500kbps (speed 0) goes from 2min12.6 to
2min11.6, i.e. 0.75% overall speedup.
Change-Id: I67054f8146e82a02b6457c51a1c8627a937e5e1e
Compute the rate-distortion cost per transformed block, and cumulate
the cost through all blocks inside a partition. This allows encoder
to detect if the cumulative rd cost is already above the best rd cost,
thereby enabling early termination in the rate-distortion optimization
search.
Change-Id: I0a856367a9a7b6dd0b466e7b767f54d5018d09ac
This should significantly speedup cost_coeffs(). Basically what the
patch does is to make the neighbour arrays padded by one item to
prevent an eob check in get_coef_context(), then it populates each
col/row scan and left/top edge coefficient with two times the same
neighbour - this prevents a single/double context branch in
get_coef_context(). Lastly, it populates neighbour arrays in pixel
order (rather than scan order), so we don't have to dereference the
scantable to get the correct neighbours.
Total encoding time of first 50 frames of bus (speed 0) at 1500kbps
goes from 2min10.1 to 2min5.3, i.e. a 2.6% overall speed increase.
Change-Id: I42bcd2210fd7bec03767ef0e2945a665b851df56
Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps
goes 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is
x86-64 only, it needs some minor modifications to be 32bit compatible,
because it uses 15 xmm registers, whereas 32bit only has 8.
Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904
Makes cost_coeffs() a lot faster:
4x4: 236 -> 181 cycles
8x8: 888 -> 588 cycles
16x16: 3550 -> 2483 cycles
32x32: 17392 -> 12010 cycles
Total encode time of first 50 frames of bus (speed 0) @ 1500kbps goes
from 2min51.6 to 2min43.9, i.e. 4.7% overall speedup.
Change-Id: I16b8d595946393c8dc661599550b3f37f5718896
Cycle timings for first 3 frames of bus (speed 0) at 1500kbps:
4x4: 298 -> 234 cycles
8x8: 1227 -> 878 cycles
16x16: 23426 -> 18134 cycles
32x32: 4906 -> 3664 cycles
Total encode time of first 50 frames of bus @ 1500kbps (speed 0) goes
from 3min0.7 to 2min51.6 seconds, i.e. 5.3% faster.
Change-Id: I68a0e1b530b0563b84a67342cca4b45146077e95
This commit enables configurable reference buffer pointer for intra
predictor. This allows later removal of spatial dependency between
blocks inside a 64x64 superblock in the rate-distortion optimization
loop.
Change-Id: I02418c2077efe19adc86e046a6b49364a980f5b1
This commit makes use of dual fdct32x32 versions for rate-distortion
optimization loop and encoding process, respectively. The one for
rd loop requires only 16 bits precision for intermediate steps.
The original fdct32x32 that allows higher intermediate precision (18
bits) was retained for the encoding process only.
This allows speed-up for fdct32x32 in the rd loop. No performance
loss observed.
Change-Id: I3237770e39a8f87ed17ae5513c87228533397cc3
Change the argument of get_uv_tx_size() to be an MBMI pointer, so that the
correct column's MBMI can be passed to the function.
Change-Id: Ied6b8ec33b77cdd353119e8fd2d157811815fc98
Code intra/inter, then comp/single, then the ref frame selection.
Use contextualization for all steps. Don't code two past frames
in comp pred mode.
Change-Id: I4639a78cd5cccb283023265dbcc07898c3e7cf95
This avoids encoding tokens for blocks that are entirely
in the UMV border. This changes the bitstream.
Change-Id: I32b4df46ac8a990d0c37cee92fd34f8ddd4fb6c9
Migrates costing changes/fixes from the rebalance expt to the head
without the expt on.
Rebased.
Change-Id: I51677d62f77ed08aca8d21a4c9a13103eb8de93f
Results:
derfraw300: +0.126%
This patch changes the coefficient tree to move the EOB to below
the ZERO node in order to save number of bool decodes.
The advantages of moving EOB one step down as opposed to two steps down
in the other parallel patch are: 1. The coef modeling based on
the One-node becomes independent of the tree structure above it, and
2. Fewer conext/counter increases are needed.
The drawback is that the potential savings in bool decodes will be
less, but assuming that 0s are much more predominant than 1's the
potential savings is still likely to be substantial.
Results on derf300: -0.237%
Change-Id: Ie784be13dc98291306b338e8228703a4c2ea2242
Proposal for tuning the residual coding by changing how the context
from previous tokens is calculated. Storing the energy class of previous
tokens instead of the token itself eases the critical path of
HW implementations.
Change-Id: I6d71d856b84518f6c88de771ddd818436f794bab
Removed one 4x4 prediction step that was unnessary in the rd loop.
Removed a unused modecosts estimate from encoder side.
Change-Id: I65221a52719d6876492996955ef04142d2752d86
1. remove prediction mode conversion
2. unified bmode, same for key and non-key frame
3. set I4X4_PRED count for pdf to 0, as I4X4_PRED is no longer
coded ever. It is determined by ref_frame and block partition
Change-Id: If5b282957c24339b241acdb9f2afef85658fe47d
This commit changed the encoding and decoding of intra blocks to be
based on transform block. In each prediction block, the intra coding
iterates thorough each transform block based on raster scan order.
This commit also fixed a bug in D135 prediction code.
TODO next:
The RD mode/txfm_size selection should take this into account when
computing RD values.
Change-Id: I6d1be2faa4c4948a52e830b6a9a84a6b2b6850f6
This patch eliminates the intermediate diff buffer usage by
combining the short idct and the add residual into one function.
The encoder can use the same code as well.
Change-Id: I296604bf73579c45105de0dd1adbcc91bcc53c22
This is a mostly-working implementation of an extra channel in the
bitstream. Configure with --enable-alpha to test. Notable TODOs:
- Add extra channel to all mismatch tests, PSNR, SSIM, etc
- Configurable subsampling
- Variable number of planes (currently always uses all 4)
- Loop filtering
- Per-plane lossless quantizer
- ARNR support
This implementation just uses the same contents as the Y channel
for the A channel, due to lack of content and general pain in
playing back 4 channel content. A later patch will use the actual
alpha channel passed in from outside the codec.
Change-Id: Ibf81f023b1c570bd84b3064e9b4b8ae52e087592
This patch eliminates the intermediate diff buffer usage by
combining the short idct and the add residual into one function.
The encoder can use the same code as well.
Change-Id: Iacfd57324fbe2b7beca5d7f3dcae25c976e67f45
This patch eliminates the intermediate diff buffer usage by
combining the short idct and the add residual into one function.
The encoder can use the same code as well.
Change-Id: Iea7976b22b1927d24b8004d2a3fddae7ecca3ba1
This patch eliminates the intermediate diff buffer usage by
combining the short idct and the add residual into one function.
The encoder can use the same code as well.
Change-Id: I4ea09df0e162591e420d869b7431c2e7f89a8c1a
Change band calculation back to simpler model based
on the order in which coefficients are coded in scan order
not the absolute coefficient positions.
With the scatter scan experiment enabled the results were
appear broadly neutral on derf (-0.028) but up a little on std-hd +0.134).
Without the scatterscan experiment on the results were up derf as well.
Change-Id: Ie9ef03ce42a6b24b849a4bebe950d4a5dffa6791
This allows removing a large number of transform size specific functions,
as well as supporting 444/alpha by routing all code through the
subsampling-aware path.
Change-Id: Ieb085cebe9f37f24fc24de179898b22abfda08a4
Creates a common encode (subtract, transform, quantize, optimize,
inverse transform, reconstruct) function for all sb sizes, including
the old 16x16 path.
Change-Id: I964dff1ea7a0a5c378046a069ad83495f54df007
Work-in-progress, not yet ready for review. TODO items:
- bitstream writing (encoder) and reading (decoder)
- decoder reconstruction
Change-Id: I5afb7284e7e0480847b47cd0097cb469433c9081
Output changes slightly because of a minor bug in (at least) the sb32x16
block2above tx16x16 tables that previously existed in vp9_blockd.c.
Change-Id: I624af28ac200a8322d64454cf05c79e9502968cc
Basic assumption: when talking about transform units, use b_; when
talking about macroblock indices, use mb_.
Change-Id: Ifd163f595d4924ff892de4eb0401ccd56dc81884
All members can be referenced from their per-plane counterparts, and
removes assumptions about 24 blocks per macroblock.
Change-Id: I593fb0715e74cd84b48facd1c9b18c3ae1185d4b
This commit moves the coeff storage from the MACROBLOCK struct to its
per-plane part. The next commit will remove the coeff member from the
BLOCK structure so that it is consistently accessed per-plane.
Also refactors vp9_sb_block_error_c and vp9_sb_uv_block_error_c to be
variable subsampling aware.
Change-Id: I18c30f87f27c3a012119b6c1970d5fa499804455
First in a series of commits making certain MACROBLOCK members
addressable per-plane. This commit also refactors the block subtraction
functions vp9_subtract_b, vp9_subtract_sby_c, etc to be
loops-over-planes and variable subsampling aware.
Change-Id: I371d092b914ae0a495dfd852ea1a3d2467be6ec3
Adds an experiment that codes an end-of-orientation symbol
for every eligible zero encountered in scan order.
This cleans out various other sub-experiments that were part
of the origiinal patch, which will be later included if found
useful.
Results are slightly positive on all sets (0.1 - 0.2% range).
Change-Id: I57765c605fefc7fb9d1b57f1b356843602abefaf
Removes the redundant dst pointers from vp9_build_inter_predictors_sb{y,uv}
and the remaining mb specific functions.
Change-Id: I7b6bf439d9394b85ea79b4fe61a3ffc1025720da
First in a series of commits moving the framebuffers pointers to
per-plane data, so that they can be indexed numerically rather than
by name.
Change-Id: I6e0d60fd4d51e6375c384eb7321776564df21775
Use in-place buffers (dst of MACROBLOCKD) for macroblock prediction.
This makes the macroblock buffer handling consistent with those of
superblock. Remove predictor buffer MACROBLOCKD.
Change-Id: Id1bcd898961097b1e6230c10f0130753a59fc6df
More specifically, remove vp9_quantize_mb*, vp9_optimize_mb*,
vp9_inverse_transform_mb* and vp9_transform_mb*. Instead, use the
generic _sb* functions that take a size argument, and call them with
BLOCK_SIZE_MB16X16.
Change-Id: I33024afea95d3a23ffbc1df7da426e4645110f29
Merge sb32x32 and sb64x64 functions; allow for rectangular sizes. Code
gives identical encoder results before and after. There are a few
macros for rectangular block sizes under the sbsegment experiment; this
experiment is not yet functional and should not yet be used.
Change-Id: I71f93b5d2a1596e99a6f01f29c3f0a456694d728
Start grouping data per-plane, as part of refactoring to support
additional planes, and chroma planes with other-than 4:2:0
subsampling.
Change-Id: Idb76a0e23ab239180c818025bae1f36f1608bb23
The patch adds the flexibility to use standard EOB based coding
on smaller block sizes and nzc based coding on larger blocksizes.
The tx-sizes that use nzc based coding and those that use EOB based
coding are controlled by a function get_nzc_used().
By default, this function uses nzc based coding for 16x16 and 32x32
transform blocks, which seem to bridge the performance gap
substantially.
All sets are now lower by 0.5% to 0.7%, as opposed to ~1.8% before.
Change-Id: I06abed3df57b52d241ea1f51b0d571c71e38fd0b
These are mostly just for experimental purposes. I saw small gains (in
the 0.1% range) when playing with this on derf.
Change-Id: Ib21eed477bbb46bddcd73b21c5c708a5b46abedc
Now that the first AC coefficient in both directions use the same DC
as their context, there no longer is a purpose in letting both have
their own band. Merging these two bands allows us to split bands for
some of the very high-frequency AC bands.
In addition, I'm redoing the banding for the 1D-ADST col/row scans. I
don't think the old banding made any sense at all (it merged the last
coefficient of the first row/col in the same band as the first two of
the second row/col), which was clearly an oversight from the band being
applied in scan-order (rather than in their actual position). Now,
coefficients at the same position will be in the same band, regardless
what scan order is used. I think this makes most sense for the purpose
of banding, which is basically "predict energy for this coefficient
depending on the energy of context coefficients" (i.e. pt).
After full re-training, together with previous patch, derf gains about
1.2-1.3%, and hd/stdhd gain about 0.9-1.0%.
Change-Id: I7a0cc12ba724e88b278034113cb4adaaebf87e0c
Pearson correlation for above or left is significantly higher than for
previous-in-scan-order (absolute values depend on position in scan, but
in general, we gain about 0.1-0.2 by using either above or left; using
both basically just makes this even better). For eob branch skipping,
we continue to use the previous token in scan order.
This helps about 0.9% on derf after re-training on a limited data set.
Full re-training and results on larger-resolution clips are pending.
Note that this commit breaks trellis, so we can probably get further
gains out of it by fixing trellis at some later point.
Change-Id: Iead68e296fc3a105cca746b5e3da9555d6010cfe
This also changes the RD search to take account of the correct block
index when searching (this is required for ADST positioning to work
correctly in combination with tx_select).
Change-Id: Ie50d05b3a024a64ecd0b376887aa38ac5f7b6af6
This patch revamps the entropy coding of coefficients to code first
a non-zero count per coded block and correspondingly remove the EOB
token from the token set.
STATUS:
Main encode/decode code achieving encode/decode sync - done.
Forward and backward probability updates to the nzcs - done.
Rd costing updates for nzcs - done.
Note: The dynamic progrmaming apporach used in trellis quantization
is not exactly compatible with nzcs. A suboptimal approach has been
used instead where branch costs are updated to account for changes
in the nzcs.
TODO:
Training the default probs/counts for nzcs
Change-Id: I951bc1e22f47885077a7453a09b0493daa77883d
Split macroblock and superblock tokenization and detokenization
functions and coefficient-related data structs so that the bitstream
layout and related code of superblock coefficients looks less like it's
a hack to fit macroblocks in superblocks.
In addition, unify chroma transform size selection from luma transform
size (i.e. always use the same size, as long as it fits the predictor);
in practice, this means 32x32 and 64x64 superblocks using the 16x16 luma
transform will now use the 16x16 (instead of the 8x8) chroma transform,
and 64x64 superblocks using the 32x32 luma transform will now use the
32x32 (instead of the 16x16) chroma transform.
Lastly, add a trellis optimize function for 32x32 transform blocks.
HD gains about 0.3%, STDHD about 0.15% and derf about 0.1%. There's
a few negative points here and there that I might want to analyze
a little closer.
Change-Id: Ibad7c3ddfe1acfc52771dfc27c03e9783e054430