(1) Refines the modeling function and uses that to add some speed
features. Specifically, intead of using a flag use_largest_txfm as
a speed feature, an enum tx_size_search_method is used, of which
two of the types are USE_FULL_RD and USE_LARGESTALL. Two other
new types are added:
USE_LARGESTINTRA (use largest only for intra)
USE_LARGESTINTRA_MODELINTER (use largest for intra, and model for
inter)
(2) Another change is that the framework for deciding transform type
is simplified to use a heuristic count based method rather than
an rd based method using txfm_cache. In practice the new method
is found to work just as well - with derf only -0.01 down.
The new method is more compatible with the new framework where
certain rd costs are based on full rd and certain others are
based on modeled rd or are not computed. In this patch the existing
rd based method is still kept for use in the USE_FULL_RD mode.
In the other modes, the count based method is used.
However the recommendation is to remove it eventually since the
benefit is limited, and will remove a lot of complications in
the code
(3) Finally a bug is fixed with the existing use_largest_txfm speed feature
that causes mismatches when the lossless mode and 4x4 WH transform is
forced.
Results on derf:
USE_FULL_RD: +0.03% (due to change in the tables), 0% encode time reduction
USE_LARGESTINTRA: -0.21%, 15% encode time reduction (this one is a
pretty good compromise)
USE_LARGESTINTRA_MODELINTER: -0.98%, 22% encode time reduction
(currently the benefit of modeling is limited for txfm size selection,
but keeping this enum as a placeholder) .
USE_LARGESTALL: -1.05%, 27% encode-time reduction (same as existing
use_largest_txfm speed feature).
Change-Id: I4d60a5f9ce78fbc90cddf2f97ed91d8bc0d4f936
This should significantly speedup cost_coeffs(). Basically what the
patch does is to make the neighbour arrays padded by one item to
prevent an eob check in get_coef_context(), then it populates each
col/row scan and left/top edge coefficient with two times the same
neighbour - this prevents a single/double context branch in
get_coef_context(). Lastly, it populates neighbour arrays in pixel
order (rather than scan order), so we don't have to dereference the
scantable to get the correct neighbours.
Total encoding time of first 50 frames of bus (speed 0) at 1500kbps
goes from 2min10.1 to 2min5.3, i.e. a 2.6% overall speed increase.
Change-Id: I42bcd2210fd7bec03767ef0e2945a665b851df56
Encode time of bus (speed 0) 50 frames @ 1500kbps goes from 2min14.4 to
2min10.1, i.e. a 2.3% overall speed increase.
Change-Id: I3699580e74ec26c7d24e03681bc47ba25ee1ee87
Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps
goes 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is
x86-64 only, it needs some minor modifications to be 32bit compatible,
because it uses 15 xmm registers, whereas 32bit only has 8.
Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904
since:
92479d9 Make update_partition_context faster
fixes:
vp9/common/vp9_blockd.h:408:22: error:
non-constant-expression cannot be narrowed from type 'int' to 'char' in
initializer list [-Wc++11-narrowing]
char pcvalue[2] = {~(0xe << boffset), ~(0xf <<boffset)};
^~~~~~~~~~~~~~~~~
Change-Id: Id5b00b9a72d00a2b314081a23879bd1fa3ce983b
This commit enables SSE2 4x4 foward hybrid transform. The runtime
goes from 249 cycles down to 74 cycles. Overall around 2% speed-up
at no compression performance change.
Change-Id: Iad4d526346e05c7be896466c05500711bb763660
Makes cost_coeffs() a lot faster:
4x4: 236 -> 181 cycles
8x8: 888 -> 588 cycles
16x16: 3550 -> 2483 cycles
32x32: 17392 -> 12010 cycles
Total encode time of first 50 frames of bus (speed 0) @ 1500kbps goes
from 2min51.6 to 2min43.9, i.e. 4.7% overall speedup.
Change-Id: I16b8d595946393c8dc661599550b3f37f5718896
Adding CHECK_MEM_ERROR macro to vp9_common.h and removing two duplicated
ones from vp9_onyx_int.h and vp9_onyxd_int.h.
Change-Id: I916afec61b3019f18193135dac7c35ed0f89b8b6
This commit replaces zrun_zbin_boost, a method of biasing non-zero
coefficients following runs of zero-coefficients to be rounded towards
zero, with an explicit skip-block choice in the RD loop.
The logic is basically that if individual coefficients should be rounded
towards zero (from a RD point of view), the trellis/optimize loop should
take care of it. If whole blocks should be zero (from a RD point of
view), a single RD check is much more efficient than a complete
serialization of the quantization loop.
Quality change: derf +0.5% psnr, +1.6% ssim; yt +0.6% psnr, +1.1% ssim.
SIMD for quantize will follow in a separate patch. Results for other
test sets pending.
Change-Id: Ife5fa641163ac5150ac428011e87188f1937c1f4
Using vp9_set_pred_flag function instead of custom code, adding
decode_tokens function which is now called from decode_atom,
decode_sb_intra, and decode_sb.
Change-Id: Ie163a7106c0241099da9c5fe03069bd71f9d9ff8
- Added vp9_loop_filter_horizontal_edge_neon and
vp9_loop_filter_vertical_edge_neon.
- The functions are based off the vp8 loopfilter
functions.
- Matches x86 md5 checksum.
Change-Id: Id1c4dddb03584227e5ecd29f574a6ac27738fdd0
This commit enables configurable reference buffer pointer for intra
predictor. This allows later removal of spatial dependency between
blocks inside a 64x64 superblock in the rate-distortion optimization
loop.
Change-Id: I02418c2077efe19adc86e046a6b49364a980f5b1
Use vpx_memset for updating the partition contexts. Thanks to Noah
for pointing out the need of refactoring in this part.
Change-Id: I67fb78429d632298f1cd8a0be346cc76f79392a6
Makes first 50 frames of bus @ 1500kbps encode from 3min22.7 to 3min18.2,
i.e. 2.3% faster. In addition, use the sub_pixel_avg functions to calc
the variance of the averaging predictor. This is slightly suboptimal
because the function is subpixel-position-aware, but it will (at least
for the SSE2 version) not actually use a bilinear filter for a full-pixel
position, thus leading to approximately the same performance compared to
if we implemented an actual average-aware full-pixel variance function.
That gains another 0.3 seconds (i.e. encode time goes to 3min17.4), thus
leading to a total gain of 2.7%.
Change-Id: I3f059d2b04243921868cfed2568d4fa65d7b5acd
This commit makes use of the butterfly structure to enable the sse2
version implementation of 8x8 ADST/DCT hybrid transform coding.
The runtime of hybrid transform module goes down from 1170 cycles
to 245 cycles. Overall speed-up around 1.5%.
Change-Id: Ic808ffd21ece8a9d0410d8c0243d7b6c28ac3b3f
This reduced the size of the MODE_INFO array (mip and prev_mip)
by 425,568 bytes each for 1080p resolutions.
Change-Id: Ifa513ec2d0a49e8ec0867ec90620762fb7f1261d
For cases where there's no transform set in bit 0 (the left edge of
the SB) but bit 0 of mask_4x4_int is set (the edge 4 pixels from the
left edge needs filtering), it was incorrectly being skipped before.
This situation only happens on the leftmost edge of the image, as
the edge at column 0 is intentionally skipped since there aren't
pixels to the left to read.
Change-Id: Ib2fbbcb40166e90af31b1a0e13b85b68c226cbd3
Change vp9_block_error() to return a 64bit error variable, change all
callers to expect a 64bit return value (this will prevent overflows,
which we basically don't check for at all right now). Remove duplicate
block_error() function, which fixed that through truncation. Remove
old (incompatible) mmx/sse2 block_error SIMD versions and replace with
a new one that returns a 64bit value.
Encoding time of first 50 frames of bus @ 1500kbps goes from 3min29 to
3min23, i.e. a 3% overall speedup.
Change-Id: Ib71ac5508b5ee8a80f1753cd85d72df1629abe68
Encoding of bus @ 1500kbps (first 50 frames) goes from 3min57 to
3min35, i.e. approximately a 10.5% speedup. Note that the SIMD versions
which use a bilinear filter (x_offset & 7 || y_offset & 7) aren't
perfectly interleaved, and can probably be improved further in the
future. I've marked this with a few TODOs/FIXMEs in the code.
Change-Id: I5c9e900c0f0d32e431a50fecae213b510b2549f9
The new print out includes skips and has prefixed sections so you can
grep to find things like transforms chosen on each frame.
Change-Id: I195043424647d9514cfc3ff6720a5b20d010fa1b
This commit makes use of dual fdct32x32 versions for rate-distortion
optimization loop and encoding process, respectively. The one for
rd loop requires only 16 bits precision for intermediate steps.
The original fdct32x32 that allows higher intermediate precision (18
bits) was retained for the encoding process only.
This allows speed-up for fdct32x32 in the rd loop. No performance
loss observed.
Change-Id: I3237770e39a8f87ed17ae5513c87228533397cc3
This seems to only be used in the encoder. Also remove an empty wrapper
file that contained forward declarations for this function, but didn't
actually define any actual functions.
Change-Id: Ifc561eef7ebe374a7d03698055e51e105f6d614b
vp9_default_inter_mode_probs was being accessed with a different type
than it was defined with. Ensure that its declaration is included
prior to its definition.
Change-Id: I2f963f513ab2f4e339f8a3c17e3d0f03749eba16
All elements of this table are equal to 252, so replace it with a
single constant VP9_COEF_UPDATE_PROB.
Change-Id: I1e2d1d284326ce6df9899a740c2fc344b3ec81c9
The encoding time for bus at CIF goes from 661s to 625s. This commit
also enabled unit test of sad8x4/4x8 in sad_test.cc.
Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1
The encoding time for bus at CIF goes from 661s to 625s. This commit
also enabled unit test of sad8x4/4x8 in sad_test.cc.
Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1
Modified to work with 8x8 blocks of memory. Will revisit
later for further optimizations. For the HD clip used, the
decoder improved by almost 20%.
Change-Id: Iaa4785be293a32a42e8db07141bd699f504b8c67
Modified to work with 8x8 blocks of memory. Will revisit
later for further optimizations. For the HD clip used, the
decoder improved my 20%.
Change-Id: Ia0057f55d66d1445882351ea6c43b595a5a980e5
This commit fixed the allowable partition types for bottom-right
corner blocks.
When a block has over half of its pixels as valid content in both
vertical and horizontal directions, allow all the four partition
types in the bit-stream. Otherwise, apply partition type constraints.
Change-Id: I2252e2de7125a8bfb1c824bf34299a13c81102e3
* New probs for subpel filters/tx_count
* Makes a change to not reset to defaults for the tx_size
probs if an intermediate frame reverts to using a fixed tx_size.
* A few updates to the parameters for backward adaptation for mode/mv
* some cosmetic cleanups
derf300: +0.06%
Change-Id: I22994d659bc31ca7a4fc8820fde24001e64a2920
Remove the bilinear filter mode, and the no-loopfilter mode, and the
related vp9_setup_version() function.
Change-Id: I32311367812faf37863131df3af37d63d03973d7
This commit has no impact but to help us debug issues. To Use call like
this:
vp9_print_modes_and_motion_vectors(cpi->common.mi, cpi->common.mi_rows,
cpi->common.mi_cols,
cpi->common.current_video_frame,
"decode_mi.stt");
Change-Id: I89e27725dae351370eb7f311a20a145ed4f1d041
No bitstream change.
Removes unused filters and the code for the case of 2 switchable filters;
also changes the 8tap-smooth filter coefficients for integer shifts to be
interpolating to be consistent with the way it is implemented currently.
Change-Id: I96c542fd8c06f4e0df507a645976f58e6de92aae
Implements ability to signal and decode frames that are
encoded using only intra coding modes. Only the decode
side has been implemented here.
Change-Id: I53ac6a8d90422cd08ba389e5236e15b45f9e93de
Change the argument of get_uv_tx_size() to be an MBMI pointer, so that the
correct column's MBMI can be passed to the function.
Change-Id: Ied6b8ec33b77cdd353119e8fd2d157811815fc98
Fixed point scaling factors are calculated once for each
reference frame by using integer division. Otherwise fixed point
scaling routines are used in all scaling calculations. This makes it
possible to calculate fixed point scaling factors on device driver
software and pass them to hardware and thus avoid division on hardware.
TODO:
- Missing check for maximum frame dimensions
(currently scaling uses 14 bits)
- Missing check for maximum scaling ratio
(upscaling 16:1, downscaling 2:1)
Problems:
- Straightforward fixed point implementation can cause error +-1
compared to integer division (i.e. in x_step_q4). Should only
be an issue for frames larger than 16k.
Change-Id: I3cf4dabd610a4dc18da3bdb31ae244ebaf5d579c
Was always using sb_type of first column in a row of 8x8 units when
determining decoded block edges as a subcondition for loop filter
skipping.
Change-Id: Ib17554633a63a90b70cdaa7bed65db035a8ad9d8
Reduces TX_SIZE contexts to 2 for each kind. The code is
cleaner and there is hardly any performance difference with
more than two contexts.
Results: almost neutral
Change-Id: I17656bd6db76224ae2856adf882504560e7dbaa4
Made changes to the frame header to write the sync
code in the frame header for a non-displayable,
intra-only frame.
Extended reset_frame_context to 2-bits.
(Submitting on behalf of Dmitri)
Change-Id: Ie836ae0df9ed572fb4f08aabe9351a555c4f3b96
Adds coding of transform size within a frame by use of context
of transform sizes selected in left and above blocks.
Also incorporates code for generating stats.
TODO: generate and incorporate new default stats
Change-Id: I6a7af099f6ad61d448521d9a51167aedaf638ed6
Refactors mbskip coding to be compatible with coding of the rest of
the symbols. Adds forward/backward adaptation and removes a lot of
the legacy code.
Results:
fast50: +1.6%
derfraw300: +0.317%
Change-Id: I395a2976d15af044d3b8ded5acfa45f6f065f980
The partition types of blocks sitting on the frame boundary are
constrained by the block size and the position of each sub-block
relative to the frame. Hence we use truncated probability models
to handle the coding of such information.
100 frames run:
yt 0.138%
Change-Id: I85d9b45665c15280069c0234ea6f778af586d87d
This can only happen if partition is partly out-of-frame, in which
case the referenced mv is either out-of-frame also (and thus has the
same value as an already-read one), or it is actually uninitialized,
in which case we don't want to use it.
Change-Id: Icf39fa4d987c7abcbebb9bbdcdd6311e8fb9d3c9
With the removal of i4X4 and SPLIT_MV modes, the two entries for the
modes are no longer used. This patch remove the coding of the deltas.
Change-Id: Iea4eb500404ebe9706159380a03b8eca542fb4c3
Changes to the coding of transform sizes, along with forward
and backward probability updates.
Results:
derf300: +0.241%
Context based coding of transform sizes will be in a separate
patch.
Change-Id: I97241d60a926f014fee2de21fa4446ca56495756
Added condition to not to skip filtering of transform block edges when
the edge is also a decoding block edge.
Change-Id: Iaccb6206c4202b78e5dca3b89379556e0f4aba0c
Wrong max data size (skip has no data) and use of vp9_get_segdata()
when it should be vp9_segfeature_active().
Change-Id: I1eb97d33df6e2a42cc589049f704266fe3639902
ref_frame in MB_Mode_Info was changed in the ref frame coding patch
to be an array to handle first and second reference frame, this patch
fix the loop filter code that use the pointer directly as reference
frame.
Change-Id: I71afa5a49deb50c1bc38029fd07470b984c6dfe9
Code intra/inter, then comp/single, then the ref frame selection.
Use contextualization for all steps. Don't code two past frames
in comp pred mode.
Change-Id: I4639a78cd5cccb283023265dbcc07898c3e7cf95
Split partition probabilities between keyframes and non-keyframes,
since they are fairly different. Also have per-blocksize interframe
y intramode probabilities, since these vary heavily between different
blocksizes.
Lastly, replace default probabilities for partitioning and intra modes
with new ones generated from current codec. Replace counts with actual
probabilities also.
Change-Id: I77ca996e25e4a28e03bdbc542f27a3e64ca1234f
This version of the loop filter supports non-4:2:0 subsampling and
a fourth plane, as well as changing the filtering order to be more
friendly to hardware implementations.
The filters are applied first to all vertical edges within the
64x64 SB, followed by the top horizontal edge and any internal
horizontal edges. Since filtering is applied on each 4x4 edge
serially, a dependency is created from filtering one block edge
to the next. It would be possible to remove this depencnecy by
building all filtering decisions from the unfiltered
reconstruction data.
Change-Id: I08f3e9683eb7bded8a76651cbc50fc0dfdd05fa7
This avoids encoding tokens for blocks that are entirely
in the UMV border. This changes the bitstream.
Change-Id: I32b4df46ac8a990d0c37cee92fd34f8ddd4fb6c9
This commit makes the coding/reconstruction operations of intra
coding rate-distortion loop for UV components consistent with those
of the encoding process.
key frame coding gains:
derf: 0.11%
stdhd: 0.42%
Change-Id: I8d49f83924a320e3689ef2d60096c49d7f0c7a40
Adds backward adaptation and differential forward updates of switchable
interpolation filter probabilities. Also adds some cosmetic cleanups
and minor fixes on mv_ref probabilities.
derfraw300: +0.353% (with most coming from switchable interp changes)
Change-Id: Ie2718be73528c945fd0d80cfd63ca2d9cb3032de
Restrict get_matching_candidate() to considering
mvs at 8x8 and larger sizes for last frame case.
This is to reduce the HW load of using vectors down
to the 4x4 level from the previous frame.
Change-Id: I6505e610fd63a4e22d67f136aec7905a01b893ba
This speed 1 - uses variance threshold stolen from static-thresh
to determine split. Any superblock with greater than the variance
set by static thresh * quantizer index squared is split. In addition
transform size is set to largest size less than or equal to partition
size, sub pixel filter is set to normal, and only 12 modes are used
at all.
Change-Id: If7a2858ee70f96d1eb989c04fd87a332b147abef
We leave it in rdopt.c as a local define for now - this can be removed
later. In all other places, we remove it, thereby slightly decreasing
the size of some arrays in the bitstream.
Change-Id: Ic2a9beb97a4eda0b086f62c039d994b192f99ca5
It remains as a local define in rdopt.c so we can distinguish between
split and non-split modes in the RD loop, but disappears outside that
scope in the codec.
Change-Id: I98c18fe5ab7e4fbd1d6620ec5695e2ea20513ce9
The commit changed to use a new variant of Walsh-Hadamard Transform
by Tim Terriberry. This new variant has the best compression among a
number of variants that developed by Tim.
Change-Id: Icb3a88515463cfc644b17ca046fcd139db2557e9
The first 240 coeff positions (15 top-left blocks) are scanned in the
same order as in scatter scan, after that the coeffs are scanned in
"block bands", each band at a time, all coeffs in one band before
moving on to the next band. This brings down the amount of 4x4 coeff
blocks that need to be buffered while scanning, from 15 blocks to 8 blocks.
Change-Id: I478a991d63c48bd5e64d36e59fed7a00c9a651ba
This patch removes the implicit segmentation
experiment from the code base as the benefits
were still unproven as of the bitstream deadline.
Change-Id: I273b99d8d621d1853eac4182f97982cb5957247e
Added two flags to the frame header:
intra_only:
Signals that the frame is encoded using only INTRA
coding modes.
reset_frame_context:
Indicates that the coding context specified
in the frame header should be reset to default values before the
frame is encoded/decoded.
Change-Id: I182d46f1f84fb67a13c46ad767f246a38d7861a2
This patch changes the coefficient tree to move the EOB to below
the ZERO node in order to save number of bool decodes.
The advantages of moving EOB one step down as opposed to two steps down
in the other parallel patch are: 1. The coef modeling based on
the One-node becomes independent of the tree structure above it, and
2. Fewer conext/counter increases are needed.
The drawback is that the potential savings in bool decodes will be
less, but assuming that 0s are much more predominant than 1's the
potential savings is still likely to be substantial.
Results on derf300: -0.237%
Change-Id: Ie784be13dc98291306b338e8228703a4c2ea2242
This patch checks at the frame level to see if the previous
mode info context can be used. This patch eliminates the
flag check that was done for every mode and removes another
check that was done prior to every vp9_find_mv_refs().
Change-Id: I9da5e18b7e7e28f8b1f90d527cad087073df2d73
Proposal for tuning the residual coding by changing how the context
from previous tokens is calculated. Storing the energy class of previous
tokens instead of the token itself eases the critical path of
HW implementations.
Change-Id: I6d71d856b84518f6c88de771ddd818436f794bab
Adding API to read/write uncompressed frame header bits (it is not final
yet). Separate functions to read/write uncompressed header. Moving
clr_type, error_resilient_mode, refresh_frame_context,
frame_parallel_decoding_mode, frame_context_idx from compressed partition
to uncompressed frame header.
Change-Id: Id3ed8a387980c652ae147549412f4ec24a0a5bd0
Removed one 4x4 prediction step that was unnessary in the rd loop.
Removed a unused modecosts estimate from encoder side.
Change-Id: I65221a52719d6876492996955ef04142d2752d86
1. remove prediction mode conversion
2. unified bmode, same for key and non-key frame
3. set I4X4_PRED count for pdf to 0, as I4X4_PRED is no longer
coded ever. It is determined by ref_frame and block partition
Change-Id: If5b282957c24339b241acdb9f2afef85658fe47d
This commit removes the use of bmi_ in the first-pass encoding by
forcing encode_intra4x4block_ to use DC_PRED, followed by DCT_DCT
only, as John suggested. This makes the need for bmi buffer only
up to 4 entries, instead of 16.
Change-Id: I3410007dfae789ee46a09ae20c39d3ce3c7954aa
Also do per-partition motion vector referencing in <sb8x8 partitions,
and adjust mvref finding for sub8x8 partitions.
Change-Id: Id3ed1ed4d2a8910d11d327db6cc63b8eb79f941f
The changing in intra coding to base on transform block, i.e. pred->
txfm->quant->dequant-itxfm->recon, made all blocks within a prediction
unit behave consistently, there is no longer a need to handle blocks
differently based on the position within a predicitn block. So this
commit simplifies the decision of transform type to be based on
prediction mode only.
Change-Id: If96cb72386f2e9186126ace88afa35ef085b6c96
Move 4x4/4x8/8x4 partition coding out of experimental list.
This commit fixed the unit test failure issues. It also resolved
the merge conflicts between 4x4 block level partition and iterative
motion search for comp_inter_inter.
Change-Id: I898671f0631f5ddc4f5cc68d4c62ead7de9c5a58
Reverts to using 128 bit LUT for the coef models rather than 48
to ease hardware implementation.
Also incorporates some cleanups including removing various
hooks to support different lookup tables based on block_type and
ref_type.
Change-Id: I54100c120cca07a2ebd3a7776bc4630fa6a153f6
This commit changed the encoding and decoding of intra blocks to be
based on transform block. In each prediction block, the intra coding
iterates thorough each transform block based on raster scan order.
This commit also fixed a bug in D135 prediction code.
TODO next:
The RD mode/txfm_size selection should take this into account when
computing RD values.
Change-Id: I6d1be2faa4c4948a52e830b6a9a84a6b2b6850f6