Use vpx_blend_a64_hmask and vpx_blend_a64_vmask to speed up
computing the obmc predictor. Clean up calc_target_weighted_pred.
Encoder speedup: 1.3%
Decoder speedup: 6.5%
Change-Id: I0c774fe53d22399e92a10d1daf3af0010d88d2c5
Use vpx_blend_a64_hmask and vpx_blend_a64_vmask to speed up
computing the supertx predictor.
Decoder speedup of up to 4% has been observed.
Change-Id: I255a5ba4cc24f78dc905d25b6e2f7fbafac13253
- Made source buffers pointers to const.
- Renamed vpx_blend_mask6b to vpx_blend_a64_mask. This is more
indicative that the function does alpha blending. The 6, or 6b
suffix was misleading, as the max mask value (64) does not fit into
6 bits.
- Added VPX_BLEND_* macros to use when needing to blend scalars.
- Use VPX_BLEND_A256 in combine_interintra to be more explicit about
the operation being done.
- Added versions of vpx_blend_a64_* which take 1D horizontal/vertical
masks directly and apply them to all rows/columns
(vpx_blend_a64_hmask and vpx_blend_a64_vmask). The SSE4.1 optimzied
horizontal version now falls back on the 2D version. This can be
improved upon if it show up high enough in a profile.
- All vpx_blend_a64_* functions now support block sizes down to 1x1
(ie: a single pixel). This is for usage convenience. The SSE4.1
optimized versions fall back on the C implementation if
w <= 2 or h <= 2. This can again be improved if it becomes hot code.
Change-Id: I13ab3835146ffafe3e1d74d8e9cf64a5abe4144d
This reinstates commit efda2831e5f758b4f350679b5c55c0b9282449b0
without the tests and with fixes for 32 bit x86 builds.
Change-Id: I34be4fe1e8a67686d26ba256fd7efe0eb6a569e8
This commit makes the dual filter experiment work with non-420
settings. It fixes unit test failure in EndToEndTestLarge.
Change-Id: I04f7afdee78f91389d9ff72947efa152098af930
This reverts commit efda2831e5f758b4f350679b5c55c0b9282449b0.
This commit causes segmentation fault at SSE2/SumSquares2DTest.RandomValues/0
Change-Id: I171937e4daf6f15323e8206418773deb03bd8c53
We can optimize wedge partition selection by pre-computing the
residuals of the 2 underlying predictors, and then blend these
to compute the sse of the compound predictor, without actually
having to compute and subtract the compound predictor.
Similarly we can pre-compute a proxy array which we can use to
cheaply check which mask sign would have lower sse.
Details are in wedge_utils.c.
Mathematically these are equivalence transformations, but due to the
finite precision the encoder output will be perturbed, though on
average this should make 0% difference.
ext-inter gains about ~4.5% speedup.
Change-Id: Ib2657c3209ae161b4090b58b4b6c392641bf2792
This is purely a refactoring patch and has no functional effect.
Uses of these masks can be arranged such that all input blocks are
contiguous in memory (stride == block width). In this case 1D versions
of operations can be used. 1D vector operations have superior performance
over 2D block equivalents as they are more processor cache friendly and
they can do away with a second loop overhead.
Change-Id: I2b76c9888aea2c857cc497e8a4b2841fd3dad54e
Use the same rounding method that is used throughout the codebase,
where the halfway value is rounded up rather than down.
Change-Id: I04e92850bc69a7d7a07b06e3d2ce97f6f2ada321
Removing redundant calls to memcpy from
build_wedge_inter_predictor_from_buf yields a net 4% encoder speedup
with ext-inter only. The output is identical.
Change-Id: If97d4e323a5c8aca90c84a25a72085e006b05446
This is to replace vp10/common/reconinter.c:build_masked_compound.
Functionality is equivalent, but the interface is slightly more
generic.
Total encoder speedup with ext-inter: ~7.5%
Change-Id: Iee18b83ae324ffc9c7f7dc16d4b2b06adb4d4305
This commit makes the sub8x8 chroma component inter predictor
operate at 2x2 block level. This allows one to use the actual motion
vector associated with each individal pixel block. It improves the
compression performance
lowres 0.40%
midres 0.25%
hdres 0.15%
Change-Id: Ia40e07cc7fde463dbf660018850e024932136c4f
Increases number of wedges for smaller block and removes
wedge coding mode for blocks larger than 32x32.
Also adds various other enhancements for subsequent experimentation,
including adding provision for multiple smoothing functions
(though one is used currently), adds a speed feature that decides
the sign for interinter wedges using a fast mechanism, and refactors
wedge representations.
lowres: -2.651% BDRATE
Most of the gain is due to increase in codebook size for 8x8 - 16x16.
Change-Id: I50669f558c8d0d45e5a6f70aca4385a185b58b5b
Improves performance by 0.2%
lowres: -2.052% BDRATE
Also increases precision of the shift parameters (for further
investigation into different wedge shifts).
Change-Id: I59fcab9baa002e52a6487ed8d617185840a678ed
Improves speed by about 10-15% by combining y-only rd with
modeling function in a better way.
Also, coding efficiency improves by about 0.1%
lowres: -1.805% BDRATE with ext-inter
Change-Id: I6ef1f8942ec6806252f3fcf749ae4f30dffe42b1
When not using ext-tile, there were still dependencies between tile
rows due to various tools (eg intra predictors) relying on the above
row or above mode info, which can be in the above tile. This is now
broken (the same way as it was when ext-tile is enabled) by fixing
the appropriate predicates.
Change-Id: I107dd0d8481775a792f14e05cfbbd761f16cdc1e
When constructing the intra predictor for rectangular interintra blocks,
the last row/column of the first square is copied back into the source
image (which is the current reconstructed image buffer) before
predicting the second square. The code used to use the height instead
of width for vertical rectangles, and vice versa for horizontal
rectangles, leading to overwriting the block on the right/below. This
leads to an encode/decode mismatch if the right/below block is in a
different tile and is encoded before the current block, which did happen
with multi-threaded encoding tests. This is now fixed.
Change-Id: I073a2a447a98b842b1394d72cc774a78cb296921
This commit fixes the compiler error in high bit-depth inter
predictor when dual filter type experiment is turned on.
Change-Id: I404a76a246477f2fcffc38a3275007d5dfe229cd
Make the bit-stream level support per direction filter type coding
for motion compensated reference.
Change-Id: I61a2360b301075f6734cfd9711b7ae68f214174d
Fix 1: in ext-inter + obmc config, properly identify if the left
predictor used for obmc is a compound one in the case that the
neighbor uses wedgeinterinter pred and we will dump the ALTREF part.
This will fix the seg fault in unit test:
VP10/AltRefForcedKeyTestLarge.Frame1IsKey/0
Fix 2: in ext-tile + obmc experiment, handle the case that the
above block does not fit in the same row tile with the current one,
so as to prevent potential crashes.
Change-Id: I1c177d4f4ad15e10d11d8756e146496437753eea
Remove the restriction that the neighboring predictor cannot be
used in obmc prediction if it is an interintra or wedgeinterinter
block. The inter predictor of the interintra block, or the first
inter predictor(using LAST or GOLDEN frame) of the wedgeinterinter
block will be exploited in obmc prediction.
Coding gain: 0.248% (2.833%->3.081%) lowres
Change-Id: I4ac0368b9d2f2956f266b30c1ac97db8bafa0742
We removed this adaption, which intended to reduce the size of
overlapped region if the neighboring block is a non-skip one. Thus,
now the width/height of the overlapping region is fixed as a half of
the current block.
Performance improvement (lowres/midres): 0.111%/0.102%
Change-Id: Ife75dad9d4eb355c78a05178b50cc015c442884f