We can optimize wedge partition selection by pre-computing the
residuals of the 2 underlying predictors, and then blend these
to compute the sse of the compound predictor, without actually
having to compute and subtract the compound predictor.
Similarly we can pre-compute a proxy array which we can use to
cheaply check which mask sign would have lower sse.
Details are in wedge_utils.c.
Mathematically these are equivalence transformations, but due to the
finite precision the encoder output will be perturbed, though on
average this should make 0% difference.
ext-inter gains about ~4.5% speedup.
Change-Id: Ib2657c3209ae161b4090b58b4b6c392641bf2792
xd->plane[0].n4_h and xd->plane[0].n4_w are not set at that point
when using supertx.
While this fixes the immediate crash described in the referenced
bug report, there are still issues in the ref-mv experiment that
causes these tests to fail, so they are kept disabled.
BUG=https://bugs.chromium.org/p/webm/issues/detail?id=1230
Change-Id: Ibf8ef02847a903f8d10e6be28e16694db10c75af
Replaced vpx_d45_predictor_4x4_ssse3(), vpx_d45_predictor_8x8_ssse3()
and vpx_d207_predictor_4x4_ssse3() with
created vpx_d45_predictor_4x4_sse2(), vpx_d45_predictor_8x8_sse2()
and vpx_d207_predictor_4x4_sse2() respectively.
It's mostly neutral or slightly worse than ssse3 in good cases and
better than ssse3 in the bad cases (but still worse than using the mmx
regs).
Change-Id: Ib0237ceb71d2c57b8a93fd3170330cfed9d56bdd
convert the random value to int16 before subtracting 256 from it; quiets
a ubsan (sanitize=integer) warning
BUG=webm:1225
Change-Id: Ibc2c5a21f30e112bd6c180f7d6a033327c38d0df
Function level timing test shows about 27% time saving on
a Xeon E5-2680 v2 desktop.
Rename vp9_dct_sse2.c to vp9_dct_intrin_sse2.c for vp9 and
rename dct_sse2.c to dct_intrin_sse2.c for vp10 to avoid
duplicate basenames.
Actually vp9_fwht4x4_mmx/sse2() and vp10_fwht4x4_mmx/sse2()
are identical. TODO: They should be unified later if there is
no intention to keep a duplicate.
Change-Id: I3e537b7bbd9ba417c606cd7c68c4dbbfa583f77d
input_, ref_input_ and output_ were being allocated with new[] followed
by vpx_memalign, remove the former
Change-Id: Ia16d0f9b9317042a24445095ad3c284f4e7bb481
Followed the code style of other lpf fuctions.
These 2 functions put 2 rows of data in a single xmm register,
so they have similar but not identical filter operations,
and cannot share the same macros.
Change-Id: I3bab55a5d1a1232926ac8fd1f03251acc38302bc
This is to replace vp10/common/reconinter.c:build_masked_compound.
Functionality is equivalent, but the interface is slightly more
generic.
Total encoder speedup with ext-inter: ~7.5%
Change-Id: Iee18b83ae324ffc9c7f7dc16d4b2b06adb4d4305
This reverts commit 2468163e0770108f5216b65445ce05a8241bca21.
causes valgrind errors for overread of buffer in SubpelVarianceTest
Change-Id: I448e52c76f815ac199305b71f7d169f2bc167679
- Confirm input coeff buffer is 16-byte aligned.
- sizeof() prefer variable name instead of type.
- Fix function name (Capital first letter then Pascal case).
- Long base class name uses a newline (with colon and 4 space indent).
- Remove a unnecessary reference function variable.
- Method declaration precedes variable declaration in class definition.
Change-Id: I317f7e679926b5219f58c5f7d14512e94985e7fe
- Integrate 5 flip transform types for each 4x4, 8x8, and 16x16
block, for experiment, EXT_TX.
- Encoder speed improves about 12%-15%.
- Update the unit tests for bit-exact result against C.
Change-Id: Idf27c87f1e516ca5b66c7b70142477a115404ccb