Replace by CAST_TO_BYTEPTR/SHORTPTR.
The rule is: if a short ptr is cast to a byte ptr, any offset
operation on the byte ptr must be doubled. We do this by casting to the
short ptr first, adding the offset, then casting back to a byte ptr.
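A minimal sketch of the rule, assuming the plain-cast definitions of
the two macros (the helper name is illustrative, not a libvpx
function):

  #include <stdint.h>

  #define CAST_TO_SHORTPTR(x) ((uint16_t *)(x))
  #define CAST_TO_BYTEPTR(x) ((uint8_t *)(x))

  /* Advance a byte ptr that actually addresses uint16_t samples by
     `offset` samples: cast to short ptr, add, cast back. Equivalent
     to adding 2 * offset bytes. */
  static uint8_t *byteptr_offset(uint8_t *buf, int offset) {
    return CAST_TO_BYTEPTR(CAST_TO_SHORTPTR(buf) + offset);
  }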
BUG=webm:1388
Change-Id: I9e18a73ba45ddae58fc9dae470c0ff34951fe248
Buffers on 32-bit x86 builds are only guaranteed 8-byte alignment.
Fixed with "AvgPred test: use aligned buffers" and "sad avg: align
intermediate buffer".
Also re-enable asserts on the C version.
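A sketch of the buffer fix, using libvpx's DECLARE_ALIGNED from
vpx_ports/mem.h (the buffer name and size are illustrative):

  #include "vpx_ports/mem.h"

  /* 32-bit x86 stacks guarantee only 8-byte alignment, so explicitly
     request the 16 bytes that aligned SSE2 loads/stores expect. */
  DECLARE_ALIGNED(16, uint8_t, avg_buf[64 * 64]);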
BUG=webm:1390
Change-Id: I93081f1b0002a352bb0a3371ac35452417fa8514
Provides over 15x speedup for width > 8.
Due to smaller loads and shifting, width == 8 gets about an 8x
speedup.
For width == 4 it's only about a 4x speedup because a lot of shuffling
and shifting is needed to get the data properly situated.
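A minimal sketch of the wide path (width >= 16); the function name and
loop structure are illustrative, not the shipped kernel:

  #include <emmintrin.h>
  #include <stdint.h>

  static void avg_pred_wide(uint8_t *comp, const uint8_t *pred,
                            const uint8_t *ref, int width, int height,
                            int ref_stride) {
    int x, y;
    for (y = 0; y < height; ++y) {
      for (x = 0; x < width; x += 16) {
        /* One full 16-byte load per operand, then a rounded byte
           average. The width == 8 and width == 4 paths need narrower
           loads plus shifts/shuffles to fill a vector. */
        const __m128i p = _mm_loadu_si128((const __m128i *)(pred + x));
        const __m128i r = _mm_loadu_si128((const __m128i *)(ref + x));
        _mm_storeu_si128((__m128i *)(comp + x), _mm_avg_epu8(p, r));
      }
      comp += width;
      pred += width;
      ref += ref_stride;
    }
  }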
BUG=webm:1390
Change-Id: Ice0b3dbbf007be3d9509786a61e7f35e94bdffa8
For 8-bit, the subtrahend is small enough to fit into uint32_t.
This is the same as was done in:
c0241664a Resolve -Wshorten-64-to-32 in variance.
For 10/12-bit, apply:
63a37d16f Prevent negative variance
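A sketch of the 8-bit case (the helper name is illustrative):

  #include <stdint.h>

  /* |sum| is at most 255 * w * h for 8-bit input, so the subtrahend
     sum * sum / (w * h) is at most 255 * 255 * w * h, which fits in
     uint32_t for every block size up to 64x64; the cast resolves
     -Wshorten-64-to-32. */
  static uint32_t variance_8bit(uint32_t sse, int sum, int w, int h) {
    return sse - (uint32_t)(((int64_t)sum * sum) / (w * h));
  }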
Change-Id: Iab35e3f3f269035e17c711bd6cc01272c3137e1d
d207/d63/d45/d117/d135/d153
~9-45% faster depending on the predictor on 32-bit ARM; a similar
range on x86-64.
This matches the non-highbitdepth implementation.
BUG=webm:1316
Change-Id: Iddebdf7c58c6f31c47cae04da95c6e5318200e4c
- Refer to patch: 48fca113d inv_txfm_ssse3,butterfly: fix win32 abi
compatibility.
- Change four butterfly() calls to butterfly_self() to simplify the
  operations (see the sketch below).
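A rough scalar illustration of the two helpers (the real ones operate
on __m128i vectors, and the exact sign conventions vary by call site):

  /* butterfly(): rotate an input pair by two cospi constants into
     separate outputs. DCT_CONST_BITS is 14 in libvpx. */
  static void butterfly(int in0, int in1, int c0, int c1, int *out0,
                        int *out1) {
    *out0 = (in0 * c0 + in1 * c1 + (1 << 13)) >> 14;
    *out1 = (in0 * c1 - in1 * c0 + (1 << 13)) >> 14;
  }

  /* butterfly_self(): the in-place variant; the results overwrite the
     inputs, so the converted call sites need no extra outputs. */
  static void butterfly_self(int *x0, int *x1, int c0, int c1) {
    butterfly(*x0, *x1, c0, c1, x0, x1);
  }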
Change-Id: Ib2a8cfe6cddcaf0a59e6e6270d8380055ea42ef3
Similar issue to Change bc1c18e.
The PartialIDctTest.ResultsMatch test on vpx_idct32x32_135_add_neon()
in high bit-depth mode exposes a 16-bit overflow in the final stage of
pass 2 when the test count is raised from 1,000 to 1,000,000.
Change to use saturating add/sub for vpx_idct32x32_34_add_neon(),
vpx_idct32x32_135_add_neon() and vpx_idct32x32_1024_add_neon() in high
bit-depth mode.
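A minimal sketch of the change (the helper name is illustrative):

  #include <arm_neon.h>

  /* In high bit-depth mode the stage outputs can leave the int16_t
     range; vqaddq_s16 saturates at INT16_MIN/INT16_MAX where
     vaddq_s16 would wrap. vqsubq_s16 replaces vsubq_s16 likewise. */
  static int16x8_t final_stage_add(const int16x8_t a,
                                   const int16x8_t b) {
    return vqaddq_s16(a, b); /* was: vaddq_s16(a, b) */
  }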
Change-Id: Iaec0e9aeab41a3fdb4e170d7e9b3ad1fda922f6f
Only the first 3 parameters can be 16-byte aligned as required by
__m128i, so make them all pointers for consistency.
The issue has existed since:
07c48ccfe Improve idct32x32_34_add SSSE3 intrinsics performance
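A sketch of the signature change (parameter names illustrative):

  #include <emmintrin.h>

  /* Before, which breaks on win32 because only the first 3 __m128i
     parameters can be given the 16-byte alignment __m128i requires:

     void butterfly(const __m128i x0, const __m128i x1,
                    const __m128i c0, const __m128i c1,
                    __m128i *y0, __m128i *y1);

     After, with every input passed by pointer: */
  void butterfly(const __m128i *x0, const __m128i *x1,
                 const __m128i *c0, const __m128i *c1,
                 __m128i *y0, __m128i *y1);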
BUG=webm:1384
Change-Id: I0324f701e723a27cb470036a180693ba8829d01d
- Split the inv txfm into three parts to avoid stack spillover.
- Function-level speed improves ~12%.
- Use a function and a macro to remove some repeated code.
Change-Id: I14f5f072334fd766808cb52bf648df792e7379ee
Most are cosmetic changes.
Speed is unchanged with clang 3.8 and about 5% faster with gcc 4.8.4.
Tried the strategy used in 8x8 and 16x16 (where the order of
operations is similar to the C code); though speed gets better with
gcc, it's worse with clang.
Tried to remove store_in_output(), but speed got worse.
Change-Id: I93c8d284e90836f98962bb23d63a454cd40f776e
When eob is less than or equal to 135 for high-bitdepth 32x32 idct,
call this function.
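A sketch of the caller-side dispatch, assuming the eob thresholds
mirror the existing 1/34/1024 NEON variants:

  if (eob == 1) {
    vpx_highbd_idct32x32_1_add_neon(input, dest, stride, bd);
  } else if (eob <= 34) {
    vpx_highbd_idct32x32_34_add_neon(input, dest, stride, bd);
  } else if (eob <= 135) {
    vpx_highbd_idct32x32_135_add_neon(input, dest, stride, bd); /* new */
  } else {
    vpx_highbd_idct32x32_1024_add_neon(input, dest, stride, bd);
  }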
BUG=webm:1301
Change-Id: I8a5864f5c076e449c984e602946547a7b09c9fe6
- Split the transform into a first half and a second half.
- Reschedule the instructions to avoid stack spillover.
- Function-level speed improves ~16%.
Change-Id: I166889840d23aa8a273eca00f6fbdae8b4566f35