Split to load_input_data4() and load_input_data8().
Use pack with signed saturation instruction for high bitdepth.
Change-Id: Icda3e0129a6fdb4a51d1cafbdc652ae3a65f4e06
vpx_idct32x32_1024_add_ssse3() is actually a sse2 function and faster
than vpx_idct32x32_1024_add_sse2(). Replace the slow one. All are
code relocations, no new code.
Change-Id: I5dac0e98cc411a4ce05660406921118986638d19
It's almost identical with vpx_idct8x8_64_add_sse2(), except little
difference in instructions order.
Change-Id: Ie60dabc35eaa6ebae7c755e6cff00a710aad284f
- Refer to patch: 48fca113d inv_txfm_ssse3,butterfly: fix win32 abi
compatibility.
- Change four butterfly() calls to butterfly_self(), to simplify the
operations.
Change-Id: Ib2a8cfe6cddcaf0a59e6e6270d8380055ea42ef3
only the first 3 parameters can be aligned to 16 as required by __m128i,
make them all pointers for consistency.
since:
07c48ccfe Improve idct32x32_34_add SSSE3 intrinsics performance
BUG=webm:1384
Change-Id: I0324f701e723a27cb470036a180693ba8829d01d
- Split the inv txfm into three parts to avoid stack spillover.
- Function level speed improves ~12%.
- Use function and macro to remove some repeated code.
Change-Id: I14f5f072334fd766808cb52bf648df792e7379ee
- Split the transform into first half and second half.
- Reschedule the instructions to avoid stack spillover.
- Function level speed improves ~16%.
Change-Id: I166889840d23aa8a273eca00f6fbdae8b4566f35
- Replace the corresponding assembly code.
- No user level speed performance degrade.
- Unit tests passed.
Change-Id: Idd0c5a4bad4976f1617c34100cb46e75e3b961e5