- Add unit tests to verify the bit-exact result.
- User level time reduction (EXT_TX):
encoder: 3.63%
decoder: 2.36%
- Also add tx_type=V_DCT...H_FLIPADST SSE2 for 16x16 inv txfm.
Change-Id: Idc6d9e8254aa536e5f18a87fa0d37c6bd551c083
- Use range check function to avoid DCT_DCT overflow.
We need to re-develop the column txfm side scaling/rounding. Now,
we prefer to maintain the current BDRate level.
- Encoder user level time reduction <1% owing to av1_fht32x32_avx2.
- Add MemCheck unit test and fdct32() unit test.
Change-Id: I1e67030f67bc637859798ebe2f6698afffb8531c
- av1_fht32x32 AVX2 function level time reduction ~89% compared to C.
- av1_fht32x32_avx2() on DCT_DCT improves 42.62% over aom_fdct32x32_avx2()
But function replacement must go with the corresponding inverse txfm.
- No obvious user level time reduction due to 32x32 TX_TYPE selection.
- Zero high 128b YMM to avoid AVX-SSE transition penalties
(fix 16x16 case).
- Added 32x32 AVX2 unit tests to verify bitexact.
- AVX2 optimization summary:
On CPU i7-6700, based on 16x16/32x32 fwd txfm optimization results:
C to AVX2: function level time reduction, ~86-89%.
SSE2 to AVX2: function level time reduction, ~51%.
Change-Id: Idd0cd8bf066a61c7117140ef15ab6c1f8eb4b036
- Unit tests are added for AVX2 SIMD.
- Encoder speed improvement:
AV1 baseline and EXT_TX, three 1080p sequences at bitrate:
800 Kbps, 2 Mbps, 6 Mbps, on i7-6700 CPU, average
user level time reduction: 3.86%.
Change-Id: Ibbd7837ee3a831c6b1e4e471bf6c8d3fa3a19ff4