This commit further optimizes SSE2 operations in the second 1-D
inverse 16x16 DCT, with (<10) non-zero coefficients. The average
runtime of this module goes down from 779 cycles -> 725 cycles.
Change-Id: Iac31b123640d9b1e8f906e770702936b71f0ba7f
Optimizing all SSSE3 assembly for convolution:
1. vp9_filter_block1d4_h8_sse2
2. vp9_filter_block1d8_h8_sse2
3. vp9_filter_block1d16_h8_sse2
4. vp9_filter_block1d4_v8_sse2
5. vp9_filter_block1d8_v8_sse2
6. vp9_filter_block1d16_v8_sse2
my optimization include:
-processing 2x8 elements in one 128 bit register instead of processing
8 elements in one 128 bit register.
-removing unecessary loads.
This optimization gives between 2.4% user level gain for 480p input
and 1.6% user level gain for 720p.
This Optimization done only for 64bit.
Change-Id: Icb586dc0c938b56699864fcee6c52fd43b36b969
This commit is the first patch optimizing SSE2 implementation of inverse
16x16 DCT with <10 non-zero coefficients. It focused on the first 1-D (row)
transformation. It exploits the fact that only top-left 4x4 block contains
non-zero coefficients, in a 2-D inverse 16x16 DCT with <10 coeffients.
The average runtime of idct16x16_10 unit is reduced from
883 cycles -> 779 cycles (12% faster).
For pedestrian_area_1080p 300 frames at 4000 kbps, the speed 2 runtime goes
down from 310651 ms -> 305910 ms. The decoding speed goes up from
80.37 fps -> 80.87 fps.
Change-Id: Ic6f3ac5a637a76c07ba73ddaafe318a699fea645
This commit adds input/output ports for IDCT8_1D macro function to
provide more flexibility in variable use. It allows to skip several
buffer swap operations.
Change-Id: I21f3450509537322293043b3281bfd3949868677
This commit merges the initial buffer swap operations in idct8_1d_sse2
into the array transpose step, hence reducing number of instructions
therein.
Change-Id: I219f6f50813390d2ec3ee37eecf2a4a2b44ae479
This commit optimizes the SSE2 implmentation of idct8x8_10. It exploits
the fact that only top-left 4x4 block contains non-zero coefficients,
and hence reduces the instructions needed.
The runtime of idct8x8_10_sse2 goes down from 216 to 198 CPU cycles,
estimated by averaging over 100000 runs. For pedestrian_area_1080p 300
frames coded at 4000kbps, the average decoding speed goes up from
79.3 fps to 79.7 fps.
Change-Id: I6d277bbaa3ec9e1562667906975bae06904cb180
The performance gain of idct16x16_10_add_sse2 function is not
noticeable. However since both functions use the IDCT16_1D,
idct16x16_10_add_sse2 should be modified as well.
Tested with: park_joy_420_720p50.y4m
Change-Id: I02b957e36fcf997c677d15baf496533895271bff
vp9_idct32x32_34_add_sse2:
speedup: 1.472
IDCT32_1D_34 and MULTIPLICATION_AND_ADD_2 are optimized
based on the fact that Only upper-left 8x8 has
non-zero values.
vp9_idct32x32_1024_add_sse2:
speedup: 1.032
Tested with: park_joy_420_720p50.y4m
Change-Id: I8670ce547552b48695049de298e2fc46ce28dfbc
This patch followed "Add filter_selectively_vert_row2 to enable
parallel loopfiltering" commit, and added x86 SSE2 optimization
to do 16-pixel filtering in parallel. For other optimizations
(neon and dspr2), current 16-pixel functions were done by calling
8-pixel functions twice, and real 16-pixel functions could be added
later.
Decoder speedup:
tulip clip: 2% speed gain;
old_town_cross: 1.2% speed gain;
bus: 2% speed gain.
Change-Id: I4818a0c72f84b34f5fe678e496cf4a10238574b7
Although no mismatch was indicated for 8/16 wide sub-pixel filters
in issue 661, they had similar problems that could cause mismatch
potentially. This patch fixed calculations in HORIZx8/16
and VERTx8/16.
Change-Id: I169961c9d40a20340995b7d22aafc89ccf30bfca
In commit "3d50da5397d20abc932d81453b26cde758293a40", the stack
pointer was modified while aligning the stack, and it needed to
be pop out at the end.
Change-Id: I062971e195f1f2ab9d0ab5fb84dcf215a0fcaa67
This patch fixed issue 661: "Decoder produces mismatched outputs
with ssse3 enabled and disabled." In sub-pixel filters, a pixel
value was multiplied by a filter coefficient, and the results
were added up. The order of adding up these multiplications had to
be arranged carefully to prevent incorrect overflowing.
Change-Id: Id08af4200fea9e1b896fc40157b8651c2c7e80f2
This rebase is a better implementation of the previous ones.
Modifications are done to reduce the total clock cycle.
Speedup: 1.341
Compiled with -O3
Tested with: park_joy_420_720p50.y4m
Change-Id: I940eaf283f60597ca0d9d2e13d518878d55ff02d
This patch followed "Rewrite filter_selectively_horiz for parallel
loopfiltering" commit, and added x86 SSE2 optimization to do
16-pixel filtering in parallel. Also, corrected the declaration
of aligned arrays. For 8-pixel-in-parallel case, improved the
calculation of the masks and filters. Updated the threshold loading
since the thresholds were already duplicated. Updated neon C functions
to call neon loopfilters twice.
Using tulip clip, tests showed it gave a ~1.5% decoder speed gain.
Change-Id: Id02638626ac27a4b0e0b09d71792a24c0499bd35
This CL contains two AVX2 optimized loop filter functions,
mb_lpf_horizontal_edge_w_avx2_8 and mb_lpf_horizontal_edge_w_avx2_16.
Change-Id: I604e4fe6e99752b7800c2ea98721d97f7e0b931b
When only upper-left 8x8 area has non-zero dct coefficients, we
could skip 1D IDCT for 9th to 32th rows to save operations. This
function is called when eob <= 34.
Change-Id: I9684b75947bdde346cfe3720f08a953aa7a13fb5
Simple modification to reduce number of cycles in the
function.
Original function number of cycles: 973
Modified function number of cycles: 835
Improvment factor: 1.165
Tested with: park_joy_420_720p50.y4m
Change-Id: Ic5857272ea3aafe21d5ef9a69258d78c688f69bd
This patch fixed a bug that caused 32bit PIC build mismatch. The
stack pointer was modified after "GET_GOT". Loading left pointer
from a hard-coded position gave wrong result.
Change-Id: Iea0aec6f917b12a6b3393ffc986bad74510248cc
We have two SSE2-optimized functions for idct4_1d:
vp9_idct4_1d_sse2 <-- removing this one
idct4_1d_sse2
vp9_idct4_1d_sse2 was used only by the following functions which already
have SSE2 optimized variants:
vp9_idct4x4_16_add_c -> vp9_idct4x4_16_add_see2
idct8_1d -> vp9_idct8x8_{16, 10, 1}_see2
vp9_short_iht4x4_add_c -> vp9_short_iht4x4_add_see2
Change-Id: Ib0a7f6d1373dbaf7a4a41208cd9d0671fdf15edb
To ensure fast encoding/decoding on devices without ssse3 support,
SSE2 optimization of sub-pixel filters was done. Test using 1080p
clip showed the decoder speeds were ~70fps with ssse3 filters, ~60fps
with sse2 filters, and ~15fps with c filters.
Change-Id: Ie2088f87d83a889fba80a613e4d0e287aadd785c
The idea is to have the following names for each transform size:
vp9_idct4x4_add
vp9_idct4x4_1_add
vp9_idct4x4_10_add
vp9_idct4x4_16_add
vp9_idct8x8_add
vp9_idct8x8_1_add
vp9_idct8x8_10_add
vp9_idct8x8_64_add
etc for 16x16, 32x32
The actual list of renames in this patch:
vp9_idct_add_lossless -> vp9_iwht4x4_add
vp9_short_iwalsh4x4_add -> vp9_iwht4x4_16_add
vp9_short_iwalsh4x4_1_add -> vp9_iwht4x4_1_add
vp9_idct_add -> vp9_idct4x4_add
vp9_short_idct4x4_add -> vp9_idct4x4_16_add
vp9_short_idct4x4_1_add -> vp9_idct4x4_1_add
Change-Id: I6f43f7437c68dd30cdd05d72e213765578ed30b1
In subpixel filters, prefetched source data, unrolled loops,
and interleaved instructions.
In HORIZx4, integrated the idea in Scott's CL (commit:
d22a504d11), which was suggested by
Erik/Tamar from Intel. Further tweaking was done to combine row 0,
2, and row 1, 3 in registers to do more 2-row-in-1 operations until
the last add.
Test showed a ~2% decoder speedup.
Change-Id: Ib53d04ede8166c38c3dc744da8c6f737ce26a0e3
Interleaved the instructions, reduced register dependency, and
prefetched the source data. This improved the decoder speed
by 0.6% - 2%.
Change-Id: I568067aa0c629b2e58219326899c82aedf7eccca
byte version of ronalds d153 ssse3 optimizations for
4x4 and 8x8
(commit: fc91a2a112238a1aee568f3b840585de4e928fca)
Change-Id: Iec4426032311483f615fd9e0dceba3ee85ddebd7
Current x86inc.asm didn't handle 32bit PIC build properly.
TEXTRELs were seen in the library built. The PIC macros from
libvpx's x86_abi_support.asm was used to fix this problem.
The assembly code was modified to use the macros.
Notes: We need this fix in for decoder building. Functions in
encoder will be fixed later.
Change-Id: Ifa548d37b1d0bc7d0528db75009cc18cd5eb1838
This is incompatible with most toolchains other than gcc.
Revert "Deleted #include <inttypes.h>"
This reverts commit 4d018be950.
This reverts commit d22a504d11.
Change-Id: I1751dc6831f4395ee064e6748281418e967e1dcf
Reformatted version of a patch submitted by Erik/Tamar
from Intel. For the test clips used, the decoder
performance improved by ~2%.
Change-Id: Ifbc37ac6311bca9ff1cfefe3f2e9b7f13a4a511b
This patch is a reformatted version of optimizations done by
engineers at Intel (Erik/Tamar) who have been providing
performance feedback for VP9. For the test clips used (720p, 1080p),
up to 1.2% performance improvement was seen.
Change-Id: Ic1a7149098740079d5453b564da6fbfdd0b2f3d2
This commit exploits the sparsity of quantized coefficient matrix.
It detects each 32x8 array and skip the corresponding inverse
transformation if all entries are zero.
For ped1080p at 8000 kbps, this on average reduces the runtime of
32x32 inverse 2D-DCT SSE2 function from 6256 cycles -> 5200
cycles. It makes the overall encoding process about 2% faster at
speed 0. The speed-up is more pronounceable for the decoding process.
Change-Id: If20056c3566bd117642a76f8884c83e8bc8efbcf
This commit provides special handle on 16x16 inverse 2D-DCT, where
only DC coefficient is quantized to be non-zero value.
Change-Id: I7bf71be7fa13384fab453dc8742b5b50e77a277c
This commit enables a special handle for the 8x8 inverse 2D-DCT,
where only DC coefficient is quantized to be non-zero. For bus_cif
at 2000 kbps, it provides about 1% speed-up at speed 0.
Change-Id: I2523222359eec26b144cf8fd4c63a4ad63b1b011
Add SSE2 implementation to handle the special case of inverse 2D-DCT
where only DC coefficient is non-zero.
Change-Id: I2c6a59e21e5e77b8cf39a4af5eecf4d5ade32e2f
This commit enables SSE2 implementation of 16x16 inverse ADST/DCT
hybrid transform. The runtime goes from 5742 cycles -> 1821 cycles.
This provides about 1% encoding speed-up at speed 0.
Change-Id: I1678d0988bf30b9efd524877705bbb3645edb17b
This commit enables SSE2 implementation of 8x8 inverse ADST/DCT
transform. The runtime goes from 1216 cycles -> 266 cycles.
For bus_cif at 2000 kbps, the overall runtime reduces from
253707ms -> 248430ms, i.e., 2% speed-up at speed 0.
Change-Id: Ib0372e17e9162d7b11a10d653b1c8be547c878fb
Independent horizontal and vertical implementations.
Requires that blocks be built from 4x4 and [xy]_step_q4 == 16
6-10% improvement. CIF improved the least.
Change-Id: I137f5ceae4440adc0960bf88e4453e55a618bcda
Enable SSE2 4x4 inverse ADST/DCT transform. The runtime goes from
292 cycles down to 89 cycles. Running bus_cif at 2000 kbps, the
overall runtime of speed 0 goes from 301s to 295s (2% speed-up).
Change-Id: I24098136e7fee7ab2fbf1c11755bdf2ca37f3628
Where possible, do the 16 pixel wide filter while doing the horizontal
filtering pass. The same approach can be taken for the mbloop_filter
when that's implemented. Doing so on the vertical pass is a little more
involved, but possible.
Change-Id: I010cb505e623464247ae8f67fa25a0cdac091320
This probably has a mildly negative impact on performance, but will
(in future commits - or possibly merged with this one) allow SIMD
implementations of individual intra prediction functions. We may
perhaps want to consider having separate functions per txfm-size
also (i.e. 4x4, 8x8, 16x16 and 32x32 intra prediction functions for
each intra prediction mode), but I haven't played much with that
yet.
Change-Id: Ie739985eee0a3fcbb7aed29ee6910fdb653ea269