This commit further optimizes the SSE2 operations in the second 1-D
inverse 16x16 DCT with fewer than 10 non-zero coefficients. The average
runtime of this module goes down from 779 cycles -> 725 cycles.
Change-Id: Iac31b123640d9b1e8f906e770702936b71f0ba7f
Optimizing all SSSE3 assembly for convolution:
1. vp9_filter_block1d4_h8_sse2
2. vp9_filter_block1d8_h8_sse2
3. vp9_filter_block1d16_h8_sse2
4. vp9_filter_block1d4_v8_sse2
5. vp9_filter_block1d8_v8_sse2
6. vp9_filter_block1d16_v8_sse2
The optimizations include:
- processing 2x8 elements in one 128-bit register instead of processing
  8 elements in one 128-bit register.
- removing unnecessary loads.
This optimization gives a 2.4% user-level gain for 480p input
and a 1.6% user-level gain for 720p.
This optimization is done only for 64-bit.
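
For reference, a minimal scalar sketch of what an 8-tap horizontal filter
computes per pixel. This is not the assembly from this commit: the function
name filter_row_h8_c is hypothetical, and the 7-bit filter precision (taps
summing to 128) is an assumption stated in the code.

```c
#include <stdint.h>

/* Assumption: 7-bit filter precision, i.e. the 8 taps sum to 128. */
#define FILTER_BITS 7

static uint8_t clip_pixel(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical scalar reference: filters `width` pixels of one row.
 * `src` needs 3 readable pixels before it and 4 after src[width - 1]. */
static void filter_row_h8_c(const uint8_t *src, uint8_t *dst, int width,
                            const int16_t taps[8]) {
  for (int x = 0; x < width; ++x) {
    int sum = 0;
    /* 8 taps centered so that taps[3] lines up with src[x]. */
    for (int k = 0; k < 8; ++k) sum += src[x - 3 + k] * taps[k];
    /* Round to nearest, drop the fractional bits, clamp to 8 bits. */
    dst[x] = clip_pixel((sum + (1 << (FILTER_BITS - 1))) >> FILTER_BITS);
  }
}
```

A SIMD version evaluates this inner product for several output pixels at
once; packing 2x8 elements into one 128-bit register, as described above,
doubles the work done per register.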
Change-Id: Icb586dc0c938b56699864fcee6c52fd43b36b969
This commit is the first patch optimizing the SSE2 implementation of the
inverse 16x16 DCT with <10 non-zero coefficients. It focuses on the first
1-D (row) transformation. It exploits the fact that only the top-left 4x4
block contains non-zero coefficients in a 2-D inverse 16x16 DCT with <10
coefficients.
The average runtime of idct16x16_10 unit is reduced from
883 cycles -> 779 cycles (12% faster).
For pedestrian_area_1080p 300 frames at 4000 kbps, the speed 2 runtime goes
down from 310651 ms -> 305910 ms. The decoding speed goes up from
80.37 fps -> 80.87 fps.
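
The sparsity argument above can be sketched in scalar C. This is only an
illustration, not the SSE2 code: basis() is a stand-in integer transform
rather than the real DCT constants, and all names are hypothetical. It shows
that when everything outside the top-left 4x4 is zero, the row pass only has
to touch 4 rows and 4 coefficients per row.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_DIM 16

/* Stand-in integer basis, NOT the real DCT constants: any fixed linear
 * 1-D transform has the same sparsity property. */
static int basis(int k, int n) { return ((k * 7 + n * 3) % 11) - 5; }

/* Full row pass: every row, all 16 input coefficients. */
static void row_pass_full(const int16_t *in, int32_t *out) {
  for (int r = 0; r < BLOCK_DIM; ++r)
    for (int n = 0; n < BLOCK_DIM; ++n) {
      int32_t acc = 0;
      for (int k = 0; k < BLOCK_DIM; ++k)
        acc += in[r * BLOCK_DIM + k] * basis(k, n);
      out[r * BLOCK_DIM + n] = acc;
    }
}

/* Sparse row pass: valid only if all coefficients outside the top-left
 * 4x4 are zero. Rows 4..15 transform to zero, and rows 0..3 need only
 * their first 4 coefficients. */
static void row_pass_top_left_4x4(const int16_t *in, int32_t *out) {
  memset(out, 0, sizeof(*out) * BLOCK_DIM * BLOCK_DIM);
  for (int r = 0; r < 4; ++r)
    for (int n = 0; n < BLOCK_DIM; ++n) {
      int32_t acc = 0;
      for (int k = 0; k < 4; ++k)
        acc += in[r * BLOCK_DIM + k] * basis(k, n);
      out[r * BLOCK_DIM + n] = acc;
    }
}
```

In this sketch the sparse pass performs 4 * 16 * 4 multiply-adds instead of
16 * 16 * 16 and produces the same output for such inputs.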
Change-Id: Ic6f3ac5a637a76c07ba73ddaafe318a699fea645
Optimizing the variance functions vp9_variance16x16, vp9_variance32x32,
vp9_variance64x64, vp9_variance32x16, vp9_variance64x32 and
vp9_mse16x16 by migrating them to AVX2.
Some of the functions were optimized by processing 32 elements instead of 16.
Some of the functions were optimized by processing 2 loop strides of 16
elements in a single 256-bit register.
This optimization gives a 2.4% - 2.7% user-level performance gain
and a 42% function-level gain.
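
As a point of reference, a scalar sketch of the quantity a block variance
function computes. It is not the AVX2 code from this commit; the name
block_variance_c and its signature are hypothetical. It returns the sum of
squared differences minus the squared sum of differences divided by the
pixel count.

```c
#include <stdint.h>

/* Hypothetical scalar reference for a w x h block.
 * Writes the sum of squared differences to *sse and returns
 * SSE - sum^2 / (w * h), i.e. the variance scaled by the pixel count. */
static unsigned int block_variance_c(const uint8_t *src, int src_stride,
                                     const uint8_t *ref, int ref_stride,
                                     int w, int h, unsigned int *sse) {
  int64_t sum = 0;
  uint64_t sq = 0;
  for (int y = 0; y < h; ++y) {
    /* A SIMD version handles 16 or 32 of these pixels per iteration. */
    for (int x = 0; x < w; ++x) {
      const int d = src[x] - ref[x];
      sum += d;
      sq += (uint64_t)(d * d);
    }
    src += src_stride;
    ref += ref_stride;
  }
  *sse = (unsigned int)sq;
  return (unsigned int)(sq - (uint64_t)((sum * sum) / (w * h)));
}
```

With 8-bit pixels, one 256-bit register holds 32 of them, which is a full
row of a 32-wide block or two rows of a 16-wide block.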
Change-Id: I265ae08a2b0196057a224a86450153ef3aebd85d
This commit adds input/output ports to the IDCT8_1D macro function to
provide more flexibility in variable use. This allows several
buffer swap operations to be skipped.
Change-Id: I21f3450509537322293043b3281bfd3949868677
Adding RefBuffer to simplify reference buffer management. The struct has a
pointer to image data and scale factors relative to the current frame.
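
A rough sketch of what such a struct could look like. Only the two members
(image pointer and scale factors) come from the description above; every
type name, field name, the 14-bit fixed-point precision and the
ref_buffer_init helper are assumptions made for illustration.

```c
#include <stdint.h>

#define REF_SCALE_SHIFT 14 /* assumed fixed-point precision */

/* Hypothetical image descriptor. */
typedef struct {
  const uint8_t *data;
  int width, height, stride;
} ImageSketch;

/* Hypothetical scale factors, relative to the current frame. */
typedef struct {
  int x_scale_fp;
  int y_scale_fp;
} ScaleFactorsSketch;

/* Reference buffer: image data plus its scale vs. the current frame. */
typedef struct {
  const ImageSketch *buf;
  ScaleFactorsSketch sf;
} RefBufferSketch;

/* Hypothetical helper: binds a reference image and derives its scale
 * factors from the current frame size (cur_w x cur_h). */
static void ref_buffer_init(RefBufferSketch *rb, const ImageSketch *ref,
                            int cur_w, int cur_h) {
  rb->buf = ref;
  rb->sf.x_scale_fp = (ref->width << REF_SCALE_SHIFT) / cur_w;
  rb->sf.y_scale_fp = (ref->height << REF_SCALE_SHIFT) / cur_h;
}
```

Keeping both pieces in one struct means code that needs a reference frame
receives its scale factors along with it.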
Change-Id: If38eb1491ff687cc11428aee339f3e052e2c5d9e
This commit merges the initial buffer swap operations in idct8_1d_sse2
into the array transpose step, hence reducing the number of instructions
therein.
Change-Id: I219f6f50813390d2ec3ee37eecf2a4a2b44ae479
This commit optimizes the SSE2 implementation of idct8x8_10. It exploits
the fact that only the top-left 4x4 block contains non-zero coefficients,
and hence reduces the instructions needed.
The runtime of idct8x8_10_sse2 goes down from 216 to 198 CPU cycles,
estimated by averaging over 100000 runs. For pedestrian_area_1080p 300
frames coded at 4000kbps, the average decoding speed goes up from
79.3 fps to 79.7 fps.
Change-Id: I6d277bbaa3ec9e1562667906975bae06904cb180
Moving back to the scale_factors struct. x_offset_q4 and y_offset_q4 are
no longer needed because both values are calculated locally inside the
vp9_scale_mv function.
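
A minimal sketch of computing the q4 offsets locally while scaling a motion
vector. This is not the actual vp9_scale_mv body: the struct layouts, the
helper names and the fixed-point constants below are assumptions.

```c
#include <stdint.h>

#define REF_SCALE_SHIFT 14 /* assumed fixed-point precision of the scale */
#define SUBPEL_BITS 4      /* q4: 1/16-pel units */
#define SUBPEL_MASK ((1 << SUBPEL_BITS) - 1)

/* Hypothetical stand-ins for the scale factor and motion vector types. */
typedef struct { int x_scale_fp, y_scale_fp; } ScaleSketch;
typedef struct { int row, col; } MvSketch;

static int scale_value(int v, int scale_fp) {
  return (int)(((int64_t)v * scale_fp) >> REF_SCALE_SHIFT);
}

/* Scales `mv` for a block at pixel position (x, y). The sub-pel offsets
 * are derived here from (x, y), so no caller has to set them up first. */
static MvSketch scale_mv_sketch(MvSketch mv, int x, int y,
                                const ScaleSketch *sf) {
  const int x_off_q4 =
      scale_value(x << SUBPEL_BITS, sf->x_scale_fp) & SUBPEL_MASK;
  const int y_off_q4 =
      scale_value(y << SUBPEL_BITS, sf->y_scale_fp) & SUBPEL_MASK;
  const MvSketch out = {scale_value(mv.row, sf->y_scale_fp) + y_off_q4,
                        scale_value(mv.col, sf->x_scale_fp) + x_off_q4};
  return out;
}
```

Because the offsets are locals, they do not need to be stored in the struct
between calls.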
Change-Id: I78a2122ba253c428a14558bda0e78ece738d2b5b
Before mv scaling, x_offset_q4/y_offset_q4 must be calculated
by calling set_scaled_offsets(). Now the offset configuration cannot be
missed because it happens just before scale_mv().
Change-Id: I7dd1a85b85811a6cc67c46c9b01e6ccbbb06ce3a