This commit enables the SSSE3 version of the full inverse 8x8 2D-DCT
and reconstruction. It brings the runtime of vp9_idct8x8_64_add down
from 256 cycles (SSE2) to 246 cycles.
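For context, a scalar sketch of the reconstruction step this function
performs (a minimal illustration, not the SSSE3 code itself): after the
two 1-D inverse transform passes produce the residual, it is added to
the predictor in place with clamping to the 8-bit pixel range.

    #include <stdint.h>

    static uint8_t clip_pixel(int val) {
      return (val > 255) ? 255 : (val < 0) ? 0 : (uint8_t)val;
    }

    /* Add the inverse-transformed residual to the 8x8 predictor block,
     * clamping each result to 8 bits. */
    static void add_residual_8x8(const int16_t *residual, uint8_t *dest,
                                 int stride) {
      for (int r = 0; r < 8; ++r) {
        for (int c = 0; c < 8; ++c)
          dest[c] = clip_pixel(dest[c] + residual[r * 8 + c]);
        dest += stride;
      }
    }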
Change-Id: I0600feac894d6a443a3c9d18daf34156d4e225c3
Assembly implementation of the SSSE3 8x8 forward 2D-DCT. The current
version is enabled only for x86_64. The average unit runtime
goes from 157 cycles down to 136 cycles, i.e., about 12.8% faster.
This translates into about 1.5% speed-up for pedestrian_area 1080p
at speed 2.
Change-Id: I0f12435857e9425ed7ce12541344dfa16837f4f4
We don't use the declarations from this file. The real declarations
(differently named) are in vp9_rtcd_defs.pl, e.g. vp9_full_search_sad.
Change-Id: I73cbf064305710ba20747233cfdbe67366f069a0
Two functions were optimized for AVX2 by using the full 256-bit register
width to handle 32 elements in parallel instead of only 16:
1. vp9_sad32x32x4d
2. vp9_sad64x64x4d
The function-level gain is 66% and the user-level gain is ~1%.
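A minimal sketch of the 32-wide step (the helper name and accumulator
layout are illustrative, not the library code): VPSADBW computes
absolute byte differences across the full 256-bit register, so one
instruction covers a whole 32-pixel row against each of the four
candidate references.

    #include <immintrin.h>
    #include <stdint.h>

    /* SAD of one 32-byte source row against four candidate rows at once. */
    static void sad32_row_x4(const uint8_t *src,
                             const uint8_t *const ref[4], __m256i acc[4]) {
      const __m256i s = _mm256_loadu_si256((const __m256i *)src);
      for (int i = 0; i < 4; ++i) {
        const __m256i r = _mm256_loadu_si256((const __m256i *)ref[i]);
        /* _mm256_sad_epu8 sums |s - r| over each group of 8 bytes into a
         * 64-bit lane; accumulate the four lanes per reference. */
        acc[i] = _mm256_add_epi64(acc[i], _mm256_sad_epu8(s, r));
      }
    }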
Change-Id: I4efbb3bc7d8bc03b64b6c98f5cd5c4a9dd3212cb
* speed improvement of 30 percent achieved
* multiplies and adds remain the same
* non-arithmetic instructions minimized by hand, by:
  - expanding the 2-pass loop
  - removing irrelevant "shuffles"
  - combining the last two rounding steps (see the sketch below)
* further improvements may be possible
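Combining the two rounding steps is exact, not an approximation: the
second rounding constant can be folded through the first shift. A
scalar sketch (function names are illustrative):

    #include <stdint.h>

    /* One rounding right-shift. */
    static int32_t round_shift(int32_t x, int s) {
      return (x + (1 << (s - 1))) >> s;
    }

    /* Combined form: round_shift(round_shift(x, s1), s2) equals a single
     * shift by (s1 + s2) with a folded bias. */
    static int32_t round_shift2(int32_t x, int s1, int s2) {
      const int32_t bias = (1 << (s1 - 1)) + (1 << (s1 + s2 - 1));
      return (x + bias) >> (s1 + s2);
    }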
Change-Id: Idec2c3f52910c48e6a0e0f9aefed5cae31b0b8c0
Optimizing 2 functions to process 32 elements in parallel instead of 16:
1. vp9_sub_pixel_avg_variance64x64
2. vp9_sub_pixel_avg_variance32x32
Both of these functions were calling vp9_sub_pixel_avg_variance16xh_ssse3;
they now call vp9_sub_pixel_avg_variance32xh_avx2, which is written with
AVX2 and processes 32 elements in parallel.
This optimization gave an 80% function-level gain and a 2% user-level gain.
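A sketch of the 32-wide inner step (the helper name, and the assumption
that the filtered prediction `pred` has already been computed, are
illustrative): average with the second predictor, widen to 16 bits, and
accumulate the sum and sum-of-squares of the differences.

    #include <immintrin.h>
    #include <stdint.h>

    static void avg_variance32_row(const uint8_t *src, const uint8_t *pred,
                                   const uint8_t *second_pred,
                                   __m256i *vsum, __m256i *vsse) {
      const __m256i zero = _mm256_setzero_si256();
      const __m256i p = _mm256_avg_epu8(
          _mm256_loadu_si256((const __m256i *)pred),
          _mm256_loadu_si256((const __m256i *)second_pred));
      const __m256i s = _mm256_loadu_si256((const __m256i *)src);
      /* Widen to 16 bits so the differences can be signed. */
      const __m256i d_lo = _mm256_sub_epi16(_mm256_unpacklo_epi8(s, zero),
                                            _mm256_unpacklo_epi8(p, zero));
      const __m256i d_hi = _mm256_sub_epi16(_mm256_unpackhi_epi8(s, zero),
                                            _mm256_unpackhi_epi8(p, zero));
      /* 16-bit sum lanes work for a limited number of rows; very tall
       * blocks may need periodic widening of the accumulator. */
      *vsum = _mm256_add_epi16(*vsum, _mm256_add_epi16(d_lo, d_hi));
      *vsse = _mm256_add_epi32(*vsse, _mm256_madd_epi16(d_lo, d_lo));
      *vsse = _mm256_add_epi32(*vsse, _mm256_madd_epi16(d_hi, d_hi));
    }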
Change-Id: Iea694654e1b7612dc6ed11e2626208c2179502c8
Optimizing 2 functions to process 32 elements in parallel instead of 16:
1. vp9_sub_pixel_variance64x64
2. vp9_sub_pixel_variance32x32
Both of these functions were calling vp9_sub_pixel_variance16xh_ssse3;
they now call vp9_sub_pixel_variance32xh_avx2, which is written with
AVX2 and processes 32 elements in parallel (the same 32-wide approach
as the avg variant above).
This optimization gave a 70% function-level gain and a 2% user-level gain.
Change-Id: I4f5cb386b346ff6c878a094e1c3b37e418e50bde
A bug was reported in Issue 702: "SIGILL (Illegal instruction) when
transcoding with vp9 - using FFmpeg". It was reproduced and fixed.
Change-Id: Ie32c149a89af02856084aeaf289e848a905c7700
Optimizing the variance functions vp9_variance16x16, vp9_variance32x32,
vp9_variance64x64, vp9_variance32x16, vp9_variance64x32, and
vp9_mse16x16 by migrating them to AVX2.
Some of the functions were optimized by processing 32 elements at a time
instead of 16; others by processing 2 loop strides of 16 elements in a
single 256-bit register (see the sketch below).
This optimization gives between a 2.4% and 2.7% user-level performance
gain and a 42% function-level gain.
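A sketch of the "two strides per register" load used for the 16-wide
cases (helper name illustrative): rows i and i+1 fill the low and high
128-bit halves of one ymm register.

    #include <immintrin.h>
    #include <stdint.h>

    /* Pack two consecutive 16-byte rows into one 256-bit register. */
    static __m256i load_two_rows_16(const uint8_t *p, int stride) {
      const __m128i row0 = _mm_loadu_si128((const __m128i *)p);
      const __m128i row1 = _mm_loadu_si128((const __m128i *)(p + stride));
      return _mm256_inserti128_si256(_mm256_castsi128_si256(row0), row1, 1);
    }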
Change-Id: I265ae08a2b0196057a224a86450153ef3aebd85d
Modifications were made to reduce the total clock cycles.
Speedup: 1.2
Tested with: park_joy_420_720p50.y4m
Change-Id: Ia36b87e62e2f80a5fadaf5628729aedc80f38f3f
The step that sums three input samples could potentially cause the
intermediate result to go beyond the 16-bit limit when operating as the
second 1-D transform. This commit fixes the issue.
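A hypothetical illustration of the hazard (the magnitude 11585 is just
an example of a first-pass intermediate): 3 * 11585 = 34755 > 32767, so
a 16-bit accumulator wraps. Widening before the sum avoids it.

    #include <stdint.h>

    /* Widen to 32 bits before the three-sample sum so values such as
     * 3 * 11585 = 34755 do not wrap an int16_t. */
    static int32_t sum3_wide(int16_t a, int16_t b, int16_t c) {
      return (int32_t)a + b + c;
    }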
Change-Id: Iaf512449ac2d25ddd8a806d760afab362c62a516
This patch fixed the issue reported in "Issue 655: remove textrel's
from 32-bit vp9 encoder". The set of vp9_subpel_variance functions that
used the x86inc.asm ABI didn't build correctly for 32-bit PIC. The fix
had to be done carefully because there were not enough free registers.
After the change, we got
$ eu-findtextrel libvpx.so
eu-findtextrel: no text relocations reported in 'libvpx.so'
Change-Id: I1b176311dedaf48eaee0a1e777588043c97cea82
For consistency with idct function names. Renames:
vp9_short_fdct4x4 -> vp9_fdct4x4
vp9_short_walsh4x4 -> vp9_fwht4x4
Change-Id: Id15497cc1270acca626447d846f0ce9199770f58
Just making fdct consistent with the iht/idct/fht functions, which all
use stride (# of elements) as an input argument.
Change-Id: I0ba3c52513a5fdd194f1e7e2901092671398985b
Just making fdct consistent with the iht/idct/fht functions, which all
use stride (# of elements) as an input argument.
Change-Id: Ibc944952a192e6c7b2b6a869ec2894c01da82ed1
Just making fdct consistent with the iht/idct/fht functions, which all
use stride (# of elements) as an input argument.
Change-Id: I2d95fdcbba96aaa0ed24a80870cb38f53487a97d
Just making fdct consistent with the iht/idct/fht functions, which all
use stride (# of elements) as an input argument.
Change-Id: Id623c5113262655fa50f7c9d6cec9a91fcb20bb4
Simplify k_cvtlo_epi16 and k_cvthi_epi16 to only two instructions
each, then inline them.
Quoting from Intel's MMX_App_Compute_16bit_Vector.pdf:
"The PMADDWD instruction multiplies four
pairs of 16-bit numbers and produces partial sums of the results
and can do so once per clock (with a three-clock latency)."
So I am assuming there will be a three-clock overhead after the
last _mm_madd_pi16 instruction.
Even with that overhead, the total number of clocks should be
smaller. I am not sure, though, because I could not find information
about the number of clocks required for the instructions in
k_cvtlo_epi16 and k_cvthi_epi16. I will run a test and compare the
execution times.
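For reference, a two-instruction 16-to-32-bit sign extension in the
spirit of the change (a sketch using SSE2 analogues of the MMX helpers;
the real k_cvtlo_epi16/k_cvthi_epi16 may differ): interleave the vector
with itself, then arithmetic-shift each 32-bit lane right by 16 so the
high copy supplies the sign bits.

    #include <emmintrin.h>

    static __m128i cvtlo_epi16_epi32(__m128i x) {
      /* Lanes become [x0,x0,x1,x1,...]; >> 16 sign-extends each value. */
      return _mm_srai_epi32(_mm_unpacklo_epi16(x, x), 16);
    }
    static __m128i cvthi_epi16_epi32(__m128i x) {
      return _mm_srai_epi32(_mm_unpackhi_epi16(x, x), 16);
    }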
Change-Id: Ieda4aa338f69ad3dd196ac6e7892da3cf1b47ea7
The 16x16 transform unit test suggested that the peak coefficient
value can reach 32639. This could cause a potential overflow issue
in the SSSE3 implementation of 16x16 block quantization. This commit
fixes the issue by replacing the addition with a saturating addition.
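A sketch of the failure mode and the fix (variable names illustrative):
with coefficients peaking near 32639, adding the quantizer rounding
term with a plain 16-bit add can wrap past 32767, while the saturating
form clamps.

    #include <emmintrin.h>

    static __m128i add_round(__m128i abs_coeff, __m128i round) {
      /* _mm_add_epi16 would wrap (e.g., 32639 + 200 overflows int16_t);
       * _mm_adds_epi16 saturates at 32767 instead. */
      return _mm_adds_epi16(abs_coeff, round);
    }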
Change-Id: I6d5bb7c5faad4a927be53292324bd2728690717e