Added optimization of the 8 bit assembly quantizer routines. This makes
these functions up to 100% faster, depending on encoding parameters.
This patch maskes the encoder faster in both the high bitdepth and 8bit
configurations. In the high bitdepth configuration, it effects profile 0
only.
Based on my profiling using 1080p input the net gain is between 1-3% for
the 8 bit config, and around 2.5-4.5% for the high bitdepth config,
depending on target bitrate. The difference between the 8 bit and high
bitdepth configurations for the same encoder run is reduced by 1% in all
cases I have profiled.
Change-Id: I86714a6b7364da20cd468cd784247009663a5140
When configured with high bit detpth enabled, the 8bit quantize
function stopped using optimised code. This made 8bit content
decode slowly. This commit re-enables the SSSE3 optimisations.
Change-Id: I194b505dd3f4c494e5c5e53e020f5d94534b16b5
When configured with high bit detpth enabled, the 8bit quantize
function stopped using optimised code. This made 8bit content
decode slowly. This commit re-enables the SSE2 optimisation
(but not the SSSE3 optimisation).
Change-Id: Id015fe3c1c44580a4bff3f4bd985170f2806a9d9
When configured with high bitdepth enabled, the 8bit transform
stopped using optimised code. This made 8bit content decode slowly.
Change-Id: I67d91f9b212921d5320f949fc0a0d3f32f90c0ea
This was rewritten and moved to vpx_dsp/x86/vpx_subpixel_8t_ssse3.asm
in 195883023bb39b5ee5c6811a316ab96d9225034d
Change-Id: I117ce983dae12006e302679ba7f175573dd9e874
fixes build on windows x64; previously 'heightq' i.e., the 64-bit register
was accessed when only the 32-bit value was needed. given this is from a
stack variable the upper bits were undefined.
+ bump register/xmm counts; users of SETUP_LOCAL_VARS touch xmm13 in
64-bit builds and filter_block1d16_v* uses one extra temp variable
Change-Id: I9c768c0b2047481d1d3b11c2e16b2f8de6eb0d80
This is based on the original patch optimized for 32bit
platforms by Tamar/Ilya and now uses the x86inc style asm.
The assembly was also modified to support 64bit platforms.
Change-Id: Ice12f249bbbc162a7427e3d23fbf0cbe4135aff2
* changes:
Only build ssse3 filter functions on 64 bit
Clean up unused function warnings in vp8 encoder
Clean up unused function warnings in vp8 onyx_if.c
These were lost in the great sub pixel variance move of
6a82f0d7fb9ee908c389e8d55444bbaed3d54e9c
Not having these functions caused a ~10% performance regression in
some realtime vp8 encodes.
Change-Id: I50658483d9198391806b27899f2c0d309233c4b5
* changes:
Only use .text sections for aout
Use newer x86inc.asm
Use .text instead of .rodata on macho
Copy PIC handling code from x86_abi_support
Set 'private_extern' visibility for macho targets
Avoid 'amdnop' when building with nasm
Catch all elf formats
Expand PIC default to macho64 and respect CONFIG_PIC from libvpx
Use libvpx defines to set name mangling rules
Customize x86inc.asm for libvpx
Rename updated version of x86inc.asm
Use "private_prefix" instead of "program_name" and make vpx the default
prefix.
Change-Id: I4883a99b2aee8e5dc9f2c16a2e6f4b5d6e4de458
and FUN_CONV_2D macros. The predict lut now handles
this case. The encoder now calls vpx_scaled_2d() instead
of vpx_convolve8() for scaling.
Change-Id: Ia1c8af8a31e4cb4887a587143108cb45835f7df7
This reverts commit a5e97d874b16ae5826b68515f1e35ffb44361cf8.
Additionally:
Revert "vpx_convolve_copy_sse2: fix win64"
This reverts commit 22a8474fe7ec30d96f746dc6e4b23771758c071e.
This change performs poorly on various x86_64 devices affecting
performance by 1-3% at 1080P. Performance on chromebook like devices was
mixed neutral to slightly negative, so there should be minimal change
there.
Change-Id: I95831233b4b84ee96369baa192a2d4cc7639658c
This commit clears all the vp9_ prefix use case in vpx_dsp. It gets
the vp9 folder ready to branch out vp10.
Change-Id: I2906eec179ee792b4af8c9b4161313653050e931
This commit clears the function naming convention in vpx_dsp. It
replaces vp9_ prefix of global functions with vpx_ prefix. It also
removes the vp9_ prefix from static functions.
Change-Id: I6394359a63b71a51dda01342eec6a3cc08dfeedf
This commit moves the module inverse transform functions from vp9
to vpx_dsp folder. The hybrid transform wrapper functions stay in
the vp9 folder, since it involves codec-specific data structures.
Change-Id: Ib066367c953d3d024c73ba65157bbd70a95c9ef8
This function suffers from a couple problems in small core(tablets):
-The load of the next iteration is blocked by the store of previous iteration
-4k aliasing (between future store and older loads)
-current small core machine are in-order machine and because of it the store will spin the rehabQ until the load is finished
fixed by:
- prefetching 2 lines ahead
- unroll copy of 2 rows of block
- pre-load all xmm regiters before the loop, final stores after the loop
The function is optimized by:
copy_convolve_sse2 64x64 - 16%
copy_convolve_sse2 32x32 - 52%
copy_convolve_sse2 16x16 - 6%
copy_convolve_sse2 8x8 - 2.5%
copy_convolve_sse2 4x4 - 2.7%
credit goes to Tom Craver(tom.r.craver@intel.com) and Ilya Albrekht(ilya.albrekht@intel.com)
Change-Id: I63d3428799c50b2bf7b5677c8268bacb9fc29671
It in essence refactors the code for both the interpolation
filtering and the convolution. This change includes the moving
of all the files as well as the changing of the code from vp9_
prefix to vpx_ prefix accordingly, for underneath architectures:
(1) x86;
(2) arm/neon; and
(3) mips/msa.
The work on mips/drsp2 will be done in a separate change list.
Change-Id: Ic3ce7fb7f81210db7628b373c73553db68793c46