Use a combination of table lookups and pshufb to convert coefficients
to zero run/level format. Two 16-entry lookup tables are used for a
total of 192 bytes worth of tables. (The existing SSE2 version uses a
table of size 2048 bytes.)
Speedup is ~1.5x-3x compared with the SSE2 version on Haswell (the
speedup is greater for input with many trailing zeros).
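For orientation, the conversion being accelerated can be sketched in
scalar C. This is a hypothetical reference, not the routine from this
change: the name run_level_scalar, its parameters, and the convention
of scanning from the last index downward are all assumptions made for
illustration.

```c
#include <stdint.h>

/* Hypothetical scalar reference (names and layout are assumptions).
 * Walks the coefficients from last_index down to 0 and emits each
 * nonzero level together with the number of zeros that immediately
 * precede it in scan order. Returns the number of nonzero levels. */
static int run_level_scalar(const int16_t *coeff, int last_index,
                            int16_t *level, uint8_t *run)
{
    int total = 0;
    int i = last_index;

    /* Trailing zeros carry no level; skip them first. */
    while (i >= 0 && coeff[i] == 0)
        --i;

    while (i >= 0) {
        int zeros = 0;

        level[total] = coeff[i--];

        /* Count the zero run below this level. */
        while (i >= 0 && coeff[i] == 0) {
            ++zeros;
            --i;
        }
        run[total++] = (uint8_t)zeros;
    }
    return total;
}
```

The data-dependent inner loops are what a table-driven shuffle would
avoid. The early exit on trailing zeros is also consistent with the
larger speedup reported for inputs that end in many zeros.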
The use of popcnt raises the requirement to SSE4.2. It could be
replaced with a small LUT and accumulation, which would lower the
requirement to SSSE3.
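A minimal scalar sketch of the LUT-and-accumulate alternative, assuming
the value being counted is a 16-bit mask. The names kNibbleBits and
popcount16_lut are hypothetical. A vector version would presumably do
the per-nibble lookup with pshufb, but that is not shown here.

```c
#include <stdint.h>

/* Number of set bits in each 4-bit value 0..15. */
static const uint8_t kNibbleBits[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Hypothetical helper: population count of a 16-bit mask using one
 * table lookup per nibble and accumulating the partial counts. */
static unsigned popcount16_lut(uint16_t mask)
{
    unsigned count = 0;

    for (int shift = 0; shift < 16; shift += 4)
        count += kNibbleBits[(mask >> shift) & 0xF];
    return count;
}
```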
WelsQuantFour4x4Max_avx2 (~2.06x speedup over SSE2)
WelsQuantFour4x4_avx2 (~2.32x speedup over SSE2)
WelsQuant4x4Dc_avx2 (~1.49x speedup over SSE2)
WelsQuant4x4_avx2 (~1.42x speedup over SSE2)
Move asm routines to common. Delete obsolete decoder routines.
Use wider routines where applicable.
~1.07x faster decode overall on a quick 720p30 4Mbps test on Haswell.
We do four blocks at a time when possible, but need to handle
single blocks at a time for intra prediction.
~3.15x speedup over MMX for the DCT on Haswell.
~2.94x speedup over MMX for the IDCT on Haswell.
Returns diminish with increasing vector length because a larger
proportion of the time is spent on load/store/shuffling.
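As a reference point for what one block involves, here is a scalar
sketch that assumes the DCT in question is the standard H.264 4x4
forward integer transform applied to a residual block. The function
names, the row-major layout, and the already-computed residual input
are assumptions for illustration; the assembly routines may take
different arguments and order the output differently.

```c
#include <stdint.h>

/* One 4-point butterfly of the H.264 forward core transform. */
static void butterfly4(int a, int b, int c, int d, int out[4])
{
    int s0 = a + d;
    int s1 = b + c;
    int s2 = b - c;
    int s3 = a - d;

    out[0] = s0 + s1;
    out[1] = 2 * s3 + s2;
    out[2] = s0 - s1;
    out[3] = s3 - 2 * s2;
}

/* Hypothetical scalar reference: horizontal pass over each row, then
 * vertical pass over each column. in and out are row-major 4x4. */
static void forward_transform_4x4(const int16_t in[16], int16_t out[16])
{
    int tmp[16];
    int v[4];

    for (int r = 0; r < 4; ++r) {
        butterfly4(in[4 * r], in[4 * r + 1], in[4 * r + 2],
                   in[4 * r + 3], v);
        for (int k = 0; k < 4; ++k)
            tmp[4 * r + k] = v[k];
    }
    for (int c = 0; c < 4; ++c) {
        butterfly4(tmp[c], tmp[4 + c], tmp[8 + c], tmp[12 + c], v);
        for (int k = 0; k < 4; ++k)
            out[4 * k + c] = (int16_t)v[k];
    }
}
```

Each block needs only a handful of additions and shifts per pass, so
the loads, stores, and shuffles around the arithmetic account for a
large share of the time, which fits the diminishing returns noted
above.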
We do four blocks at a time when possible, but need to handle
single blocks at a time for intra prediction.
~2.31x speedup over MMX for the DCT on Haswell.
~1.92x speedup over MMX for the IDCT on Haswell.
Do some shuffling in load/store unpack/pack to save some
work in horizontal DCTs.
Use a few 128-bit broadcasts to compact data vectors a bit.
~1.04x speedup for the DCT case on Haswell.
~1.12x speedup for the IDCT case on Haswell.
Use a combination of instruction types that distributes more
evenly across execution ports on common architectures.
Do the horizontal IDCT without transposing back and forth.
Minor tweaks.
~1.14x faster on Haswell. Should be faster on other architectures
as well.
Use a combination of instruction types that distributes more
evenly across execution ports on common architectures.
Do the horizontal DCT without transposing back and forth.
Minor tweaks.
~1.54x faster on Haswell. Should be faster on other architectures
as well.
This makes them consistent with the rest of the assembly source
files. Prior to f2314151e8, all the assembly files had consistent
indentation, but since then this file has diverged from the rest.
Previously the assembly sources had mixed indentation consisting
of both spaces and tabs, making it quite hard to read unless
the right tab size was used in the editor.
Tabs have been interpreted as 4 spaces in most cases, matching
the surrounding code.
Add asm level functions
Add asm code for ME
Modify format
Add unit test for asm code.
Modify function name and format.
Remove unused comment
Modify targets file
Add macro guard for SSE41 function test
Modify according to review request.