Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 4.
Fall back to a generic approach for ratios > 4.
The use of blendps makes this require SSE4.1. The pshufb path can be
backported to SSSE3 and the generic path to SSE2 for a minor reduction
in performance by replacing blendps and preceding instructions with an
equivalent sequence.
The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.
Speedup is ~5.32x/~4.25x (32-bit/64-bit) for horizontal ratios <= 2,
~5.06x/~3.97x for ratios within (2, 4], and ~3.93x/~3.13x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 4.
Fall back to a generic approach for ratios > 4. Note that the generic
approach can be backported to SSE2.
The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.
Speedup is ~6.67x/~3.26x (32-bit/64-bit) for horizontal ratios <= 2,
~6.24x/~3.00x for ratios within (2, 4], and ~4.89x/~2.17x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
Use a combination of table lookups and pshufb to convert coefficients
to zero run/level format. Two 16-entry lookup tables are used for a
total of 192 bytes worth of tables. (The existing SSE2 version uses a
table of size 2048 bytes.)
Speedup is ~1.5x-3x as compared with the SSE2 version on Haswell (the
speedup is greater for input with many trailing zeros).
The use of popcnt makes it require SSE4.2. This can be replaced with
a small LUT and accumulation which would reduce the requirement to
SSSE3.
WelsQuantFour4x4Max_avx2 (~2.06x speedup over SSE2)
WelsQuantFour4x4_avx2 (~2.32x speedup over SSE2)
WelsQuant4x4Dc_avx2 (~1.49x speedup over SSE2)
WelsQuant4x4_avx2 (~1.42x speedup over SSE2)