Average vertically before horizontally; horizontal averaging is more
worksome. Doing the vertical averaging first reduces the number of
horizontal averages by half.
Use pmaddubsw and pavgw to do the horizontal averaging for a slight
performance improvement.
Minor tweaks.
Improve the SSSE3 dyadic downsample routines and drop the SSE4 routines.
The non-temporal loads used in the SSE4 routines do nothing for cache-
backed memory AFAIK.
Adjust tests because averaging vertically first gives slightly different
output.
~2.39x speedup for the widthx32 routine on Haswell when not memory-bound.
~2.20x speedup for the widthx16 routine on Haswell when not memory-bound.
Note that the widthx16 routine can be unrolled for further speedup.
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 4.
Fall back to a generic approach for ratios > 4. Note that the generic
approach can be backported to SSE2.
The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.
Speedup is ~6.67x/~3.26x (32-bit/64-bit) for horizontal ratios <= 2,
~6.24x/~3.00x for ratios within (2, 4], and ~4.89x/~2.17x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
WelsQuantFour4x4Max_avx2 (~2.06x speedup over SSE2)
WelsQuantFour4x4_avx2 (~2.32x speedup over SSE2)
WelsQuant4x4Dc_avx2 (~1.49x speedup over SSE2)
WelsQuant4x4_avx2 (~1.42x speedup over SSE2)
WelsSampleSatd16x16_avx2 (~2.31x speedup over SSE4.1 on Haswell).
WelsSampleSatd16x8_avx2 (~2.19x speedup over SSE4.1 on Haswell).
WelsSampleSatd8x16_avx2 (~1.68x speedup over SSE4.1 on Haswell).
WelsSampleSatd8x8_avx2 (~1.53x speedup over SSE4.1 on Haswell).
Previously the assembly sources had mixed indentation consisting
of both spaces and tabs, making it quite hard to read unless
the right tab size was used in the editor.
Tabs have been interpreted as 4 spaces in most cases, matching
the surrounding code.