Sindre Aamås e490215990 [Processing/x86] Add an AVX2 implementation of GeneralBilinearAccurateDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 8. Because
pshufb does not cross 128-bit lanes, the overhead of address
calculations and loads is relatively greater as compared with an
SSSE3/SSE4.1 implementation.

Fall back to a generic approach for ratios > 8.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.

Speedup is ~8.52x/~6.89x (32-bit/64-bit) for horizontal ratios <= 2,
~7.81x/~6.13x for ratios within (2, 4], ~5.81x/~4.52x for ratios
within (4, 8], and ~5.06x/~4.09x for ratios > 8 when not memory-bound
on Haswell as compared with the current SSE2 implementation.
2016-05-23 20:23:47 +02:00
..