e490215990
Keep track of relative pixel offsets and use pshufb to efficiently extract the relevant pixels for horizontal scaling ratios <= 8. Because pshufb does not cross 128-bit lanes, the overhead of address calculations and loads is relatively greater than in an SSSE3/SSE4.1 implementation. Fall back to a generic approach for ratios > 8.

The implementation assumes that data beyond the end of each line, before the next line begins, can be dirtied, which AFAICT is safe with the current usage of these routines.

Speedup is ~8.52x/~6.89x (32-bit/64-bit) for horizontal ratios <= 2, ~7.81x/~6.13x for ratios within (2, 4], ~5.81x/~4.52x for ratios within (4, 8], and ~5.06x/~4.09x for ratios > 8, when not memory-bound on Haswell, as compared with the current SSE2 implementation.
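
To illustrate the pshufb-based extraction described above, here is a minimal, portable sketch. It is not the code from this commit. `pshufb_lane` is a scalar model of what pshufb does within one 128-bit lane, and `build_mask` is a hypothetical helper that turns a scaling ratio (given as `num/den`) into relative byte offsets. It assumes 8-bit single-channel pixels and simple nearest-sample selection, which may differ from the filter the actual routines implement.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 128-bit lane of pshufb:
 * out[i] = src[mask[i] & 15], or 0 when bit 7 of mask[i] is set.
 * The index can never reach outside the 16-byte lane. */
static void pshufb_lane(uint8_t out[16], const uint8_t src[16],
                        const uint8_t mask[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
}

/* Hypothetical helper (an assumption, not from the commit): build a mask
 * that selects the source byte at relative offset floor(i * num / den)
 * for output pixel i. Offsets that fall outside the lane get 0x80, which
 * makes pshufb write zero there. Returns how many output pixels this
 * lane can supply. */
static int build_mask(uint8_t mask[16], unsigned num, unsigned den)
{
    /* Only ratios in [1, 8] are handled here; larger ones would take
     * the generic fallback path. */
    assert(den > 0 && num >= den && num <= 8 * den);

    int count = 0;
    for (int i = 0; i < 16; i++) {
        unsigned off = (unsigned)i * num / den;
        if (off < 16) {
            mask[i] = (uint8_t)off;
            count = i + 1;
        } else {
            mask[i] = 0x80;
        }
    }
    return count;
}
```

In this model a ratio of 2 yields 8 usable output pixels per 16-byte lane, while a ratio of 8 yields only 2. That is one way to see why the per-lane overhead of address calculations and loads grows with the ratio, and why a cutoff is needed.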