Sindre Aamås 8a0af4a3f2 [Processing/x86] DyadicBilinearDownsample optimizations
Average vertically before horizontally; horizontal averaging is more
worksome. Doing the vertical averaging first reduces the number of
horizontal averages by half.

Use pmaddubsw and pavgw to do the horizontal averaging for a slight
performance improvement.

Minor tweaks.

Improve the SSSE3 dyadic downsample routines and drop the SSE4 routines.
The non-temporal loads used in the SSE4 routines do nothing for cache-
backed memory AFAIK.

Adjust tests because averaging vertically first gives slightly different
output.

~2.39x speedup for the widthx32 routine on Haswell when not memory-bound.
~2.20x speedup for the widthx16 routine on Haswell when not memory-bound.

Note that the widthx16 routine can be unrolled for further speedup.
2016-06-02 13:44:28 +02:00
..
2014-07-25 14:41:06 +08:00
2015-11-13 17:16:26 +08:00
2014-03-24 22:14:43 -07:00
2015-04-17 14:10:12 +08:00
2016-03-24 10:01:48 +02:00