Average vertically before horizontally; horizontal averaging is more
worksome. Doing the vertical averaging first reduces the number of
horizontal averages by half.
Use pmaddubsw and pavgw to do the horizontal averaging for a slight
performance improvement.
Minor tweaks.
Improve the SSSE3 dyadic downsample routines and drop the SSE4 routines.
The non-temporal loads used in the SSE4 routines do nothing for cache-
backed memory AFAIK.
Adjust tests because averaging vertically first gives slightly different
output.
~2.39x speedup for the widthx32 routine on Haswell when not memory-bound.
~2.20x speedup for the widthx16 routine on Haswell when not memory-bound.
Note that the widthx16 routine can be unrolled for further speedup.