8a0af4a3f2
Average vertically before horizontally; horizontal averaging is more work. Doing the vertical averaging first reduces the number of horizontal averages by half. Use pmaddubsw and pavgw to do the horizontal averaging for a slight performance improvement. Minor tweaks. Improve the SSSE3 dyadic downsample routines and drop the SSE4 routines. The non-temporal loads used in the SSE4 routines do nothing for cache-backed memory AFAIK. Adjust tests because averaging vertically first gives slightly different output. ~2.39x speedup for the widthx32 routine on Haswell when not memory-bound. ~2.20x speedup for the widthx16 routine on Haswell when not memory-bound. Note that the widthx16 routine can be unrolled for further speedup. |
||
---|---|---|
BaseDecoderTest.cpp | ||
BaseEncoderTest.cpp | ||
c_interface_test.c | ||
cpp_interface_test.cpp | ||
DataGenerator.cpp | ||
decode_api_test.cpp | ||
decode_encode_test.cpp | ||
decoder_ec_test.cpp | ||
decoder_test.cpp | ||
encode_decode_api_test.cpp | ||
encode_decode_api_test.h | ||
encode_decode_api_test.template | ||
encode_options_test.cpp | ||
encoder_test.cpp | ||
ltr_test.cpp | ||
sha1.c | ||
simple_test.cpp | ||
targets.mk |
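
The commit message above says that averaging vertically first changes the output slightly, which is why the tests were adjusted. The following scalar C sketch illustrates why the order matters. It is not the project's code: the function names are hypothetical, and it assumes each step is a rounding average `(a + b + 1) >> 1`, which is the rounding that `pavgb` and `pavgw` perform. In a SIMD version, averaging two source rows vertically leaves a single row to reduce horizontally, instead of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rounding average of two bytes: (a + b + 1) >> 1, as pavgb/pavgw compute. */
static uint8_t avg_round(uint8_t a, uint8_t b) {
    return (uint8_t)((a + b + 1) >> 1);
}

/* Hypothetical helper: produce dst_w output pixels from two source rows,
 * averaging each column pair vertically first, then the two results
 * horizontally. Each source row must hold at least 2 * dst_w bytes. */
static void downsample_row_vertical_first(uint8_t *dst, const uint8_t *top,
                                          const uint8_t *bottom, size_t dst_w) {
    assert(dst != NULL && top != NULL && bottom != NULL);
    for (size_t x = 0; x < dst_w; ++x) {
        uint8_t left  = avg_round(top[2 * x],     bottom[2 * x]);
        uint8_t right = avg_round(top[2 * x + 1], bottom[2 * x + 1]);
        dst[x] = avg_round(left, right);
    }
}

/* Hypothetical helper: same 2x2 reduction, but averaging horizontally
 * within each row first, then combining the two rows vertically. */
static void downsample_row_horizontal_first(uint8_t *dst, const uint8_t *top,
                                            const uint8_t *bottom, size_t dst_w) {
    assert(dst != NULL && top != NULL && bottom != NULL);
    for (size_t x = 0; x < dst_w; ++x) {
        uint8_t upper = avg_round(top[2 * x],    top[2 * x + 1]);
        uint8_t lower = avg_round(bottom[2 * x], bottom[2 * x + 1]);
        dst[x] = avg_round(upper, lower);
    }
}
```

For the 2x2 block with top row `{0, 1}` and bottom row `{2, 1}`, the vertical-first helper yields 1 and the horizontal-first helper yields 2, because the intermediate rounding happens on different pairs. Differences of this kind are at most one code value per pixel, but they are enough to change bit-exact expected outputs.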