Sindre Aamås 991e344d8c [Encoder] SSE2 4x4 DCT optimizations
Use a combination of instruction types that distributes more
evenly across execution ports on common architectures.

Do the horizontal DCT without transposing back and forth.

Minor tweaks.

~1.54x faster on Haswell. Should be faster on other architectures
as well.
2016-01-19 13:12:28 +01:00
..
2015-12-30 13:41:29 +08:00