Use a mix of instruction types that spreads the work more evenly
across execution ports on common architectures.
Do the horizontal IDCT without transposing back and forth.
Minor tweaks.
~1.14x faster on Haswell. Other architectures should see a
speedup as well.
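
For reviewers, a minimal scalar sketch of why the horizontal pass
does not need the transposes. This is not the SIMD code from this
change: the 4x4 block size, the Walsh-Hadamard butterfly standing in
for the real 1-D IDCT, and every function name below are
assumptions made for illustration only. It shows that running the
1-D kernel directly along rows gives the same result as
transpose, column pass, transpose back.

```c
#include <assert.h>
#include <string.h>

#define N 4

/* Hypothetical stand-in for a 1-D IDCT: a 4-point Walsh-Hadamard
 * butterfly. Elements are accessed with a stride so the same kernel
 * can run down a column (stride N) or along a row (stride 1).
 * All inputs are read before any output is written, so in == out
 * is safe. */
static void transform_1d(const int *in, int *out, int stride)
{
    int a = in[0 * stride] + in[2 * stride];
    int b = in[0 * stride] - in[2 * stride];
    int c = in[1 * stride] + in[3 * stride];
    int d = in[1 * stride] - in[3 * stride];

    out[0 * stride] = a + c;
    out[1 * stride] = b + d;
    out[2 * stride] = b - d;
    out[3 * stride] = a - c;
}

static void transpose(int src[N][N], int dst[N][N])
{
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            dst[c][r] = src[r][c];
}

/* Column pass: apply the 1-D kernel down each column. */
static void vertical_pass(int blk[N][N])
{
    for (int c = 0; c < N; c++)
        transform_1d(&blk[0][c], &blk[0][c], N);
}

/* Horizontal pass done the roundabout way:
 * transpose, run the column pass, transpose back. */
static void horizontal_pass_transposed(int blk[N][N])
{
    int tmp[N][N];

    transpose(blk, tmp);
    vertical_pass(tmp);
    transpose(tmp, blk);
}

/* Horizontal pass done directly along each row, no transposes. */
static void horizontal_pass_direct(int blk[N][N])
{
    for (int r = 0; r < N; r++)
        transform_1d(&blk[r][0], &blk[r][0], 1);
}

/* Returns 1 if both horizontal passes produce identical blocks. */
int horizontal_matches_transposed(void)
{
    int a[N][N], b[N][N];

    /* Arbitrary test data. */
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            a[r][c] = r * 7 + c * c - 3;
    memcpy(b, a, sizeof a);

    horizontal_pass_transposed(a);
    horizontal_pass_direct(b);

    return memcmp(a, b, sizeof a) == 0;
}
```

In vector code the direct form needs a different instruction
pattern than the column pass, since one row's elements sit in the
lanes of a single register. The sketch only demonstrates that the
two forms are numerically equivalent, not how the vectorized
version is written.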