We process four blocks at a time when possible, but intra prediction
still requires handling one block at a time.
~3.15x speedup over MMX for the DCT on Haswell.
~2.94x speedup over MMX for the IDCT on Haswell.
Returns diminish with increasing vector length because a larger
proportion of the time is spent on load/store/shuffling.
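
A rough way to see the diminishing returns is a toy cost model. It is a sketch under stated assumptions, not a measurement. It assumes the load/store/shuffle cost stays fixed while the arithmetic cost scales inversely with vector width. The names `model_speedup` and `fixed_share` and the cost units are hypothetical, and the Haswell figures above do not come from this model.

```c
#include <stdio.h>

/*
 * Toy cost model (assumption, not measured data):
 *   fixed_cost   - load/store/shuffle time that does not shrink with wider vectors
 *   compute_cost - arithmetic time at the baseline vector width
 *   width_ratio  - new vector width divided by baseline width (e.g. 2.0, 4.0)
 * Arithmetic time is assumed to scale as 1/width_ratio.
 */
double model_speedup(double fixed_cost, double compute_cost, double width_ratio)
{
    double baseline = fixed_cost + compute_cost;
    double widened  = fixed_cost + compute_cost / width_ratio;
    return baseline / widened;
}

/* Fraction of the widened run time spent on the fixed (non-arithmetic) part. */
double fixed_share(double fixed_cost, double compute_cost, double width_ratio)
{
    double widened = fixed_cost + compute_cost / width_ratio;
    return fixed_cost / widened;
}

/* Print the modeled speedup and fixed-cost share for a few width ratios. */
void print_model(double fixed_cost, double compute_cost)
{
    const double ratios[] = { 1.0, 2.0, 4.0, 8.0 };
    for (int i = 0; i < 4; i++) {
        printf("width x%.0f: speedup %.2f, fixed share %.2f\n",
               ratios[i],
               model_speedup(fixed_cost, compute_cost, ratios[i]),
               fixed_share(fixed_cost, compute_cost, ratios[i]));
    }
}
```

Calling `print_model(1.0, 4.0)` with these made-up costs shows the pattern: each doubling of width adds less speedup than the last, while the fixed share of the run time keeps growing.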