7486de2844
Shuffle data during the load/store unpack/pack steps to save some work in the horizontal DCTs. Use a few 128-bit broadcasts to compact the data vectors a bit. On Haswell: ~1.04x speedup for the DCT case, ~1.12x speedup for the IDCT case.