
Use packed 8-bit operations rather than unpack to 16-bit. Minimize spills. ~2.31x speedup on Haswell (x86-64). ~2.40x speedup on Haswell (x86 32-bit).
Use packed 8-bit operations rather than unpack to 16-bit. Minimize spills. ~2.31x speedup on Haswell (x86-64). ~2.40x speedup on Haswell (x86 32-bit).