e96a7b5c92
Use packed 8-bit operations rather than unpacking to 16-bit. Avoid register spills. ~2.07x speedup on Haswell (x86-64). ~2.12x speedup on Haswell (32-bit x86).
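
For illustration only: the message does not name the kernel that was changed, so the sketch below uses a hypothetical rounding average of unsigned bytes to show the general difference between widening to 16-bit lanes and staying in packed 8-bit lanes. The function names (avg_u8_widen, avg_u8_packed, kernels_agree) are assumptions and are not taken from the commit. The SSE2 paths are guarded so the code falls back to scalar loops on other targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical kernel: dst[i] = (a[i] + b[i] + 1) >> 1 on unsigned bytes. */

/* Widening variant: unpack to 16-bit lanes, do the arithmetic, pack back. */
void avg_u8_widen(const uint8_t *a, const uint8_t *b, uint8_t *dst, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    const __m128i one = _mm_set1_epi16(1);
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* Zero-extend each half to 16 bits and add. */
        __m128i lo = _mm_add_epi16(_mm_unpacklo_epi8(va, zero),
                                   _mm_unpacklo_epi8(vb, zero));
        __m128i hi = _mm_add_epi16(_mm_unpackhi_epi8(va, zero),
                                   _mm_unpackhi_epi8(vb, zero));
        /* Round and halve, then narrow back to bytes. */
        lo = _mm_srli_epi16(_mm_add_epi16(lo, one), 1);
        hi = _mm_srli_epi16(_mm_add_epi16(hi, one), 1);
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packus_epi16(lo, hi));
    }
#endif
    for (; i < n; i++) /* scalar tail, or whole loop without SSE2 */
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
}

/* Packed variant: one 8-bit instruction handles all 16 lanes at once. */
void avg_u8_packed(const uint8_t *a, const uint8_t *b, uint8_t *dst, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_avg_epu8(va, vb));
    }
#endif
    for (; i < n; i++) /* scalar tail, or whole loop without SSE2 */
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
}

/* Returns 1 if both variants agree for every pair of byte values. */
int kernels_agree(void)
{
    enum { N = 259 }; /* not a multiple of 16, so the tail is exercised */
    uint8_t a[N], b[N], d1[N], d2[N];
    for (int x = 0; x < 256; x++) {
        for (int i = 0; i < N; i++) {
            a[i] = (uint8_t)x;
            b[i] = (uint8_t)(i & 0xff);
        }
        avg_u8_widen(a, b, d1, N);
        avg_u8_packed(a, b, d2, N);
        if (memcmp(d1, d2, N) != 0)
            return 0;
    }
    return 1;
}
```

The packed variant needs no zero and rounding constants and no intermediate 16-bit halves, so it keeps fewer values live at once, which is the kind of change that can reduce register pressure. Whether it yields speedups like those quoted above depends on the actual kernel.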