fc16010583
Use packed 8-bit operations rather than unpack to 16-bit. Avoid spills. ~2.68x speedup on Haswell (x86-64). ~2.38x speedup on Haswell (x86 32-bit).