This is necessary to allow refactoring some x86util macros with cpuflags.
13% faster on penryn, 16% on sandybridge, 15% on bulldozer.
Not simd; a compiler should have generated this, but gcc didn't.