About 30% faster on a 32-bit Atom, 120% faster on a 64-bit Phenom II.
This is interesting because supporting P16 is easier in e.g.
OpenGL (any 2-component 8-bit texture format can be misused for
it), whereas supporting p9/p10 without conversion needs a texture
format with at least 14 bits of actual precision.
The shiftonly == 0 case is not optimized since the code is more
complex and the speed gain less obvious.
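
For illustration only, a minimal sketch of the difference between
the two cases. It assumes that shiftonly != 0 means a plain left
shift and that shiftonly == 0 means filling the low bits by bit
replication so that the maximum input maps to 0xFFFF. The function
names are hypothetical and are not the code touched by this patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: widen src_depth-bit samples to 16 bits by
 * shifting only. The low bits of each output sample stay zero. */
static void widen_shift_only(uint16_t *dst, const uint16_t *src,
                             size_t n, int src_depth)
{
    const int shift = 16 - src_depth;
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint16_t)(src[i] << shift);
}

/* Hypothetical helper: widen by shifting and then copying the top
 * bits of the sample into the freed low bits (bit replication).
 * Assumes 8 <= src_depth <= 15 so that both shift counts are valid. */
static void widen_replicate(uint16_t *dst, const uint16_t *src,
                            size_t n, int src_depth)
{
    const int up   = 16 - src_depth;
    const int down = 2 * src_depth - 16;
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint16_t)((src[i] << up) | (src[i] >> down));
}
```

The shift-only loop has one operation per sample, which is why it
is the simpler of the two to optimize.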
Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>