Ported from ARMv7 NEON.
RV40 and VC-1 use almost the same algorithm, so optimizations for those two decoders were easy to add and are included.