Adapt commit 982b596ea6 for the arm and aarch64 NEON asm. 5-10% faster on Cortex-A9.
RV40 and VC-1 use almost the same algorithm, so the corresponding optimizations for those two decoders were easy to do and are included.