Adapt commit 982b596ea6640bfe218a31f6c3fc542d9fe61c31 for the arm and aarch64 NEON asm. 5-10% faster on Cortex-A9.
Since RV40 and VC-1 use almost the same algorithm, the same optimizations are easy to apply to those two decoders and are included.