Continue processing sets of 16 values. Plenty of improvement for 4x8
(doubles the speed) but only about 30% for 4x4.
BUG=webm:1422
Change-Id: Ib8dd96f75d474f0348800271d11e58356b620905
Read in a Q register. Works on blocks of 16 and larger.
Improvement of about 20% for 64x64. The smaller blocks are faster, but
don't have quite the same level of improvement. 16x32 is only about 5%
BUG=webm:1422
Change-Id: Ie11a877c7b839e66690a48117a46657b2ac82d4b
When the width is equal to 8, process two rows at a time. This doubles
the speed of 8x4 and improves 8x8 by about 20%.
8x16 was using this technique already, but still improved a little bit
with the rewrite.
Also use this for vpx_get8x8var_neon
BUG=webm:1422
Change-Id: Id602909afcec683665536d11298b7387ac0a1207
Some of the mixed sizes were missing. They can be implemented trivially
using the existing helper function.
When comparing the previous 16x8 and 8x16 implementations, the helper
function is about 10% faster than the 16x8 version. The 8x16 is very
close, but the existing version appears to be faster.
BUG=webm:1422
Change-Id: Ib0e856083c1893e1bd399373c5fbcd6271a7f004
removes some unnecessary casts and adds a few explicit uint32 ones for
larger sizes to quiet -Wshorten-64-to-32 warnings
Change-Id: I63c5fce8e62c426d5cf5c10a66a113c119a43518