openh264/codec/common
Martin Storsjö 23f57adaea Do full register loads instead of single-lane loads in DeblockLumaEq4H_neon
Instead of loading the registers one lane at a time, load full
registers and then transpose them.

This is faster, reducing the runtime of the function from about
506 cycles to 434 cycles (tested on a Cortex-A8).
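
The difference between the two load strategies can be sketched in plain C. This is an illustration only, not the NEON code from this commit: the 8x8 block size, the stride of 16, and all function names are assumptions chosen for the example. The first function mimics lane-at-a-time loads, where each row contributes one byte to every column register. The second mimics full-register row loads followed by a transpose. Both should end up with the same column-major layout.

```c
#include <stdint.h>
#include <string.h>

enum { N = 8, STRIDE = 16 };  /* hypothetical block size and picture stride */

/* Lane-at-a-time: for each row, write one byte into lane `row`
 * of each of the N column "registers". */
static void load_columns_by_lane(const uint8_t *src, int stride,
                                 uint8_t cols[N][N]) {
    for (int row = 0; row < N; row++)
        for (int c = 0; c < N; c++)
            cols[c][row] = src[row * stride + c];
}

/* Full loads: read each row in one go, then transpose the block
 * so that cols[c] holds column c. */
static void load_rows_then_transpose(const uint8_t *src, int stride,
                                     uint8_t cols[N][N]) {
    uint8_t rows[N][N];
    for (int row = 0; row < N; row++)
        memcpy(rows[row], src + row * stride, N);  /* one full load per row */
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            cols[c][r] = rows[r][c];               /* transpose step */
}

/* Returns 1 if both strategies produce identical column data. */
static int layouts_match(void) {
    uint8_t pic[N * STRIDE];
    uint8_t a[N][N], b[N][N];
    for (int i = 0; i < N * STRIDE; i++)
        pic[i] = (uint8_t)(i * 7 + 3);             /* arbitrary test pattern */
    load_columns_by_lane(pic, STRIDE, a);
    load_rows_then_transpose(pic, STRIDE, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

In scalar C the two variants cost about the same. The speedup described above comes from the NEON hardware, where a full-register load followed by transpose instructions replaces many single-lane loads.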

This also avoids an issue that looks like a CPU bug, present
on the Sony Xperia T (cpu implementer 0x51 architecture 7 variant 0x1
part 0x04d). On such a device, it seemed like the "vswp q9, q10"
could start executing before the preceding
vld4.u8 {d20[x],d21[x],d22[x],d23[x]}, [r3], r1
instructions had finished and written back their results. Changing the
"vswp q9, q10" into "vswp q10, q9", or into separate
"vswp d18, d20; vswp d19, d21" (or the other way around) seemed to
avoid the issue. The failure happened occasionally (a couple of times per
100000 invocations or so).
2014-04-28 10:12:16 +03:00
..
arm Do full register loads instead of single-lane loads in DeblockLumaEq4H_neon 2014-04-28 10:12:16 +03:00
arm64 Add macros for the non-standard mov.16b/mov.8b/ext.16b/ext.8b 2014-04-23 11:47:12 +03:00
inc Remove .orig files left over from running astyle 2014-04-23 09:24:23 +03:00
src Make Wels*Snprintf return values be non-negative 2014-04-21 22:03:20 +03:00
x86 reorganize common to inc/src/x86/arm 2014-03-18 19:41:32 -07:00
targets.mk Regenerate makefiles to include the new arm64 assembly files 2014-04-23 11:44:47 +03:00