openh264

Author	SHA1	Message	Date
Martin Storsjö	57f6bcc4b0	Convert all tabs to spaces in assembly sources, unify indentation. Previously the assembly sources had mixed indentation consisting of both spaces and tabs, making it quite hard to read unless the right tab size was used in the editor. Tabs have been interpreted as 4 spaces in most cases, matching the surrounding code.	2014-06-01 01:35:43 +03:00
Martin Storsjö	ac03b8b503	Avoid unnecessary tabs in macro declarations	2014-06-01 01:13:01 +03:00
Martin Storsjö	932a38abc0	Reformat the copyright header of deblocking_neon.S. This makes it identical to the ones in the other files.	2014-05-31 13:44:21 +03:00
dongzhang	218adc7e29	Fix a bug in deblocking for neon 32 bit arm implementation	2014-05-09 14:06:16 +08:00
Martin Storsjö	23f57adaea	Do full register loads instead of single-lane loads in DeblockLumaEq4H_neon. Instead of loading the registers one lane at a time, load full registers and then transpose them. This is faster, reducing the runtime for the function from about 506 cycles to 434 cycles (tested on a Cortex A8). This also avoids an issue which seems like a CPU bug, present on Sony Xperia T (cpu implementer 0x51 architecture 7 variant 0x1 part 0x04d). On such a device, it seemed like the "vswp q9, q10" could start executing before the previous vld4.u8 {d20[x],d21[x],d22[x],d23[x]}, [r3], r1 had finished and written back their result. Changing the "vswp q9, q10" into "vswp q10, q9", or into separate "vswp d18, d20; vswp d19, d21" (or the other way around) seemed to avoid the issue. This happened occasionally (a couple times per 100000 invocations or so).	2014-04-28 10:12:16 +03:00
Licai Guo	e39de8d404	reorganize common to inc/src/x86/arm	2014-03-18 19:41:32 -07:00

6 Commits