openh264

Author	SHA1	Message	Date
Martin Storsjö	0995390c4a	Remove apple specific versions of arm macros with arguments The apple assembler for arm can handle the gnu binutils style macros just fine these days, so there is no need to duplicate all of these macros in two syntaxes, when the new one works fine in all cases. We already require a new enough assembler to support the gnu binutils style features since we use the .rept directive in a few places.	2015-03-27 11:11:45 +02:00
Martin Storsjö	0b0884874d	Remove superfluous .text directives at the start of arm assembly files This directive can be set by the common include header that is included by all files anyway.	2015-03-27 10:46:34 +02:00
zhiliang wang	01b74ea7c1	Add asm code for NoneZeroCount and refine related code	2015-01-04 16:39:17 +08:00
Martin Storsjö	38d2d64ede	Explicitly add .syntax unified when building for iOS This is the default when building with the clang built-in assembler, but not if using the external assembler - thus always specify it, for clarity. Also use the three-operand form of a sub instruction in BS_NZC_CHECK. The same is already done in the gnu version of the macro. This fixes building most of the arm assembly with Apple's external assembler. While this isn't a necessary goal in itself, there's no harm in doing this either.	2014-08-08 14:09:37 +03:00
Martin Storsjö	57f6bcc4b0	Convert all tabs to spaces in assembly sources, unify indentation Previously the assembly sources had mixed indentation consisting of both spaces and tabs, making it quite hard to read unless the right tab size was used in the editor. Tabs have been interpreted as 4 spaces in most cases, matching the surrounding code.	2014-06-01 01:35:43 +03:00
Martin Storsjö	ac03b8b503	Avoid unnecessary tabs in macro declarations	2014-06-01 01:13:01 +03:00
Martin Storsjö	932a38abc0	Reformat the copyright header of deblocking_neon.S This makes it identical to the ones in the other files.	2014-05-31 13:44:21 +03:00
dongzhang	218adc7e29	Fix a bug in deblocking for neon 32 bit arm implementation	2014-05-09 14:06:16 +08:00
Martin Storsjö	23f57adaea	Do full register loads instead of single-lane loads in DeblockLumaEq4H_neon Instead of loading the registers one lane at a time, load full registers and then transpose them. This is faster, reducing the runtime for the function from about 506 cycles to 434 cycles (tested on a Cortex A8). This also avoids an issue which seems like a cpu bug, present on Sony Xperia T (cpu implementer 0x51 architecture 7 variant 0x1 part 0x04d). On such a device, it seemed like the "vswp q9, q10" could start executing before the previous vld4.u8 {d20[x],d21[x],d22[x],d23[x]}, [r3], r1 had finished and written back their result. Changing the "vswp q9, q10" into "vswp q10, q9", or into separate "vswp d18, d20; vswp d19, d21" (or the other way around) seemed to avoid the issue. This happened occasionally (a couple times per 100000 invocations or so).	2014-04-28 10:12:16 +03:00
Licai Guo	e39de8d404	reorganize common to inc/src/x86/arm	2014-03-18 19:41:32 -07:00

10 Commits