10 Commits

Author SHA1 Message Date
Martin Storsjö
0995390c4a Remove apple specific versions of arm macros with arguments
The apple assembler for arm can handle the gnu binutils style
macros just fine these days, so there is no need to duplicate all
of these macros in two syntaxes, when the new one works fine in all cases.

We already require a new enough assembler to support the gnu binutils
style features since we use the .rept directive in a few places.
2015-03-27 11:11:45 +02:00
Martin Storsjö
0b0884874d Remove superfluous .text directives at the start of arm assembly files
This directive can be set by the common include header that is
included by all files anyway.
2015-03-27 10:46:34 +02:00
zhiliang wang
01b74ea7c1 Add asm code for NoneZeroCount and refine related code 2015-01-04 16:39:17 +08:00
Martin Storsjö
38d2d64ede Explicitly add .syntax unified when building for iOS
This is the default when building with the clang built-in assembler,
but not if using the external assembler - thus always specify it,
for clarity.

Also use the three-operand for of a sub instruction in BS_NZC_CHECK.
The same is already done in the gnu version of the macro.

This fixes building most of the arm assembly with Apple's external
assembler. While this isn't a necessary goal in itself, there's no
harm in doing this either.
2014-08-08 14:09:37 +03:00
Martin Storsjö
57f6bcc4b0 Convert all tabs to spaces in assembly sources, unify indentation
Previously the assembly sources had mixed indentation consisting
of both spaces and tabs, making it quite hard to read unless
the right tab size was used in the editor.

Tabs have been interpreted as 4 spaces in most cases, matching
the surrounding code.
2014-06-01 01:35:43 +03:00
Martin Storsjö
ac03b8b503 Avoid unnecessary tabs in macro declarations 2014-06-01 01:13:01 +03:00
Martin Storsjö
932a38abc0 Reformat the copyright header of deblocking_neon.S
This makes it identical to the ones in the other files.
2014-05-31 13:44:21 +03:00
dongzhang
218adc7e29 Fix a bug in deblocking for neon 32 bit arm implementation 2014-05-09 14:06:16 +08:00
Martin Storsjö
23f57adaea Do full register loads instead of single-lane loads in DeblockLumaEq4H_neon
Instead of loading the registers one lane at a time, load full
registers and then transpose them.

This is faster, reducing the runtime for the function from about
506 cycles to 434 cycles (tested on a Cortex A8).

This also avoids an issue which seems like a cpu bug, present
on Sony Xperia T (cpu implementer 0x51 architecture 7 variant 0x1
part 0x04d). On such a device, it seemed like the "vswp q9, q10"
could start executing before the previous
vld4.u8 {d20[x],d21[x],d22[x],d23[x]}, [r3], r1
had finished and written back their result. Changing the
"vswp q9, q10" into "vswp q10, q9", or into separate
"vswp d18, d20; vswp d19, d21" (or the other way around) seemed to
avoid the issue. This happened occasionally (a couple times per
100000 invocations or so).
2014-04-28 10:12:16 +03:00
Licai Guo
e39de8d404 reoranize common to inc/src/x86/arm 2014-03-18 19:41:32 -07:00