The apple assembler for arm can handle the gnu binutils style
macros just fine these days, so there is no need to duplicate all
of these macros in two syntaxes, when the new one works fine in all cases.
We already require a new enough assembler to support the gnu binutils
style features since we use the .rept directive in a few places.
This file is built on its own from within the xcode projects,
even though it isn't necessary. Previously its contents was just
empty, but now a .syntax unified was added, which failed the build
when building for arm64.
Make this file a no-op, just like the other arm assembly source files,
unless HAVE_NEON is defined.
This is the default when building with the clang built-in assembler,
but not if using the external assembler - thus always specify it,
for clarity.
Also use the three-operand for of a sub instruction in BS_NZC_CHECK.
The same is already done in the gnu version of the macro.
This fixes building most of the arm assembly with Apple's external
assembler. While this isn't a necessary goal in itself, there's no
harm in doing this either.
Previously the assembly sources had mixed indentation consisting
of both spaces and tabs, making it quite hard to read unless
the right tab size was used in the editor.
Tabs have been interpreted as 4 spaces in most cases, matching
the surrounding code.
Instead of loading the registers one lane at a time, load full
registers and then transpose them.
This is faster, reducing the runtime for the function from about
506 cycles to 434 cycles (tested on a Cortex A8).
This also avoids an issue which seems like a cpu bug, present
on Sony Xperia T (cpu implementer 0x51 architecture 7 variant 0x1
part 0x04d). On such a device, it seemed like the "vswp q9, q10"
could start executing before the previous
vld4.u8 {d20[x],d21[x],d22[x],d23[x]}, [r3], r1
had finished and written back their result. Changing the
"vswp q9, q10" into "vswp q10, q9", or into separate
"vswp d18, d20; vswp d19, d21" (or the other way around) seemed to
avoid the issue. This happened occasionally (a couple times per
100000 invocations or so).