The .align directive takes an argument in number of bits, i.e. the
actual alignment is 2^n. Previously building with binutils failed,
since 16 isn't a valid parameter to .align, the maximum is 15.
Thus, this makes the code try to align to 16 bytes, instead of aligning
to 65536 bytes.
This fixes building for android.
This also clears up the same mistake in the aarch64 code, even though
that one built just fine.
In WelsHadamardQuant2x2SkipKernel_AArch64_neon, write the output
of cmhi into v0 - this is the register that is assumed to hold
the output.
In WelsHadamardQuant2x2_AArch64_neon, subtract the count of zero
elements from 4, not from 16.
This makes the encoding unit tests pass again.
According to the arm architecture reference manual, the mov (vector)
instruction can only use the arrangement specifiers '8b' and '16b'.
The apple tools still accept the '8h' form, but it assembles into the
same as '16b'. (When copying a vector register to another, the element
size in the vectors don't matter.)
This fixes building with gnu binutils.