According to the Win64 ABI, these registers need to be preserved,
and compilers are allowed to rely on their content to stay
available - not only for float usage but for any usage, anywhere,
in the calling C++ code.
This adds a macro which pushes the clobbered registers onto the
stack if targeting win64 (and a matching one which restores them).
The parameter to the macro is the number of xmm registers used
(e.g. if using xmm0 - xmm7, the parameter is 8), or in other
words, the number of the highest xmm register used plus one.
This is similar to how the same issue is handled for the NEON
registers q4-q7 with the vpush instruction, except that they needed
to be preserved on all platforms, not only on one particular platform.
This allows removing the XMMREG_PROTECT_* hacks, which can
easily fail if the compiler chooses to use the callee saved
xmm registers in an unexpected spot.
Remap q5 to q8, q6 to q9, q7 to q10 and q8 to q11, and push
q4 to the stack.
This was missed previously since the codec unittest doesn't
test encoding with loop filter enabled yet.
This code checked whether one function pointer was non-null,
but the went on to call a different function pointer. Check
for the one that actually was called.
There's no requirement to work with the Intel NDK. The NDK build
files don't even need to create any cpufeatures subdirectory since
all of the cpufeatures library is handled within the normal build
system. Finally, the encoder makefile tried to create a directory
for "welsdecdemo", not anything for "welsencdemo".
Thus, if this really was necessary, it would already have been noticed
that it missed an entry for "welsencdemo". Therefore, remove this
hack/workaround.
According to the calling convention, the registers q4-q7 should be
preserved by functions. The caller (generated by the compiler) could
be using those registers anywhere for any intermediate data.
Functions that use more than 12 of the qX registers must push
the clobbered registers on the stack in order to be able to restore them
afterwards.
In functions that don't use all 16 registers, but clobber some of
the callee saved registers q4-q7, one or more of them are remapped
to reduce the number of registers that have to be saved/restored.
This incurs a very small (around 0.5%) slowdown in the decoder and
encoder.
According to the calling convention, the registers q4-q7 should be
preserved by functions. The caller (generated by the compiler) could
be using those registers anywhere for any intermediate data.
Functions that use 12 or less of the qX registers can avoid
violating the calling convention by simply using other registers instead
of the callee saved registers q4-q7.
This change only remaps the registers used within functions - therefore
this does not affect performance at all. E.g. in functions using
registers q0-q7, we now use q0-q3 and q8-q11 instead.