Avoid X86_ASM ifdef.
Ideally, we would want to update all routines to average vertically
first, which would make this ifdef unnecessary. In the interim, this
enables the tests to run successfully again on x86 without SSSE3
support.
x86 SIMD downsampling routines may, for convenience, read slightly
beyond the input data and into the alignment padding area beyond each
line. This causes valgrind to warn about uninitialized values, even
when those values only affect lanes of a SIMD vector that are never
actually used.
Avoid these false positives by zero-initializing the padding area beyond
each line of the source buffer used for downsampling.
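
A minimal sketch of the zero-initialization (buffer and stride names
are hypothetical, not the ones used in the code):

    #include <cstdint>
    #include <cstring>

    // Zero the alignment padding to the right of each line so that
    // SIMD routines reading slightly past iWidth only see initialized
    // bytes.
    static void ZeroLinePadding (uint8_t* pSrc, int iWidth, int iHeight,
                                 int iStride) {
      for (int y = 0; y < iHeight; y++)
        memset (pSrc + y * iStride + iWidth, 0, iStride - iWidth);
    }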
This fixes building for Android in a different way than f5e483ce did.
On Android, <cassert> isn't available in the normal include path; it
is only available together with the STL headers.
We intentionally avoid using STL within the main libopenh264.so, to
simplify dependency chains for users of the library (which could
otherwise run into conflicts if the surrounding app wants to use a
different STL implementation).
The previous fix only provided the STL headers, without actually
linking against STL, so at this point it isn't a real issue yet, but
it is still a very slippery slope towards accidentally starting to
rely on STL within the core library.
Instead, explicitly avoid using STL within the core library by not
even providing the include path.
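
For illustration, one way to keep assertions in the core library
without the STL include path is to use the C library header directly;
whether this matches the actual change made here is an assumption:

    // <assert.h> comes from the C library and needs no STL include
    // path, unlike <cassert>, which on Android ships alongside the
    // STL headers.
    #include <assert.h>

    static int Clamp (int iVal, int iMin, int iMax) {
      assert (iMin <= iMax);  // plain C assert, no STL dependency
      return iVal < iMin ? iMin : (iVal > iMax ? iMax : iVal);
    }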
AFAICT, it is sufficient that the sample buffer has space for half
the source width/height. With the current sample buffer size, this
enables its use for resolutions up to 3840x2176.
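
As a quick sanity check of the arithmetic (a sketch; the real buffer
constant is likely named and sized differently):

    #include <cstdint>

    // A dyadic downsample of a WxH image needs (W/2)*(H/2) samples,
    // so a buffer sized for half the source width/height covers
    // 3840x2176 with 1920 * 1088 = 2088960 samples.
    constexpr int64_t RequiredSamples (int64_t kiWidth, int64_t kiHeight) {
      return (kiWidth / 2) * (kiHeight / 2);
    }
    static_assert (RequiredSamples (3840, 2176) == 1920 * 1088,
                   "3840x2176 fits in a 1920x1088 sample buffer");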
Average vertically before horizontally; the horizontal averaging
takes more work per output pixel. Doing the vertical averaging first
halves the number of horizontal averaging operations.
Use pmaddubsw and pavgw to do the horizontal averaging for a slight
performance improvement.
Minor tweaks.
Improve the SSSE3 dyadic downsample routines and drop the SSE4 routines.
The non-temporal loads used in the SSE4 routines provide no benefit
for cache-backed memory, AFAIK.
Adjust tests because averaging vertically first gives slightly different
output.
~2.39x speedup for the widthx32 routine on Haswell when not memory-bound.
~2.20x speedup for the widthx16 routine on Haswell when not memory-bound.
Note that the widthx16 routine can be unrolled for further speedup.
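
The assembly itself isn't reproduced here, but the vertical-first
scheme can be sketched with SSSE3 intrinsics (function and variable
names are illustrative, not the actual routines):

    #include <tmmintrin.h>  // SSSE3
    #include <cstdint>

    // Downsample 16 input pixels from two rows into 8 output pixels.
    // pavgb averages the two rows vertically, pmaddubsw sums adjacent
    // pairs horizontally, and pavgw against zero turns each sum into
    // a rounded average. Rounding vertically first is what makes the
    // output differ slightly from averaging all four pixels at once.
    static inline __m128i DyadicDownsample16 (const uint8_t* pRow0,
                                              const uint8_t* pRow1) {
      const __m128i kOnes = _mm_set1_epi8 (1);
      __m128i vTop  = _mm_loadu_si128 ((const __m128i*) pRow0);
      __m128i vBot  = _mm_loadu_si128 ((const __m128i*) pRow1);
      __m128i vVert = _mm_avg_epu8 (vTop, vBot);        // (top+bot+1)>>1
      __m128i vSum  = _mm_maddubs_epi16 (vVert, kOnes); // pairwise sums
      __m128i vAvg  = _mm_avg_epu16 (vSum, _mm_setzero_si128 ()); // (sum+1)>>1
      return _mm_packus_epi16 (vAvg, vAvg);  // 8 output pixels in low half
    }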
Assume that data can be written into the padding area following each
line. This enables the use of faster routines for more cases.
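
A hedged sketch of what such a dispatch check might look like (the
actual condition in the code may differ):

    // If the destination stride leaves writable padding up to the
    // next multiple of 16 pixels, a wider (faster) routine can be
    // used even for narrower widths. Names are hypothetical.
    static bool CanUseWideRoutine (int iDstWidth, int iDstStride) {
      const int kiRounded = (iDstWidth + 15) & ~15;
      return kiRounded <= iDstStride;
    }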
Align downsample buffer stride to a multiple of 32.
With this, all strides used should be multiples of 16, which means
that the narrower downsample routines can be dropped altogether.
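
The alignment itself is a one-liner (a sketch with hypothetical
names); a 32-aligned stride also keeps the halved stride of the next
dyadic level a multiple of 16:

    // Round the downsample buffer stride up to a multiple of 32.
    static int AlignStride32 (int iWidth) {
      return (iWidth + 31) & ~31;
    }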