Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 4.
Fall back to a generic approach for ratios > 4.
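As a rough illustration of the idea above, here is a minimal scalar sketch in C, not the actual assembly. It models what pshufb does and shows how per-pixel source offsets can become a shuffle mask. The helper names (pshufb_model, build_mask), the 16.16 fixed-point step, and the nearest-pixel selection are assumptions made for this example; it ignores any filtering the real routine performs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of the SSSE3 pshufb instruction: each output byte is
 * selected from src by the low 4 bits of the matching mask byte, or
 * zeroed when the mask byte has its high bit set. */
static void pshufb_model(uint8_t dst[16], const uint8_t src[16],
                         const uint8_t mask[16]) {
  uint8_t tmp[16];
  for (int i = 0; i < 16; i++)
    tmp[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
  memcpy(dst, tmp, 16);
}

/* Hypothetical helper: build a mask that picks up to n_out pixels from
 * one 16-byte load, stepping through the source in 16.16 fixed point.
 * Unused mask entries are set to 0x80 so pshufb zeroes them.
 * Returns the number of pixels whose source offset fit in the load. */
static int build_mask(uint8_t mask[16], uint32_t step_fx, int n_out) {
  uint32_t pos = 0; /* relative offset from the start of the load */
  int n = 0;
  assert(n_out >= 0 && step_fx != 0);
  memset(mask, 0x80, 16);
  while (n < n_out && n < 16 && (pos >> 16) < 16) {
    mask[n++] = (uint8_t)(pos >> 16);
    pos += step_fx;
  }
  return n;
}
```

In this simplified model, a step of 4 << 16 (ratio 4) yields source offsets 0, 4, 8 and 12, so four output pixels still come from a single 16-byte load.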
The use of blendps makes this require SSE4.1. The pshufb path can be
backported to SSSE3 and the generic path to SSE2 for a minor reduction
in performance by replacing blendps and preceding instructions with an
equivalent sequence.
The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied, which AFAICT is safe with
the current usage of these routines.
Speedup is ~5.32x/~4.25x (32-bit/64-bit) for horizontal ratios <= 2,
~5.06x/~3.97x for ratios within (2, 4], and ~3.93x/~3.13x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 4.
Fall back to a generic approach for ratios > 4. Note that the generic
approach can be backported to SSE2.
The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied, which AFAICT is safe with
the current usage of these routines.
Speedup is ~6.67x/~3.26x (32-bit/64-bit) for horizontal ratios <= 2,
~6.24x/~3.00x for ratios within (2, 4], and ~4.89x/~2.17x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
Process 8 lines at a time rather than 16 lines at a time because
this appears to give more reliable memory subsystem performance on
Haswell.
Speedup is > 2x as compared to SSE2 when not memory-bound on Haswell.
On my Haswell MBP, VAACalcSadSsdBgd is about 3x faster when uncached,
which appears to be related to processing 8 lines at a time as opposed
to 16 lines at a time. The other routines are also faster than the
SSE2 routines in this case, but to a lesser extent.
The astyle configuration makes sure normal code is indented consistently
with 2 spaces, but astyle doesn't seem to touch the indentation in
these multi-line macros.
This makes sure we don't accidentally return the same sequence
of random numbers multiple times within one test (which would
be very non-random).
Every time srand(time()) is called, the pseudo-random number
generator is reinitialized to the same value (as long as time()
returns the same value).
By initializing the random number generator once and for all
before starting to run the unit tests, we are sure we don't
need to reinitialize it within all the tests and all the
functions that use random numbers.
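The effect described above can be sketched in a few lines of C. The helper names here are hypothetical and a fixed seed stands in for the value time() would return within a single second.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: reseed the generator, then draw one value. */
static int draw_after_reseed(unsigned seed) {
  srand(seed);
  return rand();
}

/* Two draws that each reseed with the same value always match, because
 * the C standard requires equal seeds to produce equal sequences. */
static void check_reseed_repeats(unsigned seed) {
  int a = draw_after_reseed(seed);
  int b = draw_after_reseed(seed);
  assert(a == b);
}
```

Calling srand() once at startup and never again in helper code avoids this, since later rand() calls then continue the same sequence instead of restarting it.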
This fixes occasional errors in MotionEstimateTest.
MotionEstimateTest was designed to allow the test to occasionally
not succeed - if it didn't succeed, it tried again, up to 100 times.
However, since the YUVPixelDataGenerator function reset the random
seed to time(), every attempt actually ran with the same random
data (as long as all 100 attempts ran within 1 second) - thus if
one attempt in MotionEstimateTest failed, all 100 of them would
fail. If the utility functions don't touch the random seed,
this is not an issue.
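A minimal sketch of why the retries were ineffective, assuming a simplified stand-in for the data generator. The functions fill_reseeding, fill_plain and attempts_with_new_data are invented for this example and are not the test suite's code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { kBufLen = 64 };

/* Stand-in for a generator that resets the seed on every call, as the
 * message above says YUVPixelDataGenerator did. */
static void fill_reseeding(unsigned char *buf, unsigned seed) {
  srand(seed);
  for (int i = 0; i < kBufLen; i++)
    buf[i] = (unsigned char)((rand() >> 7) & 0xFF);
}

/* Same generator, but leaving the seed untouched. */
static void fill_plain(unsigned char *buf, unsigned seed) {
  (void)seed;
  for (int i = 0; i < kBufLen; i++)
    buf[i] = (unsigned char)((rand() >> 7) & 0xFF);
}

typedef void (*fill_fn)(unsigned char *, unsigned);

/* Run `attempts` fills and count how many produced data that differs
 * from the first attempt. */
static int attempts_with_new_data(fill_fn fill, int attempts, unsigned seed) {
  unsigned char first[kBufLen], cur[kBufLen];
  int differing = 0;
  assert(attempts >= 1);
  fill(first, seed);
  for (int i = 1; i < attempts; i++) {
    fill(cur, seed);
    if (memcmp(first, cur, kBufLen) != 0)
      differing++;
  }
  return differing;
}
```

With fill_reseeding and a constant seed, every retry sees identical data, so a failing first attempt implies that all of them fail. With fill_plain, each retry draws the next values in the sequence.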