Sindre Aamås 57fc3e9917 [Processing] Add AVX2 VAA routines
Process 8 lines at a time rather than 16 lines at a time because
this appears to give more reliable memory subsystem performance on
Haswell.

Speedup is > 2x as compared to SSE2 when not memory-bound on Haswell.
On my Haswell MBP, VAACalcSadSsdBgd is about ~3x faster when uncached,
which appears to be related to processing 8 lines at a time as opposed
to 16 lines at a time. The other routines are also faster as compared
to the SSE2 routines in this case but to a lesser extent.
2016-04-11 16:09:56 +02:00
..