57fc3e9917
Process 8 lines at a time rather than 16 lines at a time because this appears to give more reliable memory subsystem performance on Haswell. Speedup is > 2x as compared to SSE2 when not memory-bound on Haswell. On my Haswell MBP, VAACalcSadSsdBgd is about ~3x faster when uncached, which appears to be related to processing 8 lines at a time as opposed to 16 lines at a time. The other routines are also faster as compared to the SSE2 routines in this case but to a lesser extent.