
Use 64-byte cache lines, reduce the main loop to 64 bytes per iteration instead of 128 bytes, and adjust the prefetch distance to the optimal value.
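
As a rough illustration of the loop shape described here, the C sketch below copies one 64-byte cache line per iteration and prefetches a fixed distance ahead. It is not the actual implementation: the function name `copy_prefetch` and the `PREFETCH_DISTANCE` value of 256 are hypothetical placeholders, and the optimal distance has to be measured on the target CPU. `__builtin_prefetch` is a GCC/Clang builtin.

```c
#include <stddef.h>
#include <string.h>

#define CACHE_LINE        64   /* cache line size in bytes, per the change above */
#define PREFETCH_DISTANCE 256  /* hypothetical value; tune for the target CPU */

/* Copy n bytes from src to dst, one cache line per main-loop iteration. */
void *copy_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= CACHE_LINE) {
        /* Prefetch only while the target address is still inside the source. */
        if (n > PREFETCH_DISTANCE)
            __builtin_prefetch(s + PREFETCH_DISTANCE, 0, 0);

        memcpy(d, s, CACHE_LINE);  /* 64-byte main-loop body */
        d += CACHE_LINE;
        s += CACHE_LINE;
        n -= CACHE_LINE;
    }

    memcpy(d, s, n);               /* tail of fewer than 64 bytes */
    return dst;
}
```

A real routine would typically be hand-written assembly with aligned loads and stores; the sketch only shows the 64-byte stride and where the prefetch sits relative to the copy.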