vp9_block_error_sse2 can only handle 16 bytes at a time but
the function requires to handle a sequence of 32 bytes at a time
so each 16 bytes is handled in a different register.
With AVX2 optimization the 32 bytes can be handled in one register instead
of two in the SSE2
The vp9_block_error was optimized by 85%.
The user level was optimized by 1.2%
Change-Id: Ia8fffe60e61eff7432a5fbd538757894f6c319fd