Scott LaVarnway a5e97d874b VP9_COPY_CONVOLVE_SSE2 optimization
This function suffers from a couple of problems on small-core machines (tablets):
- The load of the next iteration is blocked by the store of the previous iteration
- 4k aliasing (between a future store and older loads)
- Current small-core machines are in-order, so the store will spin the rehabQ until the load is finished
Fixed by:
- prefetching 2 lines ahead
- unrolling the copy to handle 2 rows of the block per iteration
- pre-loading all xmm registers before the loop, with the final stores after the loop
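
For illustration only, the first two fixes can be sketched in C with SSE2 intrinsics. This is not the code from this commit: the function name copy_w16_sketch, the fixed 16-byte width, and the even-height requirement are assumptions made for the example. It issues both row loads before either store inside one iteration; it does not reproduce the full scheme of pre-loading all registers before the loop and storing after it.

```c
#include <emmintrin.h> /* SSE2 intrinsics; also provides _mm_prefetch */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a libvpx function): copies a block that is
 * 16 bytes wide and h rows tall. h is assumed to be even. */
static void copy_w16_sketch(const uint8_t *src, ptrdiff_t src_stride,
                            uint8_t *dst, ptrdiff_t dst_stride, int h) {
  int y;
  for (y = 0; y < h; y += 2) {
    __m128i r0, r1;

    /* Prefetch the two rows the next iteration will load ("2 lines ahead").
     * The guard keeps the prefetch addresses inside the block. */
    if (y + 2 < h) {
      _mm_prefetch((const char *)(src + 2 * src_stride), _MM_HINT_T0);
      _mm_prefetch((const char *)(src + 3 * src_stride), _MM_HINT_T0);
    }

    /* Issue both loads before either store, so that neither load of this
     * iteration is queued behind a store of this iteration. */
    r0 = _mm_loadu_si128((const __m128i *)src);
    r1 = _mm_loadu_si128((const __m128i *)(src + src_stride));

    _mm_storeu_si128((__m128i *)dst, r0);
    _mm_storeu_si128((__m128i *)(dst + dst_stride), r1);

    src += 2 * src_stride;
    dst += 2 * dst_stride;
  }
}
```

Unaligned loads and stores are used so the sketch makes no assumption about buffer alignment.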
Performance improvement by block size:
copy_convolve_sse2 64x64 - 16%
copy_convolve_sse2 32x32 - 52%
copy_convolve_sse2 16x16 - 6%
copy_convolve_sse2 8x8 - 2.5%
copy_convolve_sse2 4x4 - 2.7%
Credit goes to Tom Craver (tom.r.craver@intel.com) and Ilya Albrekht (ilya.albrekht@intel.com).

Change-Id: I63d3428799c50b2bf7b5677c8268bacb9fc29671
2015-07-31 14:51:51 -07:00