This commit optimizes the SSE2 implmentation of idct8x8_10. It exploits
the fact that only top-left 4x4 block contains non-zero coefficients,
and hence reduces the instructions needed.
The runtime of idct8x8_10_sse2 goes down from 216 to 198 CPU cycles,
estimated by averaging over 100000 runs. For pedestrian_area_1080p 300
frames coded at 4000kbps, the average decoding speed goes up from
79.3 fps to 79.7 fps.
Change-Id: I6d277bbaa3ec9e1562667906975bae06904cb180