Ronald S. Bultje
8d4c616fc0
vp9/x86: idct_add_16x16_ssse3.
...
Currently only dc-only and full 16x16. Other subforms will follow in the
near future. Total decoding time of ped1080p.webm goes from 9.7 to 9.3
seconds. DC-only goes from 957 -> 131 cycles, and the full IDCT goes
from ~4050 to ~745 cycles.
2013-12-14 12:13:26 -05:00
Ronald S. Bultje
92436e8ad9
vp9: implement top/left half (4x4) sub-8x8-IDCT.
...
For that specific case (eob>3&&eob<=12), runtime of idct8x8 goes from
668 to 477 cycles. For all idct8x8, runtime goes from 521 to 490 cycles.
2013-12-07 12:39:36 -05:00
Ronald S. Bultje
b2045c44a9
vp9: split pre-load of 11585x2 out of 1d idct macro.
...
This allows us to load it only once, instead of twice, in this function.
2013-12-07 12:39:36 -05:00
Ronald S. Bultje
f9a0d4c6e0
vp9: minor refactorings in idct ssse3 assembly.
...
Make register usage in macros explicit; change mulsub_2w_4x to use 2
instead of 3 temp registers.
2013-12-07 12:39:35 -05:00
Ronald S. Bultje
8729964b99
vp9: split x86 assembly in two files.
...
(And in future, loopfilter or intra pred could be put in their own
respective files also.)
2013-12-07 12:39:35 -05:00