Commit Graph

15 Commits

Author SHA1 Message Date
Ronald S. Bultje
8173d1ffc0 vp9/x86: 16x16 iadst_idct, idct_iadst and iadst_iadst (ssse3+avx).
Sample timings on ped1080p.webm (of the ssse3 functions):
iadst_idct:  4672 -> 1175 cycles
idct_iadst:  4736 -> 1263 cycles
iadst_iadst: 4924 -> 1438 cycles
Total decoding time changed from 6.565s to 6.413s.
2014-01-16 13:49:31 +01:00
Clément Bœsch
8b4190da93 vp9/x86: add AVX for itxfm and lpf.
4412 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 4193462 runs, 842 skips
3600 decicycles in ff_vp9_loop_filter_h_16_16_avx, 4193621 runs, 683 skips

3010 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 4193528 runs, 776 skips
2678 decicycles in ff_vp9_loop_filter_v_16_16_avx, 4193742 runs, 562 skips

23025 decicycles in ff_vp9_idct_idct_32x32_add_ssse3, 2096871 runs, 281 skips
19943 decicycles in ff_vp9_idct_idct_32x32_add_avx, 2096815 runs, 337 skips

4675 decicycles in ff_vp9_idct_idct_16x16_add_ssse3, 4194018 runs, 286 skips
3980 decicycles in ff_vp9_idct_idct_16x16_add_avx, 4194022 runs, 282 skips

967 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 16776972 runs, 244 skips
887 decicycles in ff_vp9_idct_idct_8x8_add_avx, 16777002 runs, 214 skips
2014-01-15 15:54:03 +01:00
Clément Bœsch
e11ceea68f vp9/x86: factor out some code in VP9_UNPACK_MULSUB_2W_4X. 2014-01-12 20:19:00 +01:00
Clément Bœsch
c9aa0b8f70 vp9/x86: remove reg redundancy in VP9_MULSUB_2W_2X. 2014-01-12 20:18:55 +01:00
Clément Bœsch
7c55ee6168 vp9/x86: merge IDCT coef macros. 2014-01-12 20:18:44 +01:00
Ronald S. Bultje
c6fe984f2f vp9/x86: make STORE_2X2 macro local.
Prevents this assembler warning:
libavcodec/x86/vp9itxfm.asm:1208: warning: (VP9_IDCT32_1D:309)
redefining multi-line macro `STORE_2X2'

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-08 14:07:15 +01:00
Ronald S. Bultje
04a187fb2a vp9/x86: idct_32x32_add_ssse3 sub-8x8-idct.
Runtime of the full 32x32 idct goes from 2446 to 2441 cycles (intra) or
from 1425 to 1306 cycles (inter). Overall runtime is not significantly
affected.
2014-01-07 20:43:35 -05:00
Ronald S. Bultje
37b001d14d vp9/x86: idct_32x32_add_ssse3 sub-16x16-idct.
Runtime of all IDCTs together goes from 3327 to 2473 cycles (intra, i.e.
~35% faster) or from 2312 to 1448 cycles (inter, i.e. ~60% faster). Total
decode time of ped1080p.webm goes from 8.086sec to 7.974sec (1.4% faster).
2014-01-07 20:43:34 -05:00
Ronald S. Bultje
e84d14df10 vp9/x86: idct_32x32_add_ssse3.
Sub-IDCTs will follow later. ped1080.webm goes from 9.295s to 8.191s
(13.5% faster). The IDCT itself goes from 4372 (intra) or 4337 (inter)
to 403 (intra) or 329 (inter) cycles for the DC-only form, 23755 (intra)
or 23723 (inter) to 3497 (intra) or 3607 (inter) cycles for the no-DC
form, which averages from 23393 (intra) or 16612 (inter) to 3449 (intra)
or 2392 (inter) for all 32x32s together, i.e. about ~7x faster (all
tests done on ped1080p.webm).
2014-01-07 20:43:30 -05:00
Ronald S. Bultje
0d9375fc90 vp9/x86: 16x16 sub-IDCT for top-left 8x8 subblock (eob <= 38).
Sub8x8 speed (w/o dc-only case) goes from ~750 cycles (inter) or ~735
cycles (intra) to ~415 cycles (inter) or ~430 cycles (intra). Average
overall 16x16 idct speed goes from ~635 cycles (inter) or ~720 cycles
(intra) to ~415 cycles (inter) or ~545 (intra) - all measurements done
using ped1080p.webm.
2013-12-26 07:40:25 -05:00
Ronald S. Bultje
8d4c616fc0 vp9/x86: idct_add_16x16_ssse3.
Currently only dc-only and full 16x16. Other subforms will follow in the
near future. Total decoding time of ped1080p.webm goes from 9.7 to 9.3
seconds. DC-only goes from 957 -> 131 cycles, and the full IDCT goes
from ~4050 to ~745 cycles.
2013-12-14 12:13:26 -05:00
Ronald S. Bultje
92436e8ad9 vp9: implement top/left half (4x4) sub-8x8-IDCT.
For that specific case (eob>3&&eob<=12), runtime of idct8x8 goes from
668 to 477 cycles. For all idct8x8, runtime goes from 521 to 490 cycles.
2013-12-07 12:39:36 -05:00
Ronald S. Bultje
b2045c44a9 vp9: split pre-load of 11585x2 out of 1d idct macro.
This allows us to load it only once, instead of twice, in this function.
2013-12-07 12:39:36 -05:00
Ronald S. Bultje
f9a0d4c6e0 vp9: minor refactorings in idct ssse3 assembly.
Make register usage in macros explicit; change mulsub_2w_4x to use 2
instead of 3 temp registers.
2013-12-07 12:39:35 -05:00
Ronald S. Bultje
8729964b99 vp9: split x86 assembly in two files.
(And in future, loopfilter or intra pred could be put in their own
respective files also.)
2013-12-07 12:39:35 -05:00