Commit Graph

7 Commits

Author SHA1 Message Date
Ronald S. Bultje
e84d14df10 vp9/x86: idct_32x32_add_ssse3.
Sub-IDCTs will follow later. ped1080.webm goes from 9.295s to 8.191s
(13.5% faster). The IDCT itself goes from 4372 (intra) or 4337 (inter)
to 403 (intra) or 329 (inter) cycles for the DC-only form, 23755 (intra)
or 23723 (inter) to 3497 (intra) or 3607 (inter) cycles for the no-DC
form, which averages from 23393 (intra) or 16612 (inter) to 3449 (intra)
or 2392 (inter) for all 32x32s together, i.e. about ~7x faster (all
tests done on ped1080p.webm).
2014-01-07 20:43:30 -05:00
Ronald S. Bultje
18175baa54 vp9/x86: 16px MC functions (64bit only).
Cycle counts for large MCs (old -> new on ped1080p.webm, mx!=0&&my!=0):
16x8:    876 ->   870  (0.7%)
16x16:  1444 ->  1435  (0.7%)
16x32:  2784 ->  2748  (1.3%)
32x16:  2455 ->  2349  (4.5%)
32x32:  4641 ->  4084 (13.6%)
32x64:  9200 ->  7834 (17.4%)
64x32:  8980 ->  7197 (24.8%)
64x64: 17330 -> 13796 (25.6%)
Total decoding time goes from 9.326sec to 9.182sec.
2013-12-26 21:05:10 -05:00
Ronald S. Bultje
8d4c616fc0 vp9/x86: idct_add_16x16_ssse3.
Currently only dc-only and full 16x16. Other subforms will follow in the
near future. Total decoding time of ped1080p.webm goes from 9.7 to 9.3
seconds. DC-only goes from 957 -> 131 cycles, and the full IDCT goes
from ~4050 to ~745 cycles.
2013-12-14 12:13:26 -05:00
Clément Bœsch
d28c79b003 avcodec/x86/vp9dsp: use EXTERNAL_* macros.
Original fix by one of these developers:
    Anton Khirnov <anton@khirnov.net>
    Diego Biurrun <diego@biurrun.de>
    Luca Barbato <lu_zero@gentoo.org>
    Martin Storsjö <martin@martin.st>

See 97962b2 / 72ca830

Personnal guess is Diego Biurrun.
2013-11-16 17:03:17 +01:00
Clément Bœsch
87434cf373 avcodec/vp9: add ff_vp9_idct_idct_{4x4,8x8}_ssse3().
1789 decicycles in idct_idct_4x4_add_c, 262136 runs, 8 skips
1839 decicycles in idct_idct_4x4_add_c, 524270 runs, 18 skips
1864 decicycles in idct_idct_4x4_add_c, 1048548 runs, 28 skips

529 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 262138 runs, 6 skips
516 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 524282 runs, 6 skips
474 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 1048565 runs, 11 skips

(~3.9x faster)

7726 decicycles in idct_idct_8x8_add_c, 1048433 runs, 143 skips
7732 decicycles in idct_idct_8x8_add_c, 2096882 runs, 270 skips
7731 decicycles in idct_idct_8x8_add_c, 4193772 runs, 532 skips

1145 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 1048549 runs, 27 skips
1137 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 2097097 runs, 55 skips
1086 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 4194188 runs, 116 skips

(~7.1x faster)

Overall decode time before commit:
  16.48s user 0.03s system 99% cpu 16.526 total
  16.54s user 0.01s system 99% cpu 16.566 total
  16.46s user 0.03s system 99% cpu 16.511 total

Overall decode time after commit:
  16.34s user 0.02s system 99% cpu 16.378 total
  16.28s user 0.02s system 99% cpu 16.315 total
  16.32s user 0.03s system 99% cpu 16.366 total

Tested on i7 920 with 40s 1080p footage.
2013-11-05 19:25:40 +01:00
Ronald S. Bultje
f1548c008f Full-pixel MC functions.
Decoding time of ped1080p.webm goes from 11.3sec to 11.1sec.
2013-10-02 21:03:15 -04:00
Ronald S. Bultje
c07ac8d467 VP9 MC (ssse3) optimizations.
Decoding time of ped1080p.webm goes from 20.7sec to 11.3sec.
2013-10-02 21:03:15 -04:00