Clément Bœsch
62d31307c1
avcodec/x86/vp9lpf: add a comment above a bunch of SWAP.
2014-04-20 21:33:58 +02:00
Clément Bœsch
f0d368d758
avcodec/x86/vp9lpf: merge a few movs with other instructions.
2014-04-20 21:29:11 +02:00
Clément Bœsch
010732b73a
vp9/x86: simplify FILTER_INIT.
...
In the 2 FILTER_INIT usages, the source is already preloaded so that
extra complexity taken from FILTER_UPDATE is not necessary.
Also add forgotten "mask" argument in FILTER_{INIT,UPDATE} comments.
2014-04-19 17:30:33 +02:00
Clément Bœsch
b8d002dc95
vp9/x86: clarify mixed splatb.
2014-04-19 17:00:51 +02:00
Clément Bœsch
669d4f9053
x86/vp9lpf: simplify 2nd transpose in 44/48/88/84.
...
For non-avx optims, this saves 8 movs.
before:
1785 decicycles in ff_vp9_loop_filter_h_44_16_ssse3, 524129 runs, 159 skips
3327 decicycles in ff_vp9_loop_filter_h_48_16_ssse3, 262116 runs, 28 skips
2712 decicycles in ff_vp9_loop_filter_h_88_16_ssse3, 4193729 runs, 575 skips
3237 decicycles in ff_vp9_loop_filter_h_84_16_ssse3, 524061 runs, 227 skips
after:
1768 decicycles in ff_vp9_loop_filter_h_44_16_ssse3, 524062 runs, 226 skips
3310 decicycles in ff_vp9_loop_filter_h_48_16_ssse3, 262107 runs, 37 skips
2719 decicycles in ff_vp9_loop_filter_h_88_16_ssse3, 4193954 runs, 350 skips
3184 decicycles in ff_vp9_loop_filter_h_84_16_ssse3, 524236 runs, 52 skips
2014-02-08 11:10:23 +01:00
Clément Bœsch
d92a725329
x86/vp9lpf: remove 8 SWAPs in 84/48 transpose.
2014-02-05 07:21:13 +01:00
Clément Bœsch
97dde561de
x86/vp9lpf: remove braindead double pxor.
2014-02-05 07:21:11 +01:00
Clément Bœsch
9a3b05b0a9
x86/vp9lpf: save a few mov in flat8in/hev masks calc.
2014-02-05 07:21:09 +01:00
Clément Bœsch
91d85bb167
x86/vp9lpf: add ff_vp9_loop_filter_[vh]_44_16_{sse2,ssse3,avx}.
2014-02-05 07:21:06 +01:00
Clément Bœsch
c5dd73b890
x86/vp9lpf: add ff_vp9_loop_filter_h_{48,84}_16_{sse2,ssse3,avx}().
...
5.40s → 5.30s overall decode time with -threads 1 on ped1080p.webm
(i7 920, ssse3)
2014-01-30 19:34:13 +01:00
James Almer
644c32ea4b
x86/vp9lpf: add ff_vp9_loop_filter_[vh]_88_16_sse2()
...
Similar gains as the ssse3 version once again
Signed-off-by: James Almer <jamrial@gmail.com>
2014-01-28 09:30:55 +01:00
Clément Bœsch
222c46c531
x86/vp9lpf: add ff_vp9_loop_filter_[vh]_88_16_{ssse3,avx}.
...
9680 decicycles in loop_filter_v_88_16_c, 4193765 runs, 539 skips
9233 decicycles in loop_filter_h_88_16_c, 4193751 runs, 553 skips
1929 decicycles in ff_vp9_loop_filter_v_88_16_ssse3, 4194118 runs, 186 skips
2738 decicycles in ff_vp9_loop_filter_h_88_16_ssse3, 4193861 runs, 443 skips
5.978 → 5.417 overall decode time on ped1080p.webm (-threads 1)
Adding SSE2 support should be relatively trivial (just a matter of
changing the pshufb [mask_mix] with something else), patch welcome.
2014-01-28 07:36:38 +01:00
Clément Bœsch
822385d775
x86/vp9lpf: add a preload system in FILTER_UPDATE.
...
Allow some macro refactoring in filter14().
2014-01-27 22:39:26 +01:00
Clément Bœsch
315b4775ad
x86/vp9lpf: refactor v/h using common macros for P7 to Q7.
2014-01-27 22:39:26 +01:00
Clément Bœsch
5d144086cc
x86/vp9lpf: faster P7..Q7 accesses.
...
Introduce 2 additional registers for stride3 and mstride3 to allow
direct accesses (lea drops).
3931 → 3827 decicycles in ff_vp9_loop_filter_v_16_16_ssse3
Also uses defines to clarify the code.
2014-01-27 22:37:42 +01:00
James Almer
d2a7314f1e
vp9/x86: add ff_vp9_loop_filter_[vh]_16_16_sse2().
...
Similar gains in performance as the SSSE3 version
Signed-off-by: James Almer <jamrial@gmail.com>
2014-01-17 14:16:38 +01:00
Clément Bœsch
8b4190da93
vp9/x86: add AVX for itxfm and lpf.
...
4412 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 4193462 runs, 842 skips
3600 decicycles in ff_vp9_loop_filter_h_16_16_avx, 4193621 runs, 683 skips
3010 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 4193528 runs, 776 skips
2678 decicycles in ff_vp9_loop_filter_v_16_16_avx, 4193742 runs, 562 skips
23025 decicycles in ff_vp9_idct_idct_32x32_add_ssse3, 2096871 runs, 281 skips
19943 decicycles in ff_vp9_idct_idct_32x32_add_avx, 2096815 runs, 337 skips
4675 decicycles in ff_vp9_idct_idct_16x16_add_ssse3, 4194018 runs, 286 skips
3980 decicycles in ff_vp9_idct_idct_16x16_add_avx, 4194022 runs, 282 skips
967 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 16776972 runs, 244 skips
887 decicycles in ff_vp9_idct_idct_8x8_add_avx, 16777002 runs, 214 skips
2014-01-15 15:54:03 +01:00
Clément Bœsch
af68bd1c06
vp9/x86: add ff_vp9_loop_filter_[vh]_16_16_ssse3().
...
16662 decicycles in loop_filter_h_16_16_c, 8387355 runs, 1253 skips
17510 decicycles in loop_filter_v_16_16_c, 8387516 runs, 1092 skips
4941 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 8387887 runs, 721 skips
3899 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 8387980 runs, 628 skips
Overall decode time goes from:
./ffmpeg -v 0 -nostats -threads 1 -i ~/samples/vp9/ped1080p.webm -f null - 8.10s user 0.02s system 99% cpu 8.126 total
to:
./ffmpeg -v 0 -nostats -threads 1 -i ~/samples/vp9/ped1080p.webm -f null - 6.15s user 0.04s system 99% cpu 6.199 total
(46 to 61 fps)
2014-01-12 20:20:24 +01:00