Ronald S. Bultje
97474d527f
vp9/x86: iwht4x4 (lossless) mmx.
2014-01-24 19:25:25 -05:00
Ronald S. Bultje
d43efa68bd
vp9/x86: 4x4 iadst SIMD (ssse3) variants.
...
Cycle measurements for intra itxfm_4x4_add on ped1080p.webm:
idct_idct: 66 -> 67 cycles (noise measurement)
idct_iadst: 199 -> 79 cycles
iadst_idct: 165 -> 70 cycles
iadst_iadst: 183 -> 82 cycles
2014-01-24 19:25:25 -05:00
Ronald S. Bultje
baf47020cd
vp9/x86: 8x8 iadst SIMD (ssse3/avx) variants.
...
Cycle measurements for intra itxfm_8x8_add on ped1080p.webm:
idct_idct: 133 -> 135 cycles (noise measurement)
idct_iadst: 900 -> 241 cycles
iadst_idct: 864 -> 215 cycles
iadst_iadst: 973 -> 310 cycles
2014-01-24 19:25:25 -05:00
Michael Niedermayer
e6d1c66d74
avcodec/x86/lossless_videodsp: disable median optimizations for 16bps
...
They only support up to 15bps
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-23 01:51:24 +01:00
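For context on the median-prediction commits in this listing, here is a minimal scalar sketch in C of HuffYUV-style median prediction on 16-bit samples. The names `mid3` and `add_median_pred_int16` are hypothetical and not taken from the tree. The commit does not say why the SIMD versions stop at 15bps; one plausible reason, stated here purely as an assumption, is that signed 16-bit min/max instructions misorder samples whose top bit is set.

```c
#include <stdint.h>

/* Median of three integers. */
static int mid3(int a, int b, int c)
{
    if (a > b) { int t = a; a = b; b = t; }  /* now a <= b */
    if (b > c) b = c;                        /* b = min(b, c) */
    return a > b ? a : b;                    /* max(a, min(b, c)) */
}

/*
 * Reconstruct one row from residuals (hypothetical helper).
 * top:  the already decoded row above
 * diff: residuals for the current row
 * mask: (1 << bits_per_sample) - 1
 * left / left_top carry predictor state across calls.
 */
static void add_median_pred_int16(uint16_t *dst, const uint16_t *top,
                                  const uint16_t *diff, unsigned mask,
                                  int w, int *left, int *left_top)
{
    int l = *left, lt = *left_top;

    for (int i = 0; i < w; i++) {
        /* gradient predictor: left + top - top-left, wrapped to bit depth */
        int grad = (int)((unsigned)(l + top[i] - lt) & mask);
        l  = (int)((unsigned)(mid3(l, top[i], grad) + diff[i]) & mask);
        lt = top[i];
        dst[i] = (uint16_t)l;
    }

    *left     = l;
    *left_top = lt;
}
```

With a 10-bit mask (`0x3FF`) every intermediate value stays well inside the signed 16-bit range, which is the situation the remaining SIMD paths are limited to.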
Michael Niedermayer
eaacfc7dd1
avcodec/lossless_videodsp: Pass AVCodecContext to init
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-23 01:43:00 +01:00
Michael Niedermayer
ef00ef7553
avcodec/x86/lossless_videodsp: port sub_hfyu_median_prediction_int16 to yasm
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-22 23:27:27 +01:00
Michael Niedermayer
fad49aae28
avcodec/x86/lossless_videodsp: Port sub_hfyu_median_prediction_mmxext to int16
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-22 22:55:49 +01:00
Michael Niedermayer
fee97f25fa
avcodec/x86/lossless_videodsp: port add_hfyu_median_prediction_mmxext to 16bit
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-22 21:11:40 +01:00
Michael Niedermayer
631939bde6
avcodec/x86/lossless_videodsp: add diff_int16_mmx/sse2
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-22 19:41:21 +01:00
Reimar Döffinger
76421982d0
lossless_videodsp.asm: fix compilation.
...
Fixes these errors with nasm:
libavcodec/x86/lossless_videodsp.asm:86: error: invalid combination of opcode and operands
libavcodec/x86/lossless_videodsp.asm:88: error: invalid combination of opcode and operands
I don't know whether movd or movq was meant, but either way
maskq vs. maskd must match the mov size.
Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
2014-01-21 19:46:02 +01:00
Michael Niedermayer
83b67ca056
avcodec/x86/lossless_videodsp: Port lorens add_hfyu_left_prediction_ssse3/sse4 to 16bit
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-21 02:55:41 +01:00
Michael Niedermayer
63d2be7533
avcodec/x86/lossless_videodsp: use SPLATW in add_int16
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-21 02:33:20 +01:00
Michael Niedermayer
f70d7eb20c
Move add/diff_int16 to lossless_videodsp
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-20 21:32:47 +01:00
Michael Niedermayer
a493f8541d
avcodec/x86/dsp: add_int16_mmx / add_int16_sse2
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-20 04:06:46 +01:00
James Almer
26800e3864
vp9/x86: rename ff_avg[48]_sse to ff_avg[48]_mmxext
...
pavgb is an sse integer instruction, so the mmxext flag is enough
Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-18 17:08:25 +01:00
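The rename above rests on what `pavgb` does: it averages unsigned bytes with rounding up, and its MMX-register form arrived with the SSE integer extensions that the mmxext flag also covers. A scalar C model of one lane, plus a hypothetical row helper (`avg_row` is an illustrative name, not a function from the tree), might look like this:

```c
#include <stdint.h>

/* Scalar model of one pavgb lane: (a + b + 1) >> 1 on unsigned bytes. */
static uint8_t pavgb_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Hypothetical helper: blend n source pixels into dst, as an "avg" MC
 * function does for one row. A 4- or 8-pixel row fits in one MMX register. */
static void avg_row(uint8_t *dst, const uint8_t *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = pavgb_lane(dst[i], src[i]);
}
```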
James Almer
d2a7314f1e
vp9/x86: add ff_vp9_loop_filter_[vh]_16_16_sse2().
...
Performance gains are similar to those of the SSSE3 version.
Signed-off-by: James Almer <jamrial@gmail.com>
2014-01-17 14:16:38 +01:00
Ronald S. Bultje
8173d1ffc0
vp9/x86: 16x16 iadst_idct, idct_iadst and iadst_iadst (ssse3+avx).
...
Sample timings on ped1080p.webm (of the ssse3 functions):
iadst_idct: 4672 -> 1175 cycles
idct_iadst: 4736 -> 1263 cycles
iadst_iadst: 4924 -> 1438 cycles
Total decoding time changed from 6.565s to 6.413s.
2014-01-16 13:49:31 +01:00
Clément Bœsch
9cc8fa63dd
vp9/x86: simplify a few mc inits.
2014-01-16 07:48:27 +01:00
Michael Niedermayer
6391dec82a
Merge remote-tracking branch 'qatar/master'
...
* qatar/master:
x86: dsputil: Simplify xvmc deprecation conditional
Conflicts:
libavcodec/x86/dsputil_init.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-15 20:41:08 +01:00
Diego Biurrun
aab40bbfd5
x86: dsputil: Simplify xvmc deprecation conditional
2014-01-15 15:23:46 +01:00
Clément Bœsch
8b4190da93
vp9/x86: add AVX for itxfm and lpf.
...
4412 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 4193462 runs, 842 skips
3600 decicycles in ff_vp9_loop_filter_h_16_16_avx, 4193621 runs, 683 skips
3010 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 4193528 runs, 776 skips
2678 decicycles in ff_vp9_loop_filter_v_16_16_avx, 4193742 runs, 562 skips
23025 decicycles in ff_vp9_idct_idct_32x32_add_ssse3, 2096871 runs, 281 skips
19943 decicycles in ff_vp9_idct_idct_32x32_add_avx, 2096815 runs, 337 skips
4675 decicycles in ff_vp9_idct_idct_16x16_add_ssse3, 4194018 runs, 286 skips
3980 decicycles in ff_vp9_idct_idct_16x16_add_avx, 4194022 runs, 282 skips
967 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 16776972 runs, 244 skips
887 decicycles in ff_vp9_idct_idct_8x8_add_avx, 16777002 runs, 214 skips
2014-01-15 15:54:03 +01:00
Michael Niedermayer
cb613657ee
avcodec/x86/proresdsp_init: x86 prores IDCT is bitexact again
...
re-enable it for bitexact mode
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-14 15:59:00 +01:00
Michael Niedermayer
b148a39d55
Merge commit '46bacb5cc6169ff5e8e982495c4925467c1d8bb7'
...
* commit '46bacb5cc6169ff5e8e982495c4925467c1d8bb7':
x86: Consistently use cpu flag detection macros in places that still miss it
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-14 14:44:59 +01:00
Diego Biurrun
46bacb5cc6
x86: Consistently use cpu flag detection macros in places that still miss it
2014-01-14 00:04:58 +01:00
Clément Bœsch
af68bd1c06
vp9/x86: add ff_vp9_loop_filter_[vh]_16_16_ssse3().
...
16662 decicycles in loop_filter_h_16_16_c, 8387355 runs, 1253 skips
17510 decicycles in loop_filter_v_16_16_c, 8387516 runs, 1092 skips
4941 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 8387887 runs, 721 skips
3899 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 8387980 runs, 628 skips
Overall decode time goes from:
./ffmpeg -v 0 -nostats -threads 1 -i ~/samples/vp9/ped1080p.webm -f null - 8.10s user 0.02s system 99% cpu 8.126 total
to:
./ffmpeg -v 0 -nostats -threads 1 -i ~/samples/vp9/ped1080p.webm -f null - 6.15s user 0.04s system 99% cpu 6.199 total
(46 to 61 fps)
2014-01-12 20:20:24 +01:00
Clément Bœsch
e11ceea68f
vp9/x86: factor out some code in VP9_UNPACK_MULSUB_2W_4X.
2014-01-12 20:19:00 +01:00
Clément Bœsch
c9aa0b8f70
vp9/x86: remove reg redundancy in VP9_MULSUB_2W_2X.
2014-01-12 20:18:55 +01:00
Clément Bœsch
7c55ee6168
vp9/x86: merge IDCT coef macros.
2014-01-12 20:18:44 +01:00
Michael Niedermayer
92b2404571
Merge commit '4c642d8d98703faf52983243098f35865e15b312'
...
* commit '4c642d8d98703faf52983243098f35865e15b312':
x86: hpeldsp: Add missing av_cold attribute to init function
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-09 20:32:53 +01:00
Michael Niedermayer
390452bab6
Merge commit 'b0be1ae792ac8bbfb0fc7b9b9cb39eaf0feb489b'
...
* commit 'b0be1ae792ac8bbfb0fc7b9b9cb39eaf0feb489b':
x86: avcodec: Add a bunch of missing #includes for av_cold
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-09 20:24:15 +01:00
Diego Biurrun
4c642d8d98
x86: hpeldsp: Add missing av_cold attribute to init function
2014-01-09 15:09:07 +01:00
Diego Biurrun
b0be1ae792
x86: avcodec: Add a bunch of missing #includes for av_cold
2014-01-09 15:09:07 +01:00
Ronald S. Bultje
c6fe984f2f
vp9/x86: make STORE_2X2 macro local.
...
Prevents this assembler warning:
libavcodec/x86/vp9itxfm.asm:1208: warning: (VP9_IDCT32_1D:309)
redefining multi-line macro `STORE_2X2'
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-08 14:07:15 +01:00
Ronald S. Bultje
04a187fb2a
vp9/x86: idct_32x32_add_ssse3 sub-8x8-idct.
...
Runtime of the full 32x32 idct goes from 2446 to 2441 cycles (intra) or
from 1425 to 1306 cycles (inter). Overall runtime is not significantly
affected.
2014-01-07 20:43:35 -05:00
Ronald S. Bultje
37b001d14d
vp9/x86: idct_32x32_add_ssse3 sub-16x16-idct.
...
Runtime of all IDCTs together goes from 3327 to 2473 cycles (intra, i.e.
~35% faster) or from 2312 to 1448 cycles (inter, i.e. ~60% faster). Total
decode time of ped1080p.webm goes from 8.086sec to 7.974sec (1.4% faster).
2014-01-07 20:43:34 -05:00
Ronald S. Bultje
e84d14df10
vp9/x86: idct_32x32_add_ssse3.
...
Sub-IDCTs will follow later. ped1080p.webm goes from 9.295s to 8.191s
(13.5% faster). The IDCT itself goes from 4372 (intra) or 4337 (inter)
to 403 (intra) or 329 (inter) cycles for the DC-only form, 23755 (intra)
or 23723 (inter) to 3497 (intra) or 3607 (inter) cycles for the no-DC
form, which averages from 23393 (intra) or 16612 (inter) to 3449 (intra)
or 2392 (inter) for all 32x32s together, i.e. about 7x faster (all
tests done on ped1080p.webm).
2014-01-07 20:43:30 -05:00
Michael Niedermayer
30056fd0be
Merge commit 'a03a642d5ceb5f2f7c6ebbf56ff365dfbcdb65eb'
...
* commit 'a03a642d5ceb5f2f7c6ebbf56ff365dfbcdb65eb':
h264: do not use 422 functions for monochrome
See: 07abf13da4a7c3d23ce6bc6542d72e6252161736
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-06 16:51:23 +01:00
Anton Khirnov
a03a642d5c
h264: do not use 422 functions for monochrome
...
Fixes invalid memory access.
Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
CC:libav-stable@libav.org
2014-01-06 08:25:36 +01:00
Ronald S. Bultje
18175baa54
vp9/x86: 16px MC functions (64bit only).
...
Cycle counts for large MCs (old -> new on ped1080p.webm, mx!=0&&my!=0):
16x8: 876 -> 870 (0.7%)
16x16: 1444 -> 1435 (0.7%)
16x32: 2784 -> 2748 (1.3%)
32x16: 2455 -> 2349 (4.5%)
32x32: 4641 -> 4084 (13.6%)
32x64: 9200 -> 7834 (17.4%)
64x32: 8980 -> 7197 (24.8%)
64x64: 17330 -> 13796 (25.6%)
Total decoding time goes from 9.326sec to 9.182sec.
2013-12-26 21:05:10 -05:00
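The percentages in the MC table above appear to be measured against the new cycle count, i.e. (old - new) / new. That is an inference from the numbers, not something the commit states. A small C sketch of the arithmetic, with `gain_percent` as a hypothetical helper name:

```c
/* Relative gain of the new code over the old one, in percent,
 * measured against the new cycle count (assumed convention). */
static double gain_percent(double old_cycles, double new_cycles)
{
    return (old_cycles - new_cycles) / new_cycles * 100.0;
}
```

For example, `gain_percent(4641, 4084)` gives roughly 13.6 and `gain_percent(17330, 13796)` roughly 25.6, matching the 32x32 and 64x64 rows. Most other rows agree to the listed precision as well.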
Ronald S. Bultje
0d9375fc90
vp9/x86: 16x16 sub-IDCT for top-left 8x8 subblock (eob <= 38).
...
Sub8x8 speed (w/o dc-only case) goes from ~750 cycles (inter) or ~735
cycles (intra) to ~415 cycles (inter) or ~430 cycles (intra). Average
overall 16x16 idct speed goes from ~635 cycles (inter) or ~720 cycles
(intra) to ~415 cycles (inter) or ~545 (intra) - all measurements done
using ped1080p.webm.
2013-12-26 07:40:25 -05:00
Ivan Kalvachev
1c63aed232
Convert XvMC to hwaccel v3
...
Signed-off-by: Ivan Kalvachev <ikalvachev@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-12-22 22:03:47 +01:00
Michael Niedermayer
ce612fc186
Merge commit 'dfc50ac85e9d68a771b556297b7c411650206f3b'
...
* commit 'dfc50ac85e9d68a771b556297b7c411650206f3b':
x86: mpegvideo: move denoise_dct asm to mpegvideoenc
Conflicts:
libavcodec/x86/mpegvideo.c
libavcodec/x86/mpegvideoenc.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-12-20 23:44:31 +01:00
Anton Khirnov
dfc50ac85e
x86: mpegvideo: move denoise_dct asm to mpegvideoenc
...
This function is encoding-only.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
2013-12-20 17:16:11 +01:00
Ronald S. Bultje
8d4c616fc0
vp9/x86: idct_add_16x16_ssse3.
...
Currently only dc-only and full 16x16. Other subforms will follow in the
near future. Total decoding time of ped1080p.webm goes from 9.7 to 9.3
seconds. DC-only goes from 957 -> 131 cycles, and the full IDCT goes
from ~4050 to ~745 cycles.
2013-12-14 12:13:26 -05:00
Michael Niedermayer
8e70fdab36
Merge commit '4958f35a2ebc307049ff2104ffb944f5f457feb3'
...
* commit '4958f35a2ebc307049ff2104ffb944f5f457feb3':
dsputil: Move apply_window_int16 to ac3dsp
Conflicts:
libavcodec/arm/ac3dsp_init_arm.c
libavcodec/arm/ac3dsp_neon.S
libavcodec/x86/ac3dsp_init.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-12-09 04:12:40 +01:00
Diego Biurrun
4958f35a2e
dsputil: Move apply_window_int16 to ac3dsp
...
The (optimized) functions are used nowhere else.
2013-12-08 17:57:15 +01:00
Ronald S. Bultje
92436e8ad9
vp9: implement top/left half (4x4) sub-8x8-IDCT.
...
For that specific case (eob>3&&eob<=12), runtime of idct8x8 goes from
668 to 477 cycles. For all idct8x8, runtime goes from 521 to 490 cycles.
2013-12-07 12:39:36 -05:00
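The eob thresholds quoted above suggest a dispatch on the end-of-block position: when few enough coefficients are coded, only the top-left part of the block can be non-zero and a cheaper sub-IDCT suffices. A minimal C sketch of that selection for 8x8 follows. The enum and function names are hypothetical, and how the `eob <= 3` case is handled is not described in the commit, so it is only labeled as a separate path here.

```c
/* Hypothetical labels for the 8x8 inverse-transform paths. */
enum idct8x8_path {
    IDCT8_SMALL,  /* eob <= 3: some cheaper path, not described in the commit */
    IDCT8_HALF,   /* 3 < eob <= 12: top/left 4x4 sub-IDCT */
    IDCT8_FULL    /* eob > 12: full 8x8 IDCT */
};

/* Pick a path from the end-of-block index, using the thresholds
 * quoted in the commit message (eob > 3 && eob <= 12). */
static enum idct8x8_path pick_idct8x8_path(int eob)
{
    if (eob > 12)
        return IDCT8_FULL;
    if (eob > 3)
        return IDCT8_HALF;
    return IDCT8_SMALL;
}
```

The 16x16 commit in this listing uses the same idea with a different threshold (eob <= 38 for the top-left 8x8 subblock).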
Ronald S. Bultje
b2045c44a9
vp9: split pre-load of 11585x2 out of 1d idct macro.
...
This allows us to load it only once, instead of twice, in this function.
2013-12-07 12:39:36 -05:00
Ronald S. Bultje
f9a0d4c6e0
vp9: minor refactorings in idct ssse3 assembly.
...
Make register usage in macros explicit; change mulsub_2w_4x to use 2
instead of 3 temp registers.
2013-12-07 12:39:35 -05:00
Ronald S. Bultje
8729964b99
vp9: split x86 assembly in two files.
...
(In the future, loopfilter or intra pred could also be put in their own
files.)
2013-12-07 12:39:35 -05:00