Clément Bœsch
cddbfd2a95
x86/lossless_videodsp: simplify and explicit aligned/unaligned flags
2014-01-25 11:59:43 +01:00
Ronald S. Bultje
c9e6325ed9
vp9/x86: use explicit register for relative stack references.
...
Before this patch, we explicitly modify rsp, which isn't necessarily
universally acceptable, since the space under the stack pointer might
be modified in things like signal handlers. Therefore, use an explicit
register to hold the stack pointer relative to the bottom of the stack
(i.e. rsp). This will also clear out valgrind errors about the use of
uninitialized data that started occurring after the idct16x16/ssse3
optimizations were first merged.
2014-01-24 19:25:25 -05:00
Ronald S. Bultje
97474d527f
vp9/x86: iwht4x4 (lossless) mmx.
2014-01-24 19:25:25 -05:00
Ronald S. Bultje
d43efa68bd
vp9/x86: 4x4 iadst SIMD (ssse3) variants.
...
Cycle measurements for intra itxfm_4x4_add on ped1080p.webm:
idct_idct: 66 -> 67 cycles (noise measurement)
idct_iadst: 199 -> 79 cycles
iadst_idct: 165 -> 70 cycles
iadst_iadst: 183 -> 82 cycles
2014-01-24 19:25:25 -05:00
Ronald S. Bultje
baf47020cd
vp9/x86: 8x8 iadst SIMD (ssse3/avx) variants.
...
Cycle measurements for intra itxfm_8x8_add on ped1080p.webm:
idct_idct: 133 -> 135 cycles (noise measurement)
idct_iadst: 900 -> 241 cycles
iadst_idct: 864 -> 215 cycles
iadst_iadst: 973 -> 310 cycles
2014-01-24 19:25:25 -05:00
Michael Niedermayer
e6d1c66d74
avcodec/x86/lossless_videodsp: disable median optimizations for 16bps
...
They only support up to 15bps
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-23 01:51:24 +01:00
Michael Niedermayer
eaacfc7dd1
avcodec/lossless_videodsp: Pass AVCodecContext to init
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-23 01:43:00 +01:00
Michael Niedermayer
ef00ef7553
avcodec/x86/lossless_videodsp: port sub_hfyu_median_prediction_int16 to yasm
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-22 23:27:27 +01:00
Michael Niedermayer
fad49aae28
avcodec/x86/lossless_videodsp: Port sub_hfyu_median_prediction_mmxext to int16
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-22 22:55:49 +01:00
Michael Niedermayer
fee97f25fa
avcodec/x86/lossless_videodsp: port add_hfyu_median_prediction_mmxext to 16bit
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-22 21:11:40 +01:00
Michael Niedermayer
631939bde6
avcodec/x86/lossless_videodsp: add diff_int16_mmx/sse2
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-22 19:41:21 +01:00
Reimar Döffinger
76421982d0
lossless_videodsp.asm: fix compilation.
...
Fixes these errors with nasm:
libavcodec/x86/lossless_videodsp.asm:86: error: invalid combination of opcode and operands
libavcodec/x86/lossless_videodsp.asm:88: error: invalid combination of opcode and operands
I don't know whether movd or movq was meant, but either way
maskq vs. maskd must match the mov size.
Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
2014-01-21 19:46:02 +01:00
Michael Niedermayer
83b67ca056
avcodec/x86/lossless_videodsp: Port lorens add_hfyu_left_prediction_ssse3/sse4 to 16bit
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-21 02:55:41 +01:00
Michael Niedermayer
63d2be7533
avcodec/x86/lossless_videodsp: use SPLATW in add_int16
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-21 02:33:20 +01:00
Michael Niedermayer
f70d7eb20c
Move add/diff_int16 to lossless_videodsp
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-20 21:32:47 +01:00
Michael Niedermayer
a493f8541d
avcodec/x86/dsp: add_int16_mmx / add_int16_sse2
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-20 04:06:46 +01:00
James Almer
26800e3864
vp9/x86: rename ff_avg[48]_sse to ff_avg[48]_mmxext
...
pavgb is an sse integer instruction, so the mmxext flag is enough
Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-18 17:08:25 +01:00
James Almer
d2a7314f1e
vp9/x86: add ff_vp9_loop_filter_[vh]_16_16_sse2().
...
Performance gains similar to the SSSE3 version
Signed-off-by: James Almer <jamrial@gmail.com>
2014-01-17 14:16:38 +01:00
Ronald S. Bultje
8173d1ffc0
vp9/x86: 16x16 iadst_idct, idct_iadst and iadst_iadst (ssse3+avx).
...
Sample timings on ped1080p.webm (of the ssse3 functions):
iadst_idct: 4672 -> 1175 cycles
idct_iadst: 4736 -> 1263 cycles
iadst_iadst: 4924 -> 1438 cycles
Total decoding time changed from 6.565s to 6.413s.
2014-01-16 13:49:31 +01:00
Clément Bœsch
9cc8fa63dd
vp9/x86: simplify a few mc inits.
2014-01-16 07:48:27 +01:00
Michael Niedermayer
6391dec82a
Merge remote-tracking branch 'qatar/master'
...
* qatar/master:
x86: dsputil: Simplify xvmc deprecation conditional
Conflicts:
libavcodec/x86/dsputil_init.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-15 20:41:08 +01:00
Diego Biurrun
aab40bbfd5
x86: dsputil: Simplify xvmc deprecation conditional
2014-01-15 15:23:46 +01:00
Clément Bœsch
8b4190da93
vp9/x86: add AVX for itxfm and lpf.
...
4412 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 4193462 runs, 842 skips
3600 decicycles in ff_vp9_loop_filter_h_16_16_avx, 4193621 runs, 683 skips
3010 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 4193528 runs, 776 skips
2678 decicycles in ff_vp9_loop_filter_v_16_16_avx, 4193742 runs, 562 skips
23025 decicycles in ff_vp9_idct_idct_32x32_add_ssse3, 2096871 runs, 281 skips
19943 decicycles in ff_vp9_idct_idct_32x32_add_avx, 2096815 runs, 337 skips
4675 decicycles in ff_vp9_idct_idct_16x16_add_ssse3, 4194018 runs, 286 skips
3980 decicycles in ff_vp9_idct_idct_16x16_add_avx, 4194022 runs, 282 skips
967 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 16776972 runs, 244 skips
887 decicycles in ff_vp9_idct_idct_8x8_add_avx, 16777002 runs, 214 skips
2014-01-15 15:54:03 +01:00
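The timer lines in the AVX commit above (8b4190da93) are reported in decicycles, i.e. tenths of a cycle. The following is a minimal sketch, written in Python for illustration and not taken from the FFmpeg tree, of how those figures translate into cycles and into an SSSE3-to-AVX ratio. The table name and helper function are hypothetical; only the numbers come from the commit message.

```python
# Decicycle figures copied from the commit message: (ssse3, avx).
# The dictionary layout and the helper name are illustrative assumptions.
TIMINGS = {
    "loop_filter_h_16_16": (4412, 3600),
    "loop_filter_v_16_16": (3010, 2678),
    "idct_idct_32x32_add": (23025, 19943),
    "idct_idct_16x16_add": (4675, 3980),
    "idct_idct_8x8_add": (967, 887),
}


def avx_gain(ssse3_deci, avx_deci):
    """Return (ssse3 cycles, avx cycles, ssse3/avx ratio)."""
    # One decicycle is a tenth of a cycle, so divide by 10 to get cycles.
    return ssse3_deci / 10, avx_deci / 10, ssse3_deci / avx_deci


for name, (ssse3, avx) in TIMINGS.items():
    c_ssse3, c_avx, ratio = avx_gain(ssse3, avx)
    print(f"{name}: {c_ssse3:.1f} -> {c_avx:.1f} cycles ({ratio:.2f}x)")
```

Every pair yields a ratio above 1, so the AVX variant is the faster one for each function listed.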
Michael Niedermayer
cb613657ee
avcodec/x86/proresdsp_init: x86 prores IDCT is bitexact again
...
Re-enable it for bitexact mode
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-14 15:59:00 +01:00
Michael Niedermayer
b148a39d55
Merge commit '46bacb5cc6169ff5e8e982495c4925467c1d8bb7'
...
* commit '46bacb5cc6169ff5e8e982495c4925467c1d8bb7':
x86: Consistently use cpu flag detection macros in places that still miss it
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-14 14:44:59 +01:00
Diego Biurrun
46bacb5cc6
x86: Consistently use cpu flag detection macros in places that still miss it
2014-01-14 00:04:58 +01:00
Clément Bœsch
af68bd1c06
vp9/x86: add ff_vp9_loop_filter_[vh]_16_16_ssse3().
...
16662 decicycles in loop_filter_h_16_16_c, 8387355 runs, 1253 skips
17510 decicycles in loop_filter_v_16_16_c, 8387516 runs, 1092 skips
4941 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 8387887 runs, 721 skips
3899 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 8387980 runs, 628 skips
Overall decode time goes from:
./ffmpeg -v 0 -nostats -threads 1 -i ~/samples/vp9/ped1080p.webm -f null - 8.10s user 0.02s system 99% cpu 8.126 total
to:
./ffmpeg -v 0 -nostats -threads 1 -i ~/samples/vp9/ped1080p.webm -f null - 6.15s user 0.04s system 99% cpu 6.199 total
(46 to 61 fps)
2014-01-12 20:20:24 +01:00
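As a sanity check on the loop filter commit above (af68bd1c06), the wall-clock totals and the rounded fps figures can be compared against each other. This is a rough sketch in Python, not part of the original commit; the variable names are illustrative, and the frame count of ped1080p.webm is not stated in the log, so it is only inferred here.

```python
# Figures copied from the commit message.
old_total, new_total = 8.126, 6.199  # seconds, "total" field of the two runs
old_fps, new_fps = 46, 61            # rounded fps figures

# Speedup according to wall-clock time and according to the fps figures.
time_speedup = old_total / new_total
fps_speedup = new_fps / old_fps

# Frame count implied by each run (an inference, not a number from the log).
frames_old = old_fps * old_total
frames_new = new_fps * new_total

print(f"speedup by time: {time_speedup:.2f}x, by fps: {fps_speedup:.2f}x")
print(f"implied frame count: {frames_old:.0f} vs {frames_new:.0f}")
```

Both runs imply roughly the same number of decoded frames, which is what one would expect if the fps values were derived from the same clip and merely rounded.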
Clément Bœsch
e11ceea68f
vp9/x86: factor out some code in VP9_UNPACK_MULSUB_2W_4X.
2014-01-12 20:19:00 +01:00
Clément Bœsch
c9aa0b8f70
vp9/x86: remove reg redundancy in VP9_MULSUB_2W_2X.
2014-01-12 20:18:55 +01:00
Clément Bœsch
7c55ee6168
vp9/x86: merge IDCT coef macros.
2014-01-12 20:18:44 +01:00
Michael Niedermayer
92b2404571
Merge commit '4c642d8d98703faf52983243098f35865e15b312'
...
* commit '4c642d8d98703faf52983243098f35865e15b312':
x86: hpeldsp: Add missing av_cold attribute to init function
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-09 20:32:53 +01:00
Michael Niedermayer
390452bab6
Merge commit 'b0be1ae792ac8bbfb0fc7b9b9cb39eaf0feb489b'
...
* commit 'b0be1ae792ac8bbfb0fc7b9b9cb39eaf0feb489b':
x86: avcodec: Add a bunch of missing #includes for av_cold
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-09 20:24:15 +01:00
Diego Biurrun
4c642d8d98
x86: hpeldsp: Add missing av_cold attribute to init function
2014-01-09 15:09:07 +01:00
Diego Biurrun
b0be1ae792
x86: avcodec: Add a bunch of missing #includes for av_cold
2014-01-09 15:09:07 +01:00
Ronald S. Bultje
c6fe984f2f
vp9/x86: make STORE_2X2 macro local.
...
Prevents this assembler warning:
libavcodec/x86/vp9itxfm.asm:1208: warning: (VP9_IDCT32_1D:309)
redefining multi-line macro `STORE_2X2'
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-08 14:07:15 +01:00
Ronald S. Bultje
04a187fb2a
vp9/x86: idct_32x32_add_ssse3 sub-8x8-idct.
...
Runtime of the full 32x32 idct goes from 2446 to 2441 cycles (intra) or
from 1425 to 1306 cycles (inter). Overall runtime is not significantly
affected.
2014-01-07 20:43:35 -05:00
Ronald S. Bultje
37b001d14d
vp9/x86: idct_32x32_add_ssse3 sub-16x16-idct.
...
Runtime of all IDCTs together goes from 3327 to 2473 cycles (intra, i.e.
~35% faster) or from 2312 to 1448 cycles (inter, i.e. ~60% faster). Total
decode time of ped1080p.webm goes from 8.086sec to 7.974sec (1.4% faster).
2014-01-07 20:43:34 -05:00
Ronald S. Bultje
e84d14df10
vp9/x86: idct_32x32_add_ssse3.
...
Sub-IDCTs will follow later. ped1080p.webm goes from 9.295s to 8.191s
(13.5% faster). The IDCT itself goes from 4372 (intra) or 4337 (inter)
to 403 (intra) or 329 (inter) cycles for the DC-only form, 23755 (intra)
or 23723 (inter) to 3497 (intra) or 3607 (inter) cycles for the no-DC
form, which averages from 23393 (intra) or 16612 (inter) to 3449 (intra)
or 2392 (inter) for all 32x32s together, i.e. about 7x faster (all
tests done on ped1080p.webm).
2014-01-07 20:43:30 -05:00
Michael Niedermayer
30056fd0be
Merge commit 'a03a642d5ceb5f2f7c6ebbf56ff365dfbcdb65eb'
...
* commit 'a03a642d5ceb5f2f7c6ebbf56ff365dfbcdb65eb':
h264: do not use 422 functions for monochrome
See: 07abf13da4
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-06 16:51:23 +01:00
Anton Khirnov
a03a642d5c
h264: do not use 422 functions for monochrome
...
Fixes invalid memory access.
Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
CC:libav-stable@libav.org
2014-01-06 08:25:36 +01:00
Ronald S. Bultje
18175baa54
vp9/x86: 16px MC functions (64bit only).
...
Cycle counts for large MCs (old -> new on ped1080p.webm, mx!=0&&my!=0):
16x8: 876 -> 870 (0.7%)
16x16: 1444 -> 1435 (0.7%)
16x32: 2784 -> 2748 (1.3%)
32x16: 2455 -> 2349 (4.5%)
32x32: 4641 -> 4084 (13.6%)
32x64: 9200 -> 7834 (17.4%)
64x32: 8980 -> 7197 (24.8%)
64x64: 17330 -> 13796 (25.6%)
Total decoding time goes from 9.326sec to 9.182sec.
2013-12-26 21:05:10 -05:00
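The percentages in the 16px MC commit above (0d9375fc90's predecessor in this list, 18175baa54) appear to be computed as old/new - 1. That formula is an assumption inferred from the numbers, not something the commit states. A short Python sketch, with hypothetical names, that recomputes the column from the rounded cycle counts:

```python
# (block size, old cycles, new cycles, percentage reported in the log)
ROWS = [
    ("16x8", 876, 870, 0.7),
    ("16x16", 1444, 1435, 0.7),
    ("16x32", 2784, 2748, 1.3),
    ("32x16", 2455, 2349, 4.5),
    ("32x32", 4641, 4084, 13.6),
    ("32x64", 9200, 7834, 17.4),
    ("64x32", 8980, 7197, 24.8),
    ("64x64", 17330, 13796, 25.6),
]


def gain_percent(old, new):
    """Speed gain in percent, assuming the log used old/new - 1."""
    return (old / new - 1) * 100


for size, old, new, reported in ROWS:
    computed = gain_percent(old, new)
    print(f"{size}: computed {computed:.1f}%, log says {reported}%")
```

All rows agree with the log to within a tenth of a percentage point. The 16x16 row comes out slightly lower when recomputed from the rounded counts, presumably because the original figure was derived from unrounded measurements.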
Ronald S. Bultje
0d9375fc90
vp9/x86: 16x16 sub-IDCT for top-left 8x8 subblock (eob <= 38).
...
Sub8x8 speed (w/o dc-only case) goes from ~750 cycles (inter) or ~735
cycles (intra) to ~415 cycles (inter) or ~430 cycles (intra). Average
overall 16x16 idct speed goes from ~635 cycles (inter) or ~720 cycles
(intra) to ~415 cycles (inter) or ~545 (intra) - all measurements done
using ped1080p.webm.
2013-12-26 07:40:25 -05:00
Ivan Kalvachev
1c63aed232
Convert XvMC to hwaccel v3
...
Signed-off-by: Ivan Kalvachev <ikalvachev@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-12-22 22:03:47 +01:00
Michael Niedermayer
ce612fc186
Merge commit 'dfc50ac85e9d68a771b556297b7c411650206f3b'
...
* commit 'dfc50ac85e9d68a771b556297b7c411650206f3b':
x86: mpegvideo: move denoise_dct asm to mpegvideoenc
Conflicts:
libavcodec/x86/mpegvideo.c
libavcodec/x86/mpegvideoenc.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-12-20 23:44:31 +01:00
Anton Khirnov
dfc50ac85e
x86: mpegvideo: move denoise_dct asm to mpegvideoenc
...
This function is encoding-only.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
2013-12-20 17:16:11 +01:00
Ronald S. Bultje
8d4c616fc0
vp9/x86: idct_add_16x16_ssse3.
...
Currently only dc-only and full 16x16. Other subforms will follow in the
near future. Total decoding time of ped1080p.webm goes from 9.7 to 9.3
seconds. DC-only goes from 957 -> 131 cycles, and the full IDCT goes
from ~4050 to ~745 cycles.
2013-12-14 12:13:26 -05:00
Michael Niedermayer
8e70fdab36
Merge commit '4958f35a2ebc307049ff2104ffb944f5f457feb3'
...
* commit '4958f35a2ebc307049ff2104ffb944f5f457feb3':
dsputil: Move apply_window_int16 to ac3dsp
Conflicts:
libavcodec/arm/ac3dsp_init_arm.c
libavcodec/arm/ac3dsp_neon.S
libavcodec/x86/ac3dsp_init.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-12-09 04:12:40 +01:00
Diego Biurrun
4958f35a2e
dsputil: Move apply_window_int16 to ac3dsp
...
The (optimized) functions are used nowhere else.
2013-12-08 17:57:15 +01:00
Ronald S. Bultje
92436e8ad9
vp9: implement top/left half (4x4) sub-8x8-IDCT.
...
For that specific case (eob>3&&eob<=12), runtime of idct8x8 goes from
668 to 477 cycles. For all idct8x8, runtime goes from 521 to 490 cycles.
2013-12-07 12:39:36 -05:00
Ronald S. Bultje
b2045c44a9
vp9: split pre-load of 11585x2 out of 1d idct macro.
...
This allows us to load it only once, instead of twice, in this function.
2013-12-07 12:39:36 -05:00