Should fix crashes or corrupt output on pre-SSE2 CPUs when they were
using SSE2-code (e.g. AMD Athlon XP 2400+ or Intel Pentium III) in
hfix or hvar single-edge (left/right) extension functions.
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
9680 decicycles in loop_filter_v_88_16_c, 4193765 runs, 539 skips
9233 decicycles in loop_filter_h_88_16_c, 4193751 runs, 553 skips
1929 decicycles in ff_vp9_loop_filter_v_88_16_ssse3, 4194118 runs, 186 skips
2738 decicycles in ff_vp9_loop_filter_h_88_16_ssse3, 4193861 runs, 443 skips
5.978 → 5.417 overall decode time on ped1080p.webm (-threads 1)
Adding SSE2 support should be relatively trivial (just a matter of
changing the pshufb [mask_mix] with something else), patch welcome.
Introduce 2 additional registers for stride3 and mstride3 to allow
direct accesses (lea drops).
3931 → 3827 decicycles in ff_vp9_loop_filter_v_16_16_ssse3
Also uses defines to clarify the code.
Before this patch, we explicitly modify rsp, which isn't necessarily
universally acceptable, since the space under the stack pointer might
be modified in things like signal handlers. Therefore, use an explicit
register to hold the stack pointer relative to the bottom of the stack
(i.e. rsp). This will also clear out valgrind errors about the use of
uninitialized data that started occurring after the idct16x16/ssse3
optimizations were first merged.
Fixes these errors with nasm:
libavcodec/x86/lossless_videodsp.asm:86: error: invalid combination of opcode and operands
libavcodec/x86/lossless_videodsp.asm:88: error: invalid combination of opcode and operands
I don't know whether movd or movq was meant, but either way
maskq vs. maskd must match the mov size.
Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
pavgb is an sse integer instruction, so the mmxext flag is enough
Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit '46bacb5cc6169ff5e8e982495c4925467c1d8bb7':
x86: Consistently use cpu flag detection macros in places that still miss it
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'b0be1ae792ac8bbfb0fc7b9b9cb39eaf0feb489b':
x86: avcodec: Add a bunch of missing #includes for av_cold
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Runtime of the full 32x32 idct goes from 2446 to 2441 cycles (intra) or
from 1425 to 1306 cycles (inter). Overall runtime is not significantly
affected.
Runtime of all IDCTs together goes from 3327 to 2473 cycles (intra, i.e.
~35% faster) or from 2312 to 1448 cycles (inter, i.e. ~60% faster). Total
decode time of ped1080p.webm goes from 8.086sec to 7.974sec (1.4% faster).
Sub-IDCTs will follow later. ped1080.webm goes from 9.295s to 8.191s
(13.5% faster). The IDCT itself goes from 4372 (intra) or 4337 (inter)
to 403 (intra) or 329 (inter) cycles for the DC-only form, 23755 (intra)
or 23723 (inter) to 3497 (intra) or 3607 (inter) cycles for the no-DC
form, which averages from 23393 (intra) or 16612 (inter) to 3449 (intra)
or 2392 (inter) for all 32x32s together, i.e. about ~7x faster (all
tests done on ped1080p.webm).
* commit 'a03a642d5ceb5f2f7c6ebbf56ff365dfbcdb65eb':
h264: do not use 422 functions for monochrome
See: 07abf13da4
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Sub8x8 speed (w/o dc-only case) goes from ~750 cycles (inter) or ~735
cycles (intra) to ~415 cycles (inter) or ~430 cycles (intra). Average
overall 16x16 idct speed goes from ~635 cycles (inter) or ~720 cycles
(intra) to ~415 cycles (inter) or ~545 (intra) - all measurements done
using ped1080p.webm.
Currently only dc-only and full 16x16. Other subforms will follow in the
near future. Total decoding time of ped1080p.webm goes from 9.7 to 9.3
seconds. DC-only goes from 957 -> 131 cycles, and the full IDCT goes
from ~4050 to ~745 cycles.
The mmxext optimizations should be at least equally fast if available and
amd3dnow optimizations are being deprecated. Thus the former should
override the latter, not the other way around.
* qatar/master:
dsputil: x86: Move ff_inv_zigzag_direct16 table init to mpegvideo
If someone optimizes dct_quantize for non x86 SIMD, then this
probably needs to be reverted.
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Original fix by one of these developers:
Anton Khirnov <anton@khirnov.net>
Diego Biurrun <diego@biurrun.de>
Luca Barbato <lu_zero@gentoo.org>
Martin Storsjö <martin@martin.st>
See 97962b2 / 72ca830
Personnal guess is Diego Biurrun.
* commit '458446acfa1441d283dacf9e6e545beb083b8bb0':
lavc: Edge emulation with dst/src linesize
Conflicts:
libavcodec/cavs.c
libavcodec/h264.c
libavcodec/hevc.c
libavcodec/mpegvideo_enc.c
libavcodec/mpegvideo_motion.c
libavcodec/rv34.c
libavcodec/svq3.c
libavcodec/vc1dec.c
libavcodec/videodsp.h
libavcodec/videodsp_template.c
libavcodec/vp3.c
libavcodec/vp8.c
libavcodec/wmv2.c
libavcodec/x86/videodsp.asm
libavcodec/x86/videodsp_init.c
Changes to the asm are not merged, they are left for volunteers or
in their absence for later.
The changes this merge introduces are reordering of the function
arguments
See: face578d56
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Originally written by Ronald S. Bultje <rsbultje@gmail.com> and
Clément Bœsch <u@pkh.me>
Further contributions by:
Anton Khirnov <anton@khirnov.net>
Diego Biurrun <diego@biurrun.de>
Luca Barbato <lu_zero@gentoo.org>
Martin Storsjö <martin@martin.st>
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
Allow supporting files for which the image stride is smaller than
the maximum block size + number of subpel mc taps, e.g. a 64x64 VP9
file or a 16x16 VP8 file with -fflags +emu_edge.
XvMC has long ago been superseded by newer acceleration APIs, such as
VDPAU, and few downstreams still support it. Furthermore XvMC is not
implemented within the hwaccel framework, but requires its own specific
code in the MPEG-1/2 decoder, which is a maintenance burden.
* commit '0338c396987c82b41d322630ea9712fe5f9561d6':
dsputil: Split off H.263 bits into their own H263DSPContext
Conflicts:
configure
libavcodec/mpegvideo.h
libavcodec/mpegvideo_enc.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
1789 decicycles in idct_idct_4x4_add_c, 262136 runs, 8 skips
1839 decicycles in idct_idct_4x4_add_c, 524270 runs, 18 skips
1864 decicycles in idct_idct_4x4_add_c, 1048548 runs, 28 skips
529 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 262138 runs, 6 skips
516 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 524282 runs, 6 skips
474 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 1048565 runs, 11 skips
(~3.9x faster)
7726 decicycles in idct_idct_8x8_add_c, 1048433 runs, 143 skips
7732 decicycles in idct_idct_8x8_add_c, 2096882 runs, 270 skips
7731 decicycles in idct_idct_8x8_add_c, 4193772 runs, 532 skips
1145 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 1048549 runs, 27 skips
1137 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 2097097 runs, 55 skips
1086 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 4194188 runs, 116 skips
(~7.1x faster)
Overall decode time before commit:
16.48s user 0.03s system 99% cpu 16.526 total
16.54s user 0.01s system 99% cpu 16.566 total
16.46s user 0.03s system 99% cpu 16.511 total
Overall decode time after commit:
16.34s user 0.02s system 99% cpu 16.378 total
16.28s user 0.02s system 99% cpu 16.315 total
16.32s user 0.03s system 99% cpu 16.366 total
Tested on i7 920 with 40s 1080p footage.
* commit 'e2b5b097898c9155f4bdff4d83cdc54d5eef6930':
x86: rv40dsp: Use PAVGB instruction macro where appropriate
Merged-by: Michael Niedermayer <michaelni@gmx.at>
There are instructions pavgb and pavgusb. Both instructions do the same
operation but they have different enconding. Pavgb exists in SSE (or
MMXEXT) instruction set and pavgusb exists in 3D-NOW instruction set.
livavcodec uses the macro PAVGB to select the proper instruction. However,
the function avg_pixels8_xy2 doesn't use this macro, it uses pavgb
directly.
As a consequence, the function avg_pixels8_xy2 crashes on AMD K6-2 and
K6-3 processors, because they have pavgusb, but not pavgb.
This bug seems to be introduced by commit
71155d7b41, "dsputil: x86: Convert mpeg4
qpel and dsputil avg to yasm"
Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit '1700b4e678ed329611a16b20d11e64b7abda4839':
x86: vp8dsp: Split loopfilter code into a separate file
Conflicts:
libavcodec/x86/Makefile
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Fixes overreads in HEVC
Fixes Ticket3070
Also fixed remaining issues from Ticket3075 and Ticket3076
Some lines of code taken from 0c5f839693da2276c2da23400f67a67be4ea0af1:libavcodec/x86/cabac.h
and 0c5f839693da2276c2da23400f67a67be4ea0af1:libavcodec/cabac_functions.h
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Don't use word-size multiplications if size == 2, and if we're using
SIMD instructions (size >= 8), complete leftover 4byte sets using movd,
not mov. Both of these changes lead to minor speedups.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
this might fix Ticket2999 as well as some fate clients
untested as the original patch submitter no longer has the environment to test
this should be reverted if it does not fix the issues
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Should fix crashes or corrupt output on pre-SSE2 CPUs when they were
using SSE2-code (e.g. AMD Athlon XP 2400+ or Intel Pentium III) in
hfix or hvar single-edge (left/right) extension functions.
Tested-by: Ingo Brückl <ib@wupperonline.de>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'bbe4a6db44f0b55b424a5cc9d3e89cd88e250450':
x86inc: Utilize the shadow space on 64-bit Windows
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Store XMM6 and XMM7 in the shadow space in functions that
clobbers them. This way we don't have to adjust the stack
pointer as often, reducing the number of instructions as
well as code size.
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
* commit '2ddb35b91131115c094d90e04031451023441b4d':
x86: dsputil: Separate ff_add_hfyu_median_prediction_cmov from dsputil_mmx
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '258414d0771845d20f646ffe4d4e60f22fba217c':
x86: fdct: Initialize optimized fdct implementations in the standard way
Merged-by: Michael Niedermayer <michaelni@gmx.at>
This allows supporting files for which the image stride is smaller than
the max. block size + number of subpel mc taps, e.g. a 64x64 VP9 file
or a 16x16 VP8 file with -fflags +emu_edge.
The volatile is not required here, and prevents a miscompilation with GCC
4.8.1 when building on x86 with --cpu=i686
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
Currently all uses of the emu edge code as well as the code itself
assume int linesize
changing some but not changing all would introduce a security issue
once all use this typedef a simple search and replace can be
done to switch them all to ptrdiff_t
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit '6369ba3c9cc74becfaad2a8882dff3dd3e7ae3c0':
x86: avcodec: Use convenience macros to check for CPU flags
Conflicts:
libavcodec/x86/dsputil_init.c
libavcodec/x86/hpeldsp_init.c
libavcodec/x86/motion_est.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* qatar/master:
Consistently use "cpu_flags" as variable/parameter name for CPU flags
Conflicts:
libavcodec/x86/dsputil_init.c
libavcodec/x86/h264dsp_init.c
libavcodec/x86/hpeldsp_init.c
libavcodec/x86/motion_est.c
libavcodec/x86/mpegvideo.c
libavcodec/x86/proresdsp_init.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'e6d8acf6a8fba4743eb56eabe72a741d1bbee3cb':
indeo: use a typedef for the mc function pointer
cabac: x86 version of get_cabac_bypass
aic: use chroma scan tables while decoding luma component in progressive mode
Conflicts:
libavcodec/aic.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
The volatile is not required here, and prevents a miscompilation with GCC
4.8.1 when building on x86 with --cpu=i686
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>