The xmm reg count was incorrect, and manual loading of the gprs
furthermore allows to noticeable reduce the number needed.
The modified functions are used in weighted prediction, so only a
few samples like WP_* exhibit a change. For this one and Win64
(some widths removed because of too few occurrences):
WP_A_Toshiba_3.bit, ff_hevc_put_hevc_uni_w
16 32
before: 2194 3872
after: 2119 3767
WP_B_Toshiba_3.bit, ff_hevc_put_hevc_bi_w
16 32 64
before: 2819 4960 9396
after: 2617 4788 9150
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit '512f3ffe9b4bb86767c2b1176554407c75fe1a5c':
dsputil: Split off HuffYUV encoding bits into their own context
Conflicts:
configure
libavcodec/dsputil.c
libavcodec/dsputil.h
libavcodec/huffyuv.h
libavcodec/huffyuvenc.c
libavcodec/pngenc.c
libavcodec/x86/dsputilenc_mmx.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Loading pb_1 rather than pw_8192 was benchmarked to be more efficient.
Loading of the 2 yields no advantage. Loading of one saves ~11 cycles.
decicycles count:
put8: 3223(mmx) -> 2387
avg8: 2863(mmxext) -> 2125
put16: 4356(sse2) -> 3553
avg16: 4481(sse2) -> 3513
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
No inline asm dirac code remains in the tree, so replace every relevant check.
This also moves all the dirac functions from dsputil_mmx.c to diracdsp_mmx.c
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
This is a port of the inline assembly of the mmx version to use the
pavg(us|)b instruction.
8 16
mmx 1498 4355
mmx2 1242 3509
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Currently, only the mmx version is bitexact, the others (mmxext and
3dnow) are not, in spite of their naming.
Therefore, make their name more obvious. Also restore a comment that
was removed in 71155d7b.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
By default, macro EPEL_FILTER loads the coefficients inconditionally
into m14/m15. This forces an unneeded higher register count.
Reduce that count by making them parameters of EPEL_FILTER.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
The second argument is a temp register for non-SSSE3 cases
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Including stdint.h was enough for systems like Mingw, but apparently not for Linux.
This should fix make checkheaders failures on every platform
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
None of the handwritten asm in this function seems to be SSE2
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
This allows further unrolling the DSP implementation where possible.
x86 and ARM DSP modified by simply moving the multiple calls from vc1dec
to the DSP code. Decoding improvements should only occurs because of the
compiler actually able to unroll more.
Decoding time: ~8.80s -> 8.64s (ie around 2%)
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
In the 2 FILTER_INIT usages, the source is already preloaded so that
extra complexity taken from FILTER_UPDATE is not necessary.
Also add forgotten "mask" argument in FILTER_{INIT,UPDATE} comments.
Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Also port relevant AVX2/XOP optimizations from x264 with permission
to relicense to LGPL from the corresponding authors
Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
movq from SSE register to memory is an SSE2 instruction.
Instead, use SSE movlps, which does the same thing.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit '57b5b84e208ad61ffdd74ad849bed212deb92bc5':
x86: dsputil: Move ff_apply_window_int16_* bits to ac3dsp, where they belong
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '01c5779f56cf708e6cb88b11cfdc248cae7e2ee8':
x86: Drop some unnecessary YASM ifdefs
Conflicts:
libavfilter/x86/vf_yadif_init.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '3dc6272bed7890a49080e18eacf3c7a4a6594b0d':
Remove a number of unnecessary dsputil.h #includes
Conflicts:
libavcodec/h264pred.c
libavcodec/vc1dsp.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Fixes compilation failures with "--disable-{avx,fma3} --disable-optimizations"
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Sandy Bridge Win64:
180 cycles in ff_synth_filter_inner_sse2
150 cycles in ff_synth_filter_inner_avx
Also switch some instructions to a three operand format to avoid
assembly errors with Yasm 1.1.0 or older.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
Further performance improvements and security fixes by
Vittorio Giovara, Luca Barbato and Diego Biurrun.
Signed-off-by: Vittorio Giovara <vittorio.giovara@gmail.com>
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
Signed-off-by: Diego Biurrun <diego@biurrun.de>
* commit '82dd1026cfc1d72b04019185bea4c1c9621ace3f':
x86: dsputil: Move hpeldsp-related declarations to a separate header
Conflicts:
libavcodec/x86/dsputil_x86.h
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '6655c933a887a2d20707fff657b614aa1d86a25b':
x86: dsputil: Move fpel declarations to a separate header
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '1a8d0cf77ed2611e542ae98f341d4c43a04467bd':
x86: dsputil: Move inline assembly macros to a separate header
Conflicts:
libavcodec/x86/dsputil_mmx.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '82bb3048013201c0095d2853d4623633d912252f':
dsputil: Use correct type in me_cmp_func function pointer
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '0e083d7e43805db1a978cb57bfa25fda62e8ff18':
build: Group general components separate from de/encoders in arch Makefiles
Conflicts:
libavcodec/arm/Makefile
libavcodec/x86/Makefile
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '5169e688956be3378adb3b16a93962fe0048f1c9':
dsputil: Propagate bit depth information to all (sub)init functions
Conflicts:
libavcodec/arm/dsputil_init_arm.c
libavcodec/arm/dsputil_init_armv5te.c
libavcodec/arm/dsputil_init_armv6.c
libavcodec/arm/dsputil_init_neon.c
libavcodec/dsputil.c
libavcodec/dsputil.h
libavcodec/ppc/dsputil_ppc.c
libavcodec/x86/dsputil_init.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Replace mulps+subps with fnmaddps, resulting in two less instructions inside the
inner loops.
About 1% faster FMA3 performance.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'db3f61a04f1f66746660f921bb2780ddf1141f3b':
x86: dsputil_init: Drop some unnecessary parentheses
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '6a8b35dc88b4a1a452f192fbbf53ae7f59bc3f23':
dsputilenc_mmx: Merge two assignment blocks with identical conditions
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '55519926ef855c671d084ccc151056de9e3d3a77':
x86: Make function prototype comments in assembly code consistent
Conflicts:
libavcodec/x86/sbrdsp.asm
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'edd1f833fa145eb9c5026877c699ebe6efca00a0':
x86: h264_idct_10_bit: Use proper type in function prototype comments
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '831a1180785a786272cdcefb71566a770bfb879e':
Update dsputil- and SIMD-related comments to match reality more closely
Conflicts:
libavcodec/x86/hpeldsp.asm
libavutil/arm/float_dsp_init_arm.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Also undo the changes to ra144enc.c from previous commits.
Should fix ticket #3429
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit '3741aa37c2a0d0717faff74a5c4cc357d16f6d1d':
x86: cabac: Use correct #includes to make header compile standalone
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Should fix compilation failures with --disable-yasm on some compilers
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
This reverts the changes 6467209836
and 68c3ed936a did to the SSE2 version,
which generated a hit of about 5 cycles.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Sandy Bridge Win64:
180 cycles on ff_synth_filter_inner_sse2
150 cycles on ff_synth_filter_inner_avx
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Timings for Arrandale:
C SSE
win32: 2108 334
win64: 1152 322
Factorizing the inner loop with a call/jmp is a >15 cycles cost, even with
the jmp destination being aligned.
Unrolling for ARCH_X86_64 is a 20 cycles gain.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
The vector dequantization has a test in a loop preventing effective SIMD
implementation. By moving it out of the loop, this loop can be DSPized.
Therefore, modify the current DSP implementation. In particular, the
DSP implementation no longer has to handle null loop sizes.
The decode_hf implementations have following timings:
For x86 Arrandale:
C SSE SSE2 SSE4
win32: 260 162 119 104
win64: 242 N/A 89 72
The arm NEON optimizations follow in a later patch as external asm. The
now unused check for the y modifier in arm inline asm is removed from
configure.
Timings for Arrandale:
C SSE
win32: 2108 334
win64: 1152 322
Factorizing the inner loop with a call/jmp is a >15 cycles cost, even with
the jmp destination being aligned.
Unrolling for ARCH_X86_64 is a 20 cycles gain.
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
There's an SSE2 version as well, and x86_64 guarantees that
instruction set is present.
Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
We need the emulation to support the cases where the first
argument is the same as the fourth. To achieve this a fifth
argument working as a temporary may be needed.
Emulation that doesn't obey the original instruction semantics
can't be in x86inc.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit '017a06a9ee86b047079166c2694c9c655ff03356':
x86: dsputil: Use correct file name as multiple inclusion guard
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Results are from a Win64 build running on an AMD FX 6300
1121 decicycles in ttafilter_process_dec_c, 16777112 runs, 104 skips
522 decicycles in ff_ttafilter_process_dec_ssse3, 16777149 runs, 67 skips
477 decicycles in ff_ttafilter_process_dec_sse4, 16777156 runs, 60 skips
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Fixes compilation with flac decoder disabled and encoder enabled
Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Tested on an AMD FX 6300
679081 decicycles in ff_flac_lpc_32_xop, 32768 runs
774425 decicycles in ff_flac_lpc_32_sse4, 32768 runs
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit '5c1c6e82261b856214499b9fef3a08bf3ff6e0ae':
dca: include dcadsp.h in {arm,x86}/dca.h for checkheaders
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '0cffd6fff59f192120dc93aa6c3cb8180f5506e3':
x86: use the inline int8x8_fmul_int32 only if inline SSE2 is availbale
Merged-by: Michael Niedermayer <michaelni@gmx.at>
For the callable function (as opposed to the inline one):
C SSE SSE2 SSE4
Win32: 47 42 29 26
Win64: 30 33 25 23
The SSE version is neither compiled nor set for ARCH_X86_64, as the
inlinable function takes over.
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
benchmarked on sandybridge x86_64:
1358232 decicycles in flac_lpc_32_c
1244575 decicycles in flac_lpc_32_sse4, James Almer's patch
650045 decicycles in flac_lpc_32_sse4, this patch
I haven't tested the edgecases such as odd block lengths
odd block length tested-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Should fix crashes or corrupt output on pre-SSE2 CPUs when they were
using SSE2-code (e.g. AMD Athlon XP 2400+ or Intel Pentium III) in
hfix or hvar single-edge (left/right) extension functions.
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
9680 decicycles in loop_filter_v_88_16_c, 4193765 runs, 539 skips
9233 decicycles in loop_filter_h_88_16_c, 4193751 runs, 553 skips
1929 decicycles in ff_vp9_loop_filter_v_88_16_ssse3, 4194118 runs, 186 skips
2738 decicycles in ff_vp9_loop_filter_h_88_16_ssse3, 4193861 runs, 443 skips
5.978 → 5.417 overall decode time on ped1080p.webm (-threads 1)
Adding SSE2 support should be relatively trivial (just a matter of
changing the pshufb [mask_mix] with something else), patch welcome.
Introduce 2 additional registers for stride3 and mstride3 to allow
direct accesses (lea drops).
3931 → 3827 decicycles in ff_vp9_loop_filter_v_16_16_ssse3
Also uses defines to clarify the code.
Before this patch, we explicitly modify rsp, which isn't necessarily
universally acceptable, since the space under the stack pointer might
be modified in things like signal handlers. Therefore, use an explicit
register to hold the stack pointer relative to the bottom of the stack
(i.e. rsp). This will also clear out valgrind errors about the use of
uninitialized data that started occurring after the idct16x16/ssse3
optimizations were first merged.
Fixes these errors with nasm:
libavcodec/x86/lossless_videodsp.asm:86: error: invalid combination of opcode and operands
libavcodec/x86/lossless_videodsp.asm:88: error: invalid combination of opcode and operands
I don't know whether movd or movq was meant, but either way
maskq vs. maskd must match the mov size.
Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
pavgb is an sse integer instruction, so the mmxext flag is enough
Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit '46bacb5cc6169ff5e8e982495c4925467c1d8bb7':
x86: Consistently use cpu flag detection macros in places that still miss it
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'b0be1ae792ac8bbfb0fc7b9b9cb39eaf0feb489b':
x86: avcodec: Add a bunch of missing #includes for av_cold
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Runtime of the full 32x32 idct goes from 2446 to 2441 cycles (intra) or
from 1425 to 1306 cycles (inter). Overall runtime is not significantly
affected.
Runtime of all IDCTs together goes from 3327 to 2473 cycles (intra, i.e.
~35% faster) or from 2312 to 1448 cycles (inter, i.e. ~60% faster). Total
decode time of ped1080p.webm goes from 8.086sec to 7.974sec (1.4% faster).
Sub-IDCTs will follow later. ped1080.webm goes from 9.295s to 8.191s
(13.5% faster). The IDCT itself goes from 4372 (intra) or 4337 (inter)
to 403 (intra) or 329 (inter) cycles for the DC-only form, 23755 (intra)
or 23723 (inter) to 3497 (intra) or 3607 (inter) cycles for the no-DC
form, which averages from 23393 (intra) or 16612 (inter) to 3449 (intra)
or 2392 (inter) for all 32x32s together, i.e. about ~7x faster (all
tests done on ped1080p.webm).
* commit 'a03a642d5ceb5f2f7c6ebbf56ff365dfbcdb65eb':
h264: do not use 422 functions for monochrome
See: 07abf13da4
Merged-by: Michael Niedermayer <michaelni@gmx.at>