Before
2843 decicycles in ff_sbr_autocorrelate_sse3, 262086 runs, 58 skips
After
2693 decicycles in ff_sbr_autocorrelate_sse3, 262117 runs, 27 skips
Signed-off-by: James Almer <jamrial@gmail.com>
x86inc can translate r*m into a register or a stack location on its own
Reviewed-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: James Almer <jamrial@gmail.com>
Also a slight change to the ssse3 code, which prevents a theoretical
overflow in the sharp filter.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
These fix failures of --enable-xmm-clobber-test
It would be better to change the code to use fewer registers, but until
someone does that, the declared register count must not be too small
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
This fixes artifacts in the last pixel of rows with some widths and pixel formats
Found-by: Dominique Leroux <Dominique.Leroux@autodesk.com>
Tested-by: Dominique Leroux <Dominique.Leroux@autodesk.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
For test images manually generated to contain only up prediction,
timing results:
8380x3032 255x185
before: 138635 1992
after: 139232 1996
Actually jumping to the proper version depending on the alignment:
8380x3032: 138767
A 0.5% speed improvement for gigantic images is not worth the code
duplication.
Fixes ticket #4148
Signed-off-by: Christophe Gisquet <christophe.gisquet@gmail.com>
Tested-by: Benoit Fouet <benoit.fouet@free.fr>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
11674 -> 10877 decicycles on my Phenom II.
Overall speedup was unfortunately within measurement error.
Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
Handle it inside the __asm__() block.
Fixes fate-vc1_ilaced_twomv when using the gcc-usan toolchain.
Reviewed-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: James Almer <jamrial@gmail.com>
cherry picked from commit df8ebe304df453f26c28ff8f11d607f49b90a4c2
Fixes out of array access
Fixes: asan_stack-oob_1046454_9_asan_stack-oob_15a9e7c_170_WP_MAIN10_B_Toshiba_3.bit
Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
~15% faster.
Also add an mmxext version that takes advantage of the new code, and
build it alongside the mmx version only on x86_32.
Reviewed-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: James Almer <jamrial@gmail.com>
It may be used by ff_add_pixels_clamped_sse2().
Should fix fate-cavs failures on some systems.
Reviewed-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: James Almer <jamrial@gmail.com>
Also add sse2 versions for both.
put_pixels_clamped port and sse2 version originally written by Timothy Gu.
Reviewed-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: James Almer <jamrial@gmail.com>
Same behavior as in simple_idct.
This way the best optimized versions available will be used instead.
Reviewed-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: James Almer <jamrial@gmail.com>
Roughly 25% faster MC than ssse3 for block sizes 32 and 64.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
Also add mmxext versions of vsad8 and vsad_intra8, and sse2 versions of
vsad16 and vsad_intra16.
Since vsad8 and vsad16 are not bitexact, they are accordingly marked as
approximate.
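As an illustration of what marking them as approximate means in practice, here is a
minimal C sketch of the init-time dispatch, assuming a bitexact flag and using
hypothetical function names (not the exact me_cmp symbols):

    #include <stddef.h>
    #include <stdint.h>

    typedef int (*vsad_fn)(const uint8_t *a, const uint8_t *b,
                           ptrdiff_t stride, int h);

    /* Hypothetical prototypes standing in for the real routines. */
    int vsad16_c(const uint8_t *a, const uint8_t *b, ptrdiff_t stride, int h);
    int vsad16_approx_sse2(const uint8_t *a, const uint8_t *b,
                           ptrdiff_t stride, int h);

    /* Approximate (non-bitexact) SIMD versions are only installed when
     * bitexact output was not requested. */
    static void init_vsad(vsad_fn *vsad16, int bitexact, int have_sse2)
    {
        *vsad16 = vsad16_c;                /* bitexact C reference */
        if (have_sse2 && !bitexact)
            *vsad16 = vsad16_approx_sse2;  /* faster, but not bitexact */
    }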
Reviewed-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: James Almer <jamrial@gmail.com>
No point in having the sad8 functions separate now that the loop is no
longer unrolled.
Reviewed-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: James Almer <jamrial@gmail.com>
This adds back support for 8x4 and 8x16.
It does not support 8x2; I think nothing uses that.
Found-by: ubitux
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Also add a missing c->pix_abs[0][0] initialization, and sse2 versions of
sad16_x2, sad16_y2 and sad16_xy2 (15% to 20% faster than mmxext).
Since the _xy2 versions are not bitexact, they are accordingly marked as
approximate.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
This lets the cglobal macro automatically append a suffix to the function name.
This means that INIT_XMM avx must be used rather than INIT_AVX.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit '95c0cec03acec0a80cc1c7db48f3b2355d9e767b':
idctdsp: Add global function pointers for {add|put}_pixels_clamped functions
Conflicts:
libavcodec/arm/idctdsp_init_arm.c
libavcodec/dct.h
libavcodec/idctdsp.c
libavcodec/jrevdct.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
These function pointers already existed in the ARM code. Adding them globally
allows calls to the function pointers to access arch-optimized versions of the
functions transparently.
* commit 'dcb7c868ec7af7d3a138b3254ef2e08f074d8ec5':
cosmetics: Make naming scheme of Xvid IDCT consistent with other IDCTs
Conflicts:
libavcodec/mpeg4videodec.c
libavcodec/x86/Makefile
libavcodec/x86/dct-test.c
libavcodec/x86/xvididct_sse2.c
libavcodec/xvididct.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
In some cases, unusual widths are handled with 2 or 3 calls to narrower
functions. Instead, perform 2 calls of different widths to split the
workload.
The 8+16 and 4+8 splits, for 8-bit and more-than-8-bit content
respectively, can't be processed that way without modifications: some of
the calls would use unaligned buffers, and adding branches to handle this
showed no micro-benchmark benefit.
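A rough C sketch of the split for block_w == 12, with hypothetical helper
names standing in for the real MC routines: one 8-wide call plus one 4-wide
call covers the row in two calls instead of three.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical narrow-width helpers. */
    void mc_w8(uint8_t *dst, ptrdiff_t dststride,
               const uint8_t *src, ptrdiff_t srcstride, int height);
    void mc_w4(uint8_t *dst, ptrdiff_t dststride,
               const uint8_t *src, ptrdiff_t srcstride, int height);

    /* Split a 12-wide block into an 8-wide and a 4-wide call. */
    static void mc_w12(uint8_t *dst, ptrdiff_t dststride,
                       const uint8_t *src, ptrdiff_t srcstride, int height)
    {
        mc_w8(dst,     dststride, src,     srcstride, height); /* columns 0..7  */
        mc_w4(dst + 8, dststride, src + 8, srcstride, height); /* columns 8..11 */
    }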
For block_w == 12 (around 1% of the pixels of the sequence):
Before:
12758 decicycles in epel_uni, 4093 runs, 3 skips
19389 decicycles in qpel_uni, 8187 runs, 5 skips
22699 decicycles in epel_bi, 32743 runs, 25 skips
34736 decicycles in qpel_bi, 32733 runs, 35 skips
After:
11929 decicycles in epel_uni, 4096 runs, 0 skips
18131 decicycles in qpel_uni, 8184 runs, 8 skips
20065 decicycles in epel_bi, 32750 runs, 18 skips
31458 decicycles in qpel_bi, 32753 runs, 15 skips
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* Reduced xmm register count to 7 (as such, they are now enabled for x86_32).
* Removed four movdqa (affects the sse2 version only).
* pxor is now used to clear m0 only once.
~5% faster.
Reviewed-by: Christophe Gisquet <christophe.gisquet@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
* commit 'efd26bedec9a345a5960dbfcbaec888418f2d4e6':
build: Add explanatory comments to (optimization) blocks in the Makefiles
Conflicts:
libavcodec/ppc/Makefile
libavcodec/x86/Makefile
Merged-by: Michael Niedermayer <michaelni@gmx.at>
It now does 12 samples per iteration, up from 4.
From 1.8 to 3.2 times faster again. 3.6 to 5.7 times faster overall.
Runtime is reduced by a further 2 to 18%. Overall runtime reduced by
4 to 50%.
Same conditions as before apply.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
From 1.8 to 2.4 times faster. Runtime is reduced by 2 to 39%. The
speed-up generally increases with compression_level.
This LPC encoder is not used with levels < 3, so it provides no speed-up
in those cases.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>