Henrik Gramner
428aa14a48
x86inc: Make INIT_CPUFLAGS support an arbitrary number of cpuflags
...
Previously there was a limit of two cpuflags.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-09-05 14:06:03 +02:00
Henrik Gramner
720c21d11f
x86inc: Make ym# behave the same way as xm#
...
This makes more sense for future implementations of templates with zmm registers.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-09-05 01:55:28 +02:00
Loren Merritt
a4dbabc8b3
x86inc: free up variable name "n" in global namespace
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-09-05 01:41:50 +02:00
Clément Bœsch
554d819062
avutil/pixelutils: faster pixelutils_sad_16x16
...
501 to 439 decicycles.
See 45c7f3997ea11c3d1007b2126b1c0049a8c27105.
2014-08-23 20:12:56 +02:00
Clément Bœsch
45c7f3997e
avutil/pixelutils: faster pixelutils_sad_[au]_16x16
...
~560 → ~500 decicycles
This is following the comments from Michael in
https://ffmpeg.org/pipermail/ffmpeg-devel/2014-August/160599.html
Using 2 registers for accumulator didn't help. On the other hand,
some re-ordering between the movs and psadbw allowed going ~538 to ~500.
2014-08-23 10:18:53 +02:00
Michael Niedermayer
70b8668fb5
drop LLS1, rename LLS2 to LLS
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-08-09 23:20:31 +02:00
Clément Bœsch
28a2107a8d
avutil: add pixelutils API
2014-08-05 21:05:52 +02:00
James Almer
d0f56ca071
x86/hevc_deblock: improve 8bit transpose store macros
...
Up to four instructions less depending on function and instruction set.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-08-03 04:24:15 +02:00
James Almer
1ace9573dc
x86/hevc_idct: replace old and unused idct functions
...
Only 8-bit and 10-bit idct_dc() functions are included (adding others should be trivial).
Benchmarks on an Intel Core i5-4200U:
idct8x8_dc
SSE2 MMXEXT C
cycles 22 26 57
idct16x16_dc
AVX2 SSE2 C
cycles 27 32 249
idct32x32_dc
AVX2 SSE2 C
cycles 62 126 1375
Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: Mickaël Raulet <mraulet@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-07-26 18:00:11 +02:00
Michael Niedermayer
8d0c7031a8
Merge commit '79793f833784121d574454af4871866576c0749d'
...
* commit '79793f833784121d574454af4871866576c0749d':
Update Fiona's name in copyright statements.
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-07-01 15:43:40 +02:00
Diego Biurrun
79793f8337
Update Fiona's name in copyright statements.
2014-07-01 03:26:51 -07:00
Christophe Gisquet
9107612818
x86util: add and use RSHIFT/LSHIFT macros
...
Those macros take a byte number as shift argument, as this argument
differs between MMX and SSE2 instructions.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-15 13:19:27 +02:00
James Almer
85065d2a7c
x86/float_dsp: add missing femms
...
It was lost during the port.
Should fix fate on 3dnowext machines.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-08 20:06:28 +02:00
James Almer
dcaf9660b6
x86/float_dsp: port vector_fmul_window to yasm
...
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-08 12:41:32 +02:00
James Almer
fc8db12a73
x86/vp9: inital AVX2 intra_pred
...
tos3k-vp9-b10000.webm on a Core i5-4200U @1.6GHz
1219 decicycles in ff_vp9_ipred_dc_32x32_ssse3, 131070 runs, 2 skips
439 decicycles in ff_vp9_ipred_dc_32x32_avx2, 131070 runs, 2 skips
3570 decicycles in ff_vp9_ipred_dc_top_32x32_ssse3, 4096 runs, 0 skips
2494 decicycles in ff_vp9_ipred_dc_top_32x32_avx2, 4096 runs, 0 skips
1419 decicycles in ff_vp9_ipred_dc_left_32x32_ssse3, 16384 runs, 0 skips
717 decicycles in ff_vp9_ipred_dc_left_32x32_avx2, 16384 runs, 0 skips
2737 decicycles in ff_vp9_ipred_tm_32x32_avx, 1024 runs, 0 skips
2088 decicycles in ff_vp9_ipred_tm_32x32_avx2, 1024 runs, 0 skips
3090 decicycles in ff_vp9_ipred_v_32x32_avx, 512 runs, 0 skips
2226 decicycles in ff_vp9_ipred_v_32x32_avx2, 512 runs, 0 skips
1565 decicycles in ff_vp9_ipred_h_32x32_avx, 1024 runs, 0 skips
922 decicycles in ff_vp9_ipred_h_32x32_avx2, 1024 runs, 0 skips
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-08 02:37:20 +02:00
Christophe Gisquet
2267003981
x86: hpeldsp: better factorization
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-29 21:47:40 +02:00
James Almer
561bfc85eb
x86/dsputilenc: implement SSE2 versions of pix_{sum16, norm1}
...
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-28 23:29:34 +02:00
Matt Oliver
1898c2f49d
inline asm: fix arrays as named constraints.
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-07 15:02:45 +02:00
James Almer
3b06208a57
x86/float_dsp: remove duplicated code from vector_dmul_scalar
...
Use the xm# and ym# aliases as they remain in sync with m# after a SWAP.
No actual changes to the assembly.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-19 14:21:51 +02:00
James Almer
76ed71a72b
x86: move horizontal add macros to x86util
...
Also port relevant AVX2/XOP optimizations from x264 with permission
to relicense to LGPL from the corresponding authors
Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-17 14:15:09 +02:00
James Almer
11b36b1ee0
x86/float_dsp: unroll loop in vector_fmac_scalar
...
~6% faster SSE2 performance. AVX/FMA3 are unaffected.
Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: Christophe Gisquet <christophe.gisquet@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-16 18:36:52 +02:00
James Almer
3b808900af
x86/float_dsp: use SWAP in vector_fmac_scalar Win64
...
The mova is unnecessary
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-16 15:46:21 +02:00
James Almer
2d9821a208
x86/cpu: check for OS support before enabling AVX2
...
AV_CPU_FLAG_AVX is enabled at this point only if there's OS support.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-25 17:56:43 +01:00
Matt Oliver
8236747511
Automatically change MANGLE() into named inline asm operands when direct symbol reference in inline asm are not supported.
...
This is part of the patch-set for intel C inline asm on windows support
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-18 23:39:30 +01:00
James Almer
7d7487e85c
x86/float_dsp: add ff_vector_{fmul_add, fmac_scalar}_fma3
...
~7% faster than AVX
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-13 04:34:05 +01:00
Michael Niedermayer
4159f702a7
avutil/timer: Fix units for x86 after c708b5403346255ea5adc776645616cc7c61f078
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-09 15:22:02 +01:00
James Almer
3f3d748cab
x86: Move XOP emulation to x86util
...
We need the emulation to support the cases where the first
argument is the same as the fourth. To achieve this a fifth
argument working as a temporary may be needed.
Emulation that doesn't obey the original instruction semantics
can't be in x86inc.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-24 08:30:19 +01:00
Michael Niedermayer
bd8d73ea8b
Merge remote-tracking branch 'qatar/master'
...
* qatar/master:
x86: add detection for Bit Manipulation Instruction sets
Conflicts:
libavutil/x86/cpu.c
See: 0bc3de19ffe296254f214dc7615e624d8e401bcb
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-23 22:52:58 +01:00
Michael Niedermayer
d9574069c1
Merge commit '1b932eb1508f550fac9e911923a0383efda53aa3'
...
* commit '1b932eb1508f550fac9e911923a0383efda53aa3':
x86: add detection for FMA3 instruction set
Conflicts:
configure
libavutil/cpu.h
libavutil/x86/cpu.c
See: a2af8eddab75f1eac712411e4dde89823c0845e8
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-23 22:43:08 +01:00
James Almer
d59fcdaff3
x86: add detection for Bit Manipulation Instruction sets
...
Based on x264 code
Signed-off-by: James Almer <jamrial@gmail.com>
2014-02-23 15:29:36 +01:00
James Almer
1b932eb150
x86: add detection for FMA3 instruction set
...
Based on x264 code
Signed-off-by: James Almer <jamrial@gmail.com>
2014-02-23 15:29:36 +01:00
James Almer
10b0161d78
x86: add missing XOP checks and macros
...
Signed-off-by: James Almer <jamrial@gmail.com>
2014-02-23 15:29:36 +01:00
James Almer
0bc3de19ff
x86: add detection for Bit Manipulation Instruction sets
...
Based on x264 code
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-22 17:26:00 +01:00
James Almer
a2af8eddab
x86: add detection for FMA3 instruction set
...
Based on x264 code
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-22 17:25:52 +01:00
Christophe Gisquet
996697e266
x86: float dsp: unroll SSE versions
...
vector_fmul and vector_fmac_scalar are guaranteed that they can process in
batch of 16 elements, but their SSE versions only does 8 at a time.
Therefore, unroll them a bit.
299 to 261c for 256 elements in vector_fmac_scalar on Arrandale/Win64.
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
2014-02-20 14:18:05 +01:00
Christophe Gisquet
133b34207c
x86: float dsp: unroll SSE versions
...
vector_fmul and vector_fmac_scalar are guaranteed that they can process in
batch of 16 elements, but their SSE versions only does 8 at a time.
Therefore, unroll them a bit.
299 to 261c for 256 elements in vector_fmac_scalar on Arrandale/Win64.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-15 18:54:21 +01:00
James Almer
23a8c63452
x86inc: Extend FMA_INSTR functionality
...
Support the cases where the first and last operand of
the XOP instruction are the same.
Also add vpmacsdql emulation.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-13 22:14:24 +01:00
James Almer
6c12b1de06
x86: add missing XOP checks and macros
...
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-11 03:46:52 +01:00
Loren Merritt
b7d0d10a1d
x86inc: Speed up assembling with Yasm
...
Work around Yasm's inefficiency with handling large numbers of variables
in the global scope.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
2014-01-26 18:40:08 +01:00
Loren Merritt
4d55fe7204
x86inc: speed up compilation with yasm
...
Work around yasm's inefficiency with handling large numbers of variables
in the global scope.
2014-01-18 01:19:16 +01:00
Michael Niedermayer
c3814ab654
rename new lls code to lls2 to avoid conflict with the old which has a different ABI
...
also remove failed attempt at a compatibility layer, the code simply cannot work
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-11-17 16:41:08 +01:00
Michael Niedermayer
bbe66ef912
avutil: rename lls to lls2
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-11-17 16:30:23 +01:00
Michael Niedermayer
a665704402
Merge commit '4d6ee0725553a43ba88d6f8327ebcf8f1c5ae8d4'
...
* commit '4d6ee0725553a43ba88d6f8327ebcf8f1c5ae8d4':
libavutil: x86: Add AVX2 capable CPU detection.
Conflicts:
libavutil/cpu.c
libavutil/cpu.h
libavutil/x86/cpu.c
See: 865b70bc5d1cf37ec6d6cb729a69dda2cca28bd5
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-10-26 02:36:36 +02:00
Kieran Kunhya
865b70bc5d
Add AVX2 capable CPU detection. Patch based on x264's AVX2 detection
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-10-26 02:34:22 +02:00
Kieran Kunhya
4d6ee07255
libavutil: x86: Add AVX2 capable CPU detection.
...
Patch based on x264's AVX2 detection
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
2013-10-25 19:36:55 +01:00
Michael Niedermayer
f9bef2bec9
Merge remote-tracking branch 'qatar/master'
...
* qatar/master:
x86: more AVX2 framework
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-10-14 16:13:57 +02:00
Michael Niedermayer
e3e0e3d0c9
Merge commit 'c6908d6b4b377a04a5d055ba874bdbcf06c80497'
...
* commit 'c6908d6b4b377a04a5d055ba874bdbcf06c80497':
x86inc: FMA3/4 Support
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-10-14 16:06:22 +02:00
Michael Niedermayer
9ac124c889
Merge commit '206895708ea2b464755d340e44501daf9a07c310'
...
* commit '206895708ea2b464755d340e44501daf9a07c310':
x86inc: Remove our FMA4 support
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-10-14 15:54:23 +02:00
Michael Niedermayer
12e4493f9c
Merge commit 'c108ba0175d4fc3a3253a8b0f782fbfb96ba5098'
...
* commit 'c108ba0175d4fc3a3253a8b0f782fbfb96ba5098':
x86inc: Use VEX-encoded instructions in AVX functions
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-10-14 15:48:34 +02:00
Jason Garrett-Glaser
a3fabc6cb3
x86: more AVX2 framework
...
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
2013-10-14 12:41:56 +01:00