Commit Graph

1524 Commits

Author SHA1 Message Date
Diego Biurrun
17608f6ee3 x86: Add some more missing headers 2014-03-13 05:50:28 -07:00
Diego Biurrun
08dba0e1c3 x86: mpegvideoenc: Remove some remnants of the long-gone libmpeg2 IDCT 2014-03-13 05:50:28 -07:00
James Almer
9e0e1f9067 x86/dsputil: add emms to ff_scalarproduct_int16_mmxext()
Also undo the changes to ra144enc.c from previous commits.
Should fix ticket #3429

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-06 18:23:55 +01:00
Michael Niedermayer
2d99de66b7 Merge commit '3bfdee00cd92ff07c364d4901c4aefda32780756'
* commit '3bfdee00cd92ff07c364d4901c4aefda32780756':
  x86: dcadsp: Fix linking with yasm and optimizations disabled

Conflicts:
	libavcodec/x86/dcadsp_init.c

See: 206167a295
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-06 14:10:27 +01:00
Diego Biurrun
3bfdee00cd x86: dcadsp: Fix linking with yasm and optimizations disabled
Some optimized functions reference optimized symbols, so the functions
must be explicitly disabled when those symbols are unavailable.
2014-03-05 23:16:21 +01:00
Michael Niedermayer
146b476ba0 Merge commit '3741aa37c2a0d0717faff74a5c4cc357d16f6d1d'
* commit '3741aa37c2a0d0717faff74a5c4cc357d16f6d1d':
  x86: cabac: Use correct #includes to make header compile standalone

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-05 21:33:44 +01:00
Diego Biurrun
3741aa37c2 x86: cabac: Use correct #includes to make header compile standalone 2014-03-05 13:32:25 +01:00
James Almer
7fd64e3e36 x86/synth_filter: add synth_filter_fma3
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-05 01:58:16 +01:00
James Almer
206167a295 x86/synth_filter: add missing HAVE_YASM guard
Should fix compilation failures with --disable-yasm on some compilers

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-04 22:47:28 +01:00
James Almer
884e085d1e x86/synth_filter: Revert the switch to float ops with SSE2
This reverts the changes 6467209836
and 68c3ed936a did to the SSE2 version,
which generated a hit of about 5 cycles.

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-02 11:58:10 +01:00
James Almer
68c3ed936a x86/synth_filter: add synth_filter_avx
Sandy Bridge Win64:
180 cycles on ff_synth_filter_inner_sse2
150 cycles on ff_synth_filter_inner_avx

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-02 01:00:55 +01:00
James Almer
6467209836 x86/synth_filter: add synth_filter_sse
Build only on x86_32 targets.

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-01 15:32:40 +01:00
Michael Niedermayer
fb3c33f3cd Merge commit '4cb6964244fd6c099383d8b7e99731e72cc844b9'
* commit '4cb6964244fd6c099383d8b7e99731e72cc844b9':
  dcadec: simplify decoding of VQ high frequencies

Conflicts:
	configure
	libavcodec/dcadec.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-28 21:41:19 +01:00
Michael Niedermayer
baf3adc621 Merge commit '08e3ea60ff4059341b74be04a428a38f7c3630b0'
* commit '08e3ea60ff4059341b74be04a428a38f7c3630b0':
  x86: synth filter float: implement SSE2 version

Conflicts:
	libavcodec/x86/dcadsp.asm
	libavcodec/x86/dcadsp_init.c

See: 2cdbcc0048
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-28 20:38:39 +01:00
Christophe Gisquet
2cdbcc0048 x86: synth filter float: implement SSE2 version
Timings for Arrandale:
          C    SSE
win32:  2108   334
win64:  1152   322

Factorizing the inner loop with a call/jmp is a >15 cycles cost, even with
the jmp destination being aligned.

Unrolling for ARCH_X86_64 is a 20 cycles gain.

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-28 20:34:40 +01:00
Michael Niedermayer
e346a59383 Merge commit 'ad507d7907457e678900bac132122ba7be4644cb'
* commit 'ad507d7907457e678900bac132122ba7be4644cb':
  x86: dcadsp: implement SSE lfe_dir

Conflicts:
	libavcodec/x86/dcadsp.asm

See: 169243112c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-28 19:22:00 +01:00
Christophe Gisquet
169243112c x86: dcadsp: implement SSE lfe_dir
Results for Arrandale/Windows:
32: 1670 -> 316
64:  728 -> 298

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-28 19:20:03 +01:00
Michael Niedermayer
5ba1648318 Merge commit 'b23650491fbd579a4365f42bd42575afb7b53f7e'
* commit 'b23650491fbd579a4365f42bd42575afb7b53f7e':
  prores: Use consistent names for DSP arch initialization functions

Conflicts:
	libavcodec/proresdsp.c
	libavcodec/proresdsp.h
	libavcodec/x86/proresdsp_init.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-28 17:13:00 +01:00
Christophe Gisquet
4cb6964244 dcadec: simplify decoding of VQ high frequencies
The vector dequantization has a test in a loop preventing effective SIMD
implementation. By moving it out of the loop, this loop can be DSPized.

Therefore, modify the current DSP implementation. In particular, the
DSP implementation no longer has to handle null loop sizes.

The decode_hf implementations have following timings:

For x86 Arrandale:
        C  SSE SSE2 SSE4
win32: 260 162  119  104
win64: 242 N/A   89   72

The arm NEON optimizations follow in a later patch as external asm. The
now unused check for the y modifier in arm inline asm is removed from
configure.
2014-02-28 13:03:22 +01:00
Christophe Gisquet
08e3ea60ff x86: synth filter float: implement SSE2 version
Timings for Arrandale:
          C    SSE
win32:  2108   334
win64:  1152   322

Factorizing the inner loop with a call/jmp is a >15 cycles cost, even with
the jmp destination being aligned.

Unrolling for ARCH_X86_64 is a 20 cycles gain.

Signed-off-by: Janne Grunau <janne-libav@jannau.net>
2014-02-28 13:00:48 +01:00
Christophe Gisquet
ad507d7907 x86: dcadsp: implement SSE lfe_dir
Results for Arrandale/Windows:
32: 1670 -> 316
64:  728 -> 298

Signed-off-by: Janne Grunau <janne-libav@jannau.net>
2014-02-28 13:00:47 +01:00
Diego Biurrun
b23650491f prores: Use consistent names for DSP arch initialization functions 2014-02-28 10:34:55 +01:00
James Almer
2163a40a46 x86/imdct36: use sse3 instructions in the last BUTTERF step when possible
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-27 23:28:15 +01:00
James Almer
fbf98375e4 x86/imdct36: don't build imdct36_float_sse on x86_64 targets
There's an SSE2 version as well, and x86_64 guarantees that
instruction set is present.

Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-27 22:54:03 +01:00
James Almer
3f3d748cab x86: Move XOP emulation to x86util
We need the emulation to support the cases where the first
argument is the same as the fourth. To achieve this a fifth
argument working as a temporary may be needed.
Emulation that doesn't obey the original instruction semantics
can't be in x86inc.

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-24 08:30:19 +01:00
Michael Niedermayer
8372aaf721 Merge commit '017a06a9ee86b047079166c2694c9c655ff03356'
* commit '017a06a9ee86b047079166c2694c9c655ff03356':
  x86: dsputil: Use correct file name as multiple inclusion guard

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-20 14:58:04 +01:00
Diego Biurrun
017a06a9ee x86: dsputil: Use correct file name as multiple inclusion guard 2014-02-20 04:16:15 -08:00
Michael Niedermayer
130c33af35 Merge commit 'b23bc95920e2f10b9621857e829c45b064f356c0'
* commit 'b23bc95920e2f10b9621857e829c45b064f356c0':
  x86: dca: Add missing multiple inclusion guards

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-19 15:44:48 +01:00
Diego Biurrun
b23bc95920 x86: dca: Add missing multiple inclusion guards 2014-02-19 10:19:15 +01:00
Hendrik Leppkes
7716eda0aa vp9/x86: set correct number of registers used in intra pred asm
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-18 17:20:14 +01:00
James Almer
07b4b0ca62 tta/x86: add ff_ttafilter_process_dec_{ssse3, sse4}
Results are from a Win64 build running on an AMD FX 6300

1121 decicycles in ttafilter_process_dec_c, 16777112 runs, 104 skips
522 decicycles in ff_ttafilter_process_dec_ssse3, 16777149 runs, 67 skips
477 decicycles in ff_ttafilter_process_dec_sse4, 16777156 runs, 60 skips

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-17 13:51:19 +01:00
Ronald S. Bultje
fdb093c4e4 vp9/x86: intra prediction SIMD.
Partially based on h264_intrapred. (I hope to eventually merge these
two intrapred implementations back together.)
2014-02-17 13:39:00 +01:00
James Almer
ec482e738d x86/fladsp: add missing check to ff_flacdsp_init_x86()
Fixes compilation with flac decoder disabled and encoder enabled

Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-16 12:06:04 +01:00
Michael Niedermayer
d601106ab1 avcodec/x86/lossless_videodsp: fix w type
Fixes fate issues on mingw64

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-15 06:41:38 +01:00
Peter Ross
b8664c9294 avcodec/vp8dsp: add VP7 idct and loop filter
Signed-off-by: Peter Ross <pross@xvid.org>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-15 02:15:35 +01:00
James Almer
e87974bc00 flac/x86: add ff_flac_lpc_32_xop()
Tested on an AMD FX 6300

679081 decicycles in ff_flac_lpc_32_xop, 32768 runs
774425 decicycles in ff_flac_lpc_32_sse4, 32768 runs

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-13 22:14:59 +01:00
James Darnley
623f380a18 lavc: fix flac encoder and decoder dependencies
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-13 21:00:32 +01:00
Michael Niedermayer
df98b36aa6 Merge commit '5c1c6e82261b856214499b9fef3a08bf3ff6e0ae'
* commit '5c1c6e82261b856214499b9fef3a08bf3ff6e0ae':
  dca: include dcadsp.h in {arm,x86}/dca.h for checkheaders

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-08 17:25:31 +01:00
Michael Niedermayer
dd2b330347 Merge commit '0cffd6fff59f192120dc93aa6c3cb8180f5506e3'
* commit '0cffd6fff59f192120dc93aa6c3cb8180f5506e3':
  x86: use the inline int8x8_fmul_int32 only if inline SSE2 is availbale

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-08 17:06:57 +01:00
Janne Grunau
5c1c6e8226 dca: include dcadsp.h in {arm,x86}/dca.h for checkheaders 2014-02-08 13:38:36 +01:00
Janne Grunau
0cffd6fff5 x86: use the inline int8x8_fmul_int32 only if inline SSE2 is availbale
Fixes compilation with MSVC. Also does not rely on on earlier config.h
include but include it directly.
2014-02-08 12:10:56 +01:00
Clément Bœsch
669d4f9053 x86/vp9lpf: simplify 2nd transpose in 44/48/88/84.
For non-avx optims, this saves 8 movs.

before:
  1785 decicycles in ff_vp9_loop_filter_h_44_16_ssse3, 524129 runs, 159 skips
  3327 decicycles in ff_vp9_loop_filter_h_48_16_ssse3, 262116 runs, 28 skips
  2712 decicycles in ff_vp9_loop_filter_h_88_16_ssse3, 4193729 runs, 575 skips
  3237 decicycles in ff_vp9_loop_filter_h_84_16_ssse3, 524061 runs, 227 skips

after:
  1768 decicycles in ff_vp9_loop_filter_h_44_16_ssse3, 524062 runs, 226 skips
  3310 decicycles in ff_vp9_loop_filter_h_48_16_ssse3, 262107 runs, 37 skips
  2719 decicycles in ff_vp9_loop_filter_h_88_16_ssse3, 4193954 runs, 350 skips
  3184 decicycles in ff_vp9_loop_filter_h_84_16_ssse3, 524236 runs, 52 skips
2014-02-08 11:10:23 +01:00
Michael Niedermayer
82ae8a44e6 Merge commit '5b59a9fc6152169599561f04b4f66370edda5c9c'
* commit '5b59a9fc6152169599561f04b4f66370edda5c9c':
  x86: dcadsp: implement int8x8_fmul_int32

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-08 01:20:33 +01:00
Christophe Gisquet
5b59a9fc61 x86: dcadsp: implement int8x8_fmul_int32
For the callable function (as opposed to the inline one):
         C  SSE  SSE2  SSE4
Win32:  47   42   29    26
Win64:  30   33   25    23
The SSE version is neither compiled nor set for ARCH_X86_64, as the
inlinable function takes over.

Signed-off-by: Janne Grunau <janne-libav@jannau.net>
2014-02-07 22:52:40 +01:00
Loren Merritt
9c978f243a flac/x86: add ff_flac_lpc_32_sse4()
benchmarked on sandybridge x86_64:
1358232 decicycles in flac_lpc_32_c
1244575 decicycles in flac_lpc_32_sse4, James Almer's patch
 650045 decicycles in flac_lpc_32_sse4, this patch

I haven't tested the edgecases such as odd block lengths

odd block length tested-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-06 02:51:19 +01:00
Clément Bœsch
d92a725329 x86/vp9lpf: remove 8 SWAPs in 84/48 transpose. 2014-02-05 07:21:13 +01:00
Clément Bœsch
97dde561de x86/vp9lpf: remove braindead double pxor. 2014-02-05 07:21:11 +01:00
Clément Bœsch
9a3b05b0a9 x86/vp9lpf: save a few mov in flat8in/hev masks calc. 2014-02-05 07:21:09 +01:00
Clément Bœsch
91d85bb167 x86/vp9lpf: add ff_vp9_loop_filter_[vh]_44_16_{sse2,ssse3,avx}. 2014-02-05 07:21:06 +01:00
Michael Niedermayer
de17ccc774 Merge commit '51daafb02eaf96e0743a37ce95a7f5d02c1fa3c2'
* commit '51daafb02eaf96e0743a37ce95a7f5d02c1fa3c2':
  x86: videodsp: Properly mark sse2 instructions in emulated_edge_mc as such.

Conflicts:
	libavcodec/x86/videodsp_init.c

See: 1b3a7e1f42
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-31 14:30:30 +01:00
Clément Bœsch
c5dd73b890 x86/vp9lpf: add ff_vp9_loop_filter_h_{48,84}_16_{sse2,ssse3,avx}().
5.40s → 5.30s overall decode time with -threads 1 on ped1080p.webm
(i7 920, ssse3)
2014-01-30 19:34:13 +01:00
Ronald S. Bultje
9ee9c679a7 x86: videodsp: Fix a bug in a %if statement where we used '%%' instead of '&&'.
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
2014-01-30 15:33:23 +01:00
Ronald S. Bultje
51daafb02e x86: videodsp: Properly mark sse2 instructions in emulated_edge_mc as such.
Should fix crashes or corrupt output on pre-SSE2 CPUs when they were
using SSE2-code (e.g. AMD Athlon XP 2400+ or Intel Pentium III) in
hfix or hvar single-edge (left/right) extension functions.

Signed-off-by: Janne Grunau <janne-libav@jannau.net>
2014-01-30 15:30:01 +01:00
James Almer
644c32ea4b x86/vp9lpf: add ff_vp9_loop_filter_[vh]_88_16_sse2()
Similar gains as the ssse3 version once again

Signed-off-by: James Almer <jamrial@gmail.com>
2014-01-28 09:30:55 +01:00
Clément Bœsch
222c46c531 x86/vp9lpf: add ff_vp9_loop_filter_[vh]_88_16_{ssse3,avx}.
9680 decicycles in loop_filter_v_88_16_c, 4193765 runs, 539 skips
9233 decicycles in loop_filter_h_88_16_c, 4193751 runs, 553 skips

1929 decicycles in ff_vp9_loop_filter_v_88_16_ssse3, 4194118 runs, 186 skips
2738 decicycles in ff_vp9_loop_filter_h_88_16_ssse3, 4193861 runs, 443 skips

5.978 → 5.417 overall decode time on ped1080p.webm (-threads 1)

Adding SSE2 support should be relatively trivial (just a matter of
changing the pshufb [mask_mix] with something else), patch welcome.
2014-01-28 07:36:38 +01:00
Clément Bœsch
822385d775 x86/vp9lpf: add a preload system in FILTER_UPDATE.
Allow some macro refactoring in filter14().
2014-01-27 22:39:26 +01:00
Clément Bœsch
315b4775ad x86/vp9lpf: refactor v/h using common macros for P7 to Q7. 2014-01-27 22:39:26 +01:00
Clément Bœsch
5d144086cc x86/vp9lpf: faster P7..Q7 accesses.
Introduce 2 additional registers for stride3 and mstride3 to allow
direct accesses (lea drops).

3931 → 3827 decicycles in ff_vp9_loop_filter_v_16_16_ssse3

Also uses defines to clarify the code.
2014-01-27 22:37:42 +01:00
Clément Bœsch
5f4d04d084 x86/lossless_videodsp: silly one-line cosmetic. 2014-01-25 16:24:50 +01:00
Clément Bœsch
5267e85056 x86/lossless_videodsp: use common macro for add and diff int16 loop. 2014-01-25 14:27:37 +01:00
Clément Bœsch
cddbfd2a95 x86/lossless_videodsp: simplify and explicit aligned/unaligned flags 2014-01-25 11:59:43 +01:00
Ronald S. Bultje
c9e6325ed9 vp9/x86: use explicit register for relative stack references.
Before this patch, we explicitly modify rsp, which isn't necessarily
universally acceptable, since the space under the stack pointer might
be modified in things like signal handlers. Therefore, use an explicit
register to hold the stack pointer relative to the bottom of the stack
(i.e. rsp). This will also clear out valgrind errors about the use of
uninitialized data that started occurring after the idct16x16/ssse3
optimizations were first merged.
2014-01-24 19:25:25 -05:00
Ronald S. Bultje
97474d527f vp9/x86: iwht4x4 (lossless) mmx. 2014-01-24 19:25:25 -05:00
Ronald S. Bultje
d43efa68bd vp9/x86: 4x4 iadst SIMD (ssse3) variants.
Cycle measurements for intra itxfm_4x4_add on ped1080p.webm:
idct_idct:    66 -> 67 cycles (noise measurement)
idct_iadst:  199 -> 79 cycles
iadst_idct:  165 -> 70 cycles
iadst_iadst: 183 -> 82 cycles
2014-01-24 19:25:25 -05:00
Ronald S. Bultje
baf47020cd vp9/x86: 8x8 iadst SIMD (ssse3/avx) variants.
Cycle measurements for intra itxfm_8x8_add on ped1080p.webm:
idct_idct:   133 -> 135 cycles (noise measurement)
idct_iadst:  900 -> 241 cycles
iadst_idct:  864 -> 215 cycles
iadst_iadst: 973 -> 310 cycles
2014-01-24 19:25:25 -05:00
Michael Niedermayer
e6d1c66d74 avcodec/x86/lossless_videodsp: disable median optimizations for 16bps
They only support upto 15bps

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-23 01:51:24 +01:00
Michael Niedermayer
eaacfc7dd1 avcodec/lossless_videodsp: Pass AVCodecContext to init
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-23 01:43:00 +01:00
Michael Niedermayer
ef00ef7553 avcodec/x86/lossless_videodsp: port sub_hfyu_median_prediction_int16 to yasm
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-22 23:27:27 +01:00
Michael Niedermayer
fad49aae28 avcodec/x86/lossless_videodsp: Port sub_hfyu_median_prediction_mmxext to int16
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-22 22:55:49 +01:00
Michael Niedermayer
fee97f25fa avcodec/x86/lossless_videodsp: port add_hfyu_median_prediction_mmxext to 16bit
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-22 21:11:40 +01:00
Michael Niedermayer
631939bde6 avcodec/x86/lossless_videodsp: add diff_int16_mmx/sse2
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-22 19:41:21 +01:00
Reimar Döffinger
76421982d0 lossless_videodsp.asm: fix compilation.
Fixes these errors with nasm:
libavcodec/x86/lossless_videodsp.asm:86: error: invalid combination of opcode and operands
libavcodec/x86/lossless_videodsp.asm:88: error: invalid combination of opcode and operands
I don't know whether movd or movq was meant, but either way
maskq vs. maskd must match the mov size.

Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
2014-01-21 19:46:02 +01:00
Michael Niedermayer
83b67ca056 avcodec/x86/lossless_videodsp: Port lorens add_hfyu_left_prediction_ssse3/sse4 to 16bit
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-21 02:55:41 +01:00
Michael Niedermayer
63d2be7533 avcodec/x86/lossless_videodsp: use SPLATW in add_int16
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-21 02:33:20 +01:00
Michael Niedermayer
f70d7eb20c Move add/diff_int16 to lossless_videodsp
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-20 21:32:47 +01:00
Michael Niedermayer
a493f8541d avcodec/x86/dsp: add_int16_mmx / add_int16_sse2
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-20 04:06:46 +01:00
James Almer
26800e3864 vp9/x86: rename ff_avg[48]_sse to ff_avg[48]_mmxext
pavgb is an sse integer instruction, so the mmxext flag is enough

Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-18 17:08:25 +01:00
James Almer
d2a7314f1e vp9/x86: add ff_vp9_loop_filter_[vh]_16_16_sse2().
Similar gains in performance as the SSSE3 version

Signed-off-by: James Almer <jamrial@gmail.com>
2014-01-17 14:16:38 +01:00
Ronald S. Bultje
8173d1ffc0 vp9/x86: 16x16 iadst_idct, idct_iadst and iadst_iadst (ssse3+avx).
Sample timings on ped1080p.webm (of the ssse3 functions):
iadst_idct:  4672 -> 1175 cycles
idct_iadst:  4736 -> 1263 cycles
iadst_iadst: 4924 -> 1438 cycles
Total decoding time changed from 6.565s to 6.413s.
2014-01-16 13:49:31 +01:00
Clément Bœsch
9cc8fa63dd vp9/x86: simplify a few mc inits. 2014-01-16 07:48:27 +01:00
Michael Niedermayer
6391dec82a Merge remote-tracking branch 'qatar/master'
* qatar/master:
  x86: dsputil: Simplify xvmc deprecation conditional

Conflicts:
	libavcodec/x86/dsputil_init.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-15 20:41:08 +01:00
Diego Biurrun
aab40bbfd5 x86: dsputil: Simplify xvmc deprecation conditional 2014-01-15 15:23:46 +01:00
Clément Bœsch
8b4190da93 vp9/x86: add AVX for itxfm and lpf.
4412 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 4193462 runs, 842 skips
3600 decicycles in ff_vp9_loop_filter_h_16_16_avx, 4193621 runs, 683 skips

3010 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 4193528 runs, 776 skips
2678 decicycles in ff_vp9_loop_filter_v_16_16_avx, 4193742 runs, 562 skips

23025 decicycles in ff_vp9_idct_idct_32x32_add_ssse3, 2096871 runs, 281 skips
19943 decicycles in ff_vp9_idct_idct_32x32_add_avx, 2096815 runs, 337 skips

4675 decicycles in ff_vp9_idct_idct_16x16_add_ssse3, 4194018 runs, 286 skips
3980 decicycles in ff_vp9_idct_idct_16x16_add_avx, 4194022 runs, 282 skips

967 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 16776972 runs, 244 skips
887 decicycles in ff_vp9_idct_idct_8x8_add_avx, 16777002 runs, 214 skips
2014-01-15 15:54:03 +01:00
Michael Niedermayer
cb613657ee avcodec/x86/proresdsp_init: x86 prores IDCT is bitexact again
reenable it for for bitexact mode

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-14 15:59:00 +01:00
Michael Niedermayer
b148a39d55 Merge commit '46bacb5cc6169ff5e8e982495c4925467c1d8bb7'
* commit '46bacb5cc6169ff5e8e982495c4925467c1d8bb7':
  x86: Consistently use cpu flag detection macros in places that still miss it

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-14 14:44:59 +01:00
Diego Biurrun
46bacb5cc6 x86: Consistently use cpu flag detection macros in places that still miss it 2014-01-14 00:04:58 +01:00
Clément Bœsch
af68bd1c06 vp9/x86: add ff_vp9_loop_filter_[vh]_16_16_ssse3().
16662 decicycles in loop_filter_h_16_16_c, 8387355 runs, 1253 skips
17510 decicycles in loop_filter_v_16_16_c, 8387516 runs, 1092 skips

4941 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 8387887 runs, 721 skips
3899 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 8387980 runs, 628 skips

Overall decode time goes from:
  ./ffmpeg -v 0 -nostats -threads 1 -i ~/samples/vp9/ped1080p.webm -f null -  8.10s user 0.02s system 99% cpu 8.126 total
to:
  ./ffmpeg -v 0 -nostats -threads 1 -i ~/samples/vp9/ped1080p.webm -f null -  6.15s user 0.04s system 99% cpu 6.199 total

(46 to 61 fps)
2014-01-12 20:20:24 +01:00
Clément Bœsch
e11ceea68f vp9/x86: factor out some code in VP9_UNPACK_MULSUB_2W_4X. 2014-01-12 20:19:00 +01:00
Clément Bœsch
c9aa0b8f70 vp9/x86: remove reg redundancy in VP9_MULSUB_2W_2X. 2014-01-12 20:18:55 +01:00
Clément Bœsch
7c55ee6168 vp9/x86: merge IDCT coef macros. 2014-01-12 20:18:44 +01:00
Michael Niedermayer
92b2404571 Merge commit '4c642d8d98703faf52983243098f35865e15b312'
* commit '4c642d8d98703faf52983243098f35865e15b312':
  x86: hpeldsp: Add missing av_cold attribute to init function

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-09 20:32:53 +01:00
Michael Niedermayer
390452bab6 Merge commit 'b0be1ae792ac8bbfb0fc7b9b9cb39eaf0feb489b'
* commit 'b0be1ae792ac8bbfb0fc7b9b9cb39eaf0feb489b':
  x86: avcodec: Add a bunch of missing #includes for av_cold

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-09 20:24:15 +01:00
Diego Biurrun
4c642d8d98 x86: hpeldsp: Add missing av_cold attribute to init function 2014-01-09 15:09:07 +01:00
Diego Biurrun
b0be1ae792 x86: avcodec: Add a bunch of missing #includes for av_cold 2014-01-09 15:09:07 +01:00
Ronald S. Bultje
c6fe984f2f vp9/x86: make STORE_2X2 macro local.
Prevents this assembler warning:
libavcodec/x86/vp9itxfm.asm:1208: warning: (VP9_IDCT32_1D:309)
redefining multi-line macro `STORE_2X2'

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-08 14:07:15 +01:00
Ronald S. Bultje
04a187fb2a vp9/x86: idct_32x32_add_ssse3 sub-8x8-idct.
Runtime of the full 32x32 idct goes from 2446 to 2441 cycles (intra) or
from 1425 to 1306 cycles (inter). Overall runtime is not significantly
affected.
2014-01-07 20:43:35 -05:00
Ronald S. Bultje
37b001d14d vp9/x86: idct_32x32_add_ssse3 sub-16x16-idct.
Runtime of all IDCTs together goes from 3327 to 2473 cycles (intra, i.e.
~35% faster) or from 2312 to 1448 cycles (inter, i.e. ~60% faster). Total
decode time of ped1080p.webm goes from 8.086sec to 7.974sec (1.4% faster).
2014-01-07 20:43:34 -05:00
Ronald S. Bultje
e84d14df10 vp9/x86: idct_32x32_add_ssse3.
Sub-IDCTs will follow later. ped1080.webm goes from 9.295s to 8.191s
(13.5% faster). The IDCT itself goes from 4372 (intra) or 4337 (inter)
to 403 (intra) or 329 (inter) cycles for the DC-only form, 23755 (intra)
or 23723 (inter) to 3497 (intra) or 3607 (inter) cycles for the no-DC
form, which averages from 23393 (intra) or 16612 (inter) to 3449 (intra)
or 2392 (inter) for all 32x32s together, i.e. about ~7x faster (all
tests done on ped1080p.webm).
2014-01-07 20:43:30 -05:00
Michael Niedermayer
30056fd0be Merge commit 'a03a642d5ceb5f2f7c6ebbf56ff365dfbcdb65eb'
* commit 'a03a642d5ceb5f2f7c6ebbf56ff365dfbcdb65eb':
  h264: do not use 422 functions for monochrome

See: 07abf13da4
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-06 16:51:23 +01:00
Anton Khirnov
a03a642d5c h264: do not use 422 functions for monochrome
Fixes invalid memory access.

Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
CC:libav-stable@libav.org
2014-01-06 08:25:36 +01:00