Commit Graph

1978 Commits

Author SHA1 Message Date
James Almer
6ec3dc97fc x86/audiodsp: move asm code out of dsputil
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-22 19:53:09 +02:00
Michael Niedermayer
99497b4683 Merge commit '9a9e2f1c8aa4539a261625145e5c1f46a8106ac2'
* commit '9a9e2f1c8aa4539a261625145e5c1f46a8106ac2':
  dsputil: Split audio operations off into a separate context

Conflicts:
	configure
	libavcodec/takdec.c
	libavcodec/x86/Makefile
	libavcodec/x86/dsputil.asm
	libavcodec/x86/dsputil_init.c
	libavcodec/x86/dsputil_mmx.c
	libavcodec/x86/dsputil_x86.h

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-22 17:58:28 +02:00
Diego Biurrun
9a9e2f1c8a dsputil: Split audio operations off into a separate context 2014-06-22 06:20:15 -07:00
Michael Niedermayer
33f83a2157 avcodec/x86/rv40dsp_init: fix () in macros
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-20 21:36:43 +02:00
James Almer
a5ce608fc7 x86/blockdsp: restore author attribution
See commits
649c00c96d
5fecfb7d58
73b02e2460

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-19 18:31:44 +02:00
Michael Niedermayer
08c5859f17 avcodec: add simpleauto idct
This will pick the "best" simple idct compatible idct

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-19 14:28:01 +02:00
James Almer
454c019cb5 x86/hevc_idct: fix movd parameter size in DC_ADD_INIT
Fixes compilation with NASM x86_64

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-19 13:18:13 +02:00
James Almer
fe782233aa x86/blockdsp: move asm code out of dsputil
Also replace INLINE_<opt> with EXTERNAL_<opt> that were wrongly
changed by commit 2b05db4f81

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-19 13:09:03 +02:00
Michael Niedermayer
042a82ca37 avcodec/x86/lossless_videodsp: Fix size of values read for left/left_top
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-19 05:53:41 +02:00
Michael Niedermayer
2b05db4f81 Merge commit 'e74433a8e6fc00c8dbde293c97a3e45384c2c1d9'
* commit 'e74433a8e6fc00c8dbde293c97a3e45384c2c1d9':
  dsputil: Split clear_block*/fill_block* off into a separate context

Conflicts:
	configure
	libavcodec/asvdec.c
	libavcodec/dnxhddec.c
	libavcodec/dnxhdenc.c
	libavcodec/dsputil.h
	libavcodec/eamad.c
	libavcodec/intrax8.c
	libavcodec/mjpegdec.c
	libavcodec/ppc/dsputil_ppc.c
	libavcodec/vc1dec.c
	libavcodec/x86/dsputil_init.c
	libavcodec/x86/dsputil_mmx.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-19 04:54:38 +02:00
Diego Biurrun
e74433a8e6 dsputil: Split clear_block*/fill_block* off into a separate context 2014-06-18 14:07:23 -07:00
plepere
92cccb7bcd avcodec/hevc: new idct + asm
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-17 13:23:36 +02:00
Christophe Gisquet
9107612818 x86util: add and use RSHIFT/LSHIFT macros
Those macros take a byte number as shift argument, as this argument
differs between MMX and SSE2 instructions.

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-15 13:19:27 +02:00
Ronald S. Bultje
385a3420d1 vp9/x86: fix overwrite in ipred_vl_4x4_ssse3.
Fixes track ticket 3717.

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-12 04:11:20 +02:00
Christophe Gisquet
508e7a5c16 x86: huffyuv: fix {add,diff}_int16
They used an extra, undeclared register. Fixes a crash in
fate-vsynth3-ffvhuff444p16

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-12 00:26:19 +02:00
Michael Niedermayer
1a2ff62859 Merge commit '570d4b21863b6254d6bbca9c528bede471bb4478'
* commit '570d4b21863b6254d6bbca9c528bede471bb4478':
  x86: h264: Don't keep data in the redzone across function calls on 64 bit unix

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-10 18:35:49 +02:00
Martin Storsjö
570d4b2186 x86: h264: Don't keep data in the redzone across function calls on 64 bit unix
We know that the called function (ff_chroma_inter_body_mmxext)
doesn't touch the redzone, and thus will be kept intact - thus,
this doesn't fix any bug per se.

However, valgrind's memcheck tool intentionally assumes that the
redzone is clobbered on every function call and function return
(see a long comment in valgrind/memcheck/mc_main.c). This avoids
false positives in that tool, at the cost of an extra stack pointer
adjustment.

The other alternative would be a valgrind suppression for this issue,
but that's an extra burden for everybody that wants to run libavcodec
within valgrind.

Signed-off-by: Martin Storsjö <martin@martin.st>
2014-06-10 16:31:48 +03:00
Michael Niedermayer
06f576c4ab avcodec/x86/dct_init: fix build failure with clang && disable-optimizations
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-09 19:32:41 +02:00
James Almer
6d408495b5 x86/dct32: don't build ff_dct32_float_sse on x86_64
There's an SSE2 version already, and technically the SSE version
on x86_64 was wrong (using pshufd and pshuflw, SSE2 instructions).

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-09 00:51:43 +02:00
James Almer
fc8db12a73 x86/vp9: inital AVX2 intra_pred
tos3k-vp9-b10000.webm on a Core i5-4200U @1.6GHz

1219 decicycles in ff_vp9_ipred_dc_32x32_ssse3, 131070 runs, 2 skips
439 decicycles in ff_vp9_ipred_dc_32x32_avx2, 131070 runs, 2 skips

3570 decicycles in ff_vp9_ipred_dc_top_32x32_ssse3, 4096 runs, 0 skips
2494 decicycles in ff_vp9_ipred_dc_top_32x32_avx2, 4096 runs, 0 skips

1419 decicycles in ff_vp9_ipred_dc_left_32x32_ssse3, 16384 runs, 0 skips
717 decicycles in ff_vp9_ipred_dc_left_32x32_avx2, 16384 runs, 0 skips

2737 decicycles in ff_vp9_ipred_tm_32x32_avx, 1024 runs, 0 skips
2088 decicycles in ff_vp9_ipred_tm_32x32_avx2, 1024 runs, 0 skips

3090 decicycles in ff_vp9_ipred_v_32x32_avx, 512 runs, 0 skips
2226 decicycles in ff_vp9_ipred_v_32x32_avx2, 512 runs, 0 skips

1565 decicycles in ff_vp9_ipred_h_32x32_avx, 1024 runs, 0 skips
922 decicycles in ff_vp9_ipred_h_32x32_avx2, 1024 runs, 0 skips

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-08 02:37:20 +02:00
James Almer
ec98f80af4 x86/dsputil: move some mmx init code inside dsputil_init_mmx()
This reduces differences with the fork

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-06 05:26:04 +02:00
Christophe Gisquet
ccff45a0d3 apedsp: move to llauddsp
APE is not the sole codec using scalarproduct_and_madd_int16.

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-05 20:31:59 +02:00
Michael Niedermayer
d5c9d055ea avcodec/x86/dsputilenc_mmx: fix build without yasm
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-04 05:39:03 +02:00
James Almer
625ffa1457 x86/motion_est: sad_{x, y}2_mmxext functions are bitexact
Only the xy2 functions aren't.

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-04 00:48:35 +02:00
Timothy Gu
108dec3055 x86: dsputilenc: convert hf_noise*_mmx to yasm
Signed-off-by: Timothy Gu <timothygu99@gmail.com>
Several bugfixes by: Christophe Gisquet <christophe.gisquet@gmail.com>
See: [FFmpeg-devel] [WIP] [PATCH 4/4] x86: dsputilenc: convert hf_noise*_mmx to yasm
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-03 23:59:43 +02:00
Christophe Gisquet
dcd2a6ca36 x86: hevc_mc: remove unneeded shift
The immediate value may be 0.

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-01 23:34:33 +02:00
Christophe Gisquet
09fc28aed1 x86: hevcdsp_init: fix macro usage
The macro was not using the parameter but unconditionally using sse4.

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-01 23:20:07 +02:00
James Almer
e1bd40fe6b x86/motion_est: enable sad16_sse2 on k10 CPUs
The check is meant for k8 CPUs. sad16_sse2 is ~20% faster than sad16_mmxext on k10.

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-01 02:10:32 +02:00
James Almer
f128342df2 build: fix compilation of svq1enc_mmx.c with --disable-mmx
It's needed for ff_svq1enc_init_x86() even if simd functions are disabled.

Alternatively, svq1enc_init.c could be made and the relevant code moved there.

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-31 00:38:24 +02:00
James Almer
4ac41a52e2 x86/huffyuvdsp: fix some prototypes
Remove duplicate prototypes and fix int -> intptr_t in another

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-31 00:29:00 +02:00
Christophe Gisquet
d136fe6fd7 x86: huffyuvdsp: fewer functions for x86_64
When there are 2 functions that are <= SSE2, only one is needed for x86_64.

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-30 21:39:06 +02:00
Timothy Gu
154cee9292 x86: dsputilenc: convert ff_sse{8, 16}_mmx() to yasm
Signed-off-by: Timothy Gu <timothygu99@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-30 16:57:52 +02:00
Timothy Gu
0b6292b7b8 x86: dsputilenc: move all the function prototypes together
Signed-off-by: Timothy Gu <timothygu99@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-30 16:18:10 +02:00
Christophe Gisquet
f743fa9c7f x86: huffyuvdsp: add_hfyu_left_pred_bgr32
C   MMX   SSE2
Cycles: 3092  1053  578

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-30 15:20:36 +02:00
Michael Niedermayer
7be79c76d3 avcodec/huffyuvdsp: Change w to intptr in add_hfyu_median_pred() and add_hfyu_left_pred()
This avoids potential issues with the high 32bits being random in x86-64 asm

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-30 15:12:58 +02:00
Christophe Gisquet
884078d2df x86: huffyuvdsp: add SSE2 median prediction
From 5010c to 4566 on lagarith YUY2.

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-30 14:57:57 +02:00
Michael Niedermayer
8c891d90ca avcodec/x86/qpeldsp_init: Restore author attribution
See: 368f50359e
See: 44eb495128, and many others
See:
similarity index 83%
copy from libavcodec/x86/dsputil_init.c
copy to libavcodec/x86/qpeldsp_init.c
index ebbf97f..8f296a1 100644
--- a/libavcodec/x86/dsputil_init.c
+++ b/libavcodec/x86/qpeldsp_init.c
@@ -1,6 +1,5 @@
 /*
- * Copyright (c) 2000, 2001 Fabrice Bellard
- * Copyright (c) 2002-2004 Michael Niedermayer <michaelni@gmx.at>
+ * quarterpel DSP functions
  *
  * This file is part of FFmpeg.
  *

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-30 04:05:29 +02:00
Michael Niedermayer
98a6806fdd Merge commit '368f50359eb328b0b9d67451f56fda20b3255f9a'
* commit '368f50359eb328b0b9d67451f56fda20b3255f9a':
  dsputil: Split off quarterpel bits into their own context

Conflicts:
	configure
	libavcodec/dsputil.c
	libavcodec/h263dec.c
	libavcodec/mpegvideo.c
	libavcodec/mpegvideo_enc.c
	libavcodec/vc1dec.c
	libavcodec/vc1dsp.c
	libavcodec/x86/dsputil_init.c
	libavcodec/x86/qpeldsp.asm

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-30 02:43:34 +02:00
Michael Niedermayer
40f3a87c10 Merge commit '054013a0fc6f2b52c60cee3e051be8cc7f82cef3'
* commit '054013a0fc6f2b52c60cee3e051be8cc7f82cef3':
  dsputil: Move APE-specific bits into apedsp

Conflicts:
	libavcodec/arm/int_neon.S
	libavcodec/x86/dsputil.asm

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-30 00:59:15 +02:00
Michael Niedermayer
c814a6c778 avcodec/x86/svq1enc_mmx: Add author attribution
See: 5900637219
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-30 00:30:05 +02:00
Michael Niedermayer
ea0931fb96 Merge commit '65d5d5865845f057cc6530a8d0f34db952d9009c'
* commit '65d5d5865845f057cc6530a8d0f34db952d9009c':
  dsputil: Move SVQ1 encoding specific bits into svq1enc

Conflicts:
	libavcodec/x86/Makefile

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-30 00:01:45 +02:00
James Almer
02a3e327f1 x86/dsputilenc: add missing guards to ff_pix_sum16_xop
XOP support was added in Yasm 1.0.0 and Nasm 2.06, and we still
support older versions.

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-29 22:31:28 +02:00
Christophe Gisquet
99a319c4e7 x86: huffyuvdsp: port add_bytes to yasm
C   MMX  SSE2
Cycles: 2972  587  302

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-29 21:56:00 +02:00
Christophe Gisquet
2267003981 x86: hpeldsp: better factorization
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-29 21:47:40 +02:00
Michael Niedermayer
7b4c46050e rename add_hfyu_left_prediction_int16 to add_hfyu_left_pred_int16
This makes the naming more consistent with the 8bit variant

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-29 19:50:44 +02:00
Michael Niedermayer
550ae6c02f rename add_hfyu_median_prediction_int16 to add_hfyu_median_pred_int16
This makes the naming more consistent with the 8bit variant

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-29 19:49:29 +02:00
Michael Niedermayer
40a4ab8ba4 rename sub_hfyu_median_prediction_int16 to sub_hfyu_median_pred_int16
This makes the naming more consistent with the 8bit variant

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-29 19:48:23 +02:00
James Almer
05de4d3011 x86/dsputilenc: implement XOP version of pix_sum16
SSE2: 137 cycles
XOP:   87 cycles
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-29 18:40:23 +02:00
Diego Biurrun
368f50359e dsputil: Split off quarterpel bits into their own context 2014-05-29 06:48:31 -07:00
Diego Biurrun
054013a0fc dsputil: Move APE-specific bits into apedsp 2014-05-29 06:41:15 -07:00
Diego Biurrun
65d5d58658 dsputil: Move SVQ1 encoding specific bits into svq1enc 2014-05-29 06:41:15 -07:00
Michael Niedermayer
b50559fc0b libavcodec/x86/dsputilenc: drop and 0xffff that should have becomei redundant
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-29 00:16:52 +02:00
James Almer
561bfc85eb x86/dsputilenc: implement SSE2 versions of pix_{sum16, norm1}
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-28 23:29:34 +02:00
Christophe Gisquet
0810608e23 x86: hevc_mc: better register allocation
The xmm reg count was incorrect, and manual loading of the gprs
furthermore allows to noticeable reduce the number needed.

The modified functions are used in weighted prediction, so only a
few samples like WP_* exhibit a change. For this one and Win64
(some widths removed because of too few occurrences):

WP_A_Toshiba_3.bit, ff_hevc_put_hevc_uni_w
         16    32
before: 2194  3872
after:  2119  3767

WP_B_Toshiba_3.bit, ff_hevc_put_hevc_bi_w
         16    32    64
before: 2819  4960  9396
after:  2617  4788  9150

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-28 17:39:34 +02:00
Michael Niedermayer
48a6916308 Merge commit '512f3ffe9b4bb86767c2b1176554407c75fe1a5c'
* commit '512f3ffe9b4bb86767c2b1176554407c75fe1a5c':
  dsputil: Split off HuffYUV encoding bits into their own context

Conflicts:
	configure
	libavcodec/dsputil.c
	libavcodec/dsputil.h
	libavcodec/huffyuv.h
	libavcodec/huffyuvenc.c
	libavcodec/pngenc.c
	libavcodec/x86/dsputilenc_mmx.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-28 00:03:59 +02:00
Michael Niedermayer
e2abc0d5ca Merge commit '0d439fbede03854eac8a978cccf21a3425a3c82d'
* commit '0d439fbede03854eac8a978cccf21a3425a3c82d':
  dsputil: Split off HuffYUV decoding bits into their own context

Conflicts:
	configure
	libavcodec/dsputil.c
	libavcodec/dsputil.h
	libavcodec/huffyuv.h
	libavcodec/huffyuvdec.c
	libavcodec/lagarith.c
	libavcodec/vble.c
	libavcodec/x86/Makefile
	libavcodec/x86/dsputil.asm
	libavcodec/x86/dsputil_init.c
	libavcodec/x86/dsputil_mmx.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-27 23:16:06 +02:00
Diego Biurrun
512f3ffe9b dsputil: Split off HuffYUV encoding bits into their own context
Also shorten HuffYUV context member names to avoid clutter.
2014-05-27 08:54:53 -07:00
Diego Biurrun
0d439fbede dsputil: Split off HuffYUV decoding bits into their own context
Also shorten HuffYUV context member names to avoid clutter.
2014-05-27 08:52:34 -07:00
James Almer
5863207086 x86/dsputilenc: use HADDD in ff_sse16_sse2
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-27 15:12:50 +02:00
James Almer
e64e079ece x86/dsputilenc: implement SSE2 version of diff_pixels
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-27 05:55:11 +02:00
Michael Niedermayer
a0c5cd3475 avcodec/x86/dsputilenc: set the count of SSE registers correctly for get_pixels
Found-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-27 05:52:25 +02:00
Christophe Gisquet
86ae0da60c x86: hpeldsp: propagate changes across codecs
Some codecs still use mmx versions, so have them use the versions
with newer instruction sets.

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-26 15:37:04 +02:00
Michael Niedermayer
a3950a90f6 Revert "x86: dsputilenc: convert ff_sse{8, 16}_mmx() to yasm"
This reverts commit ad733089b0.

breaks with --disable-yasm
revert requested by: Christophe Gisquet <christophe.gisquet@gmail.com>
2014-05-25 19:42:18 +02:00
Timothy Gu
ad733089b0 x86: dsputilenc: convert ff_sse{8, 16}_mmx() to yasm
Signed-off-by: Timothy Gu <timothygu99@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-25 16:30:08 +02:00
James Almer
d94e255dd1 x86/dsputilenc: make the SUM_ABS_DCTELEM macro more readable
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-25 02:03:54 +02:00
James Almer
61eea421b2 x86/dsputilenc: port sum_abs_dctelem functions to yasm
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-24 21:46:25 +02:00
Christophe Gisquet
81aa0f4604 x86: hpeldsp: implement SSSE3 version of _xy2
Loading pb_1 rather than pw_8192 was benchmarked to be more efficient.
Loading of the 2 yields no advantage. Loading of one saves ~11 cycles.

decicycles count:
put8:  3223(mmx)    -> 2387
avg8:  2863(mmxext) -> 2125
put16: 4356(sse2)   -> 3553
avg16: 4481(sse2)   -> 3513

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-24 15:15:56 +02:00
Christophe Gisquet
9722a6a3f3 x86: hpeldsp: implement SSE2 put_pixels16_xy2
This is obviously equivalent to the avg version, without the avg.

3223(mmx) -> 2006(sse2)

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-24 03:45:17 +02:00
Christophe Gisquet
f0aca50e0b x86: hpeldsp: implement SSE2 versions
Those are mostly used in codecs older than H.264, eg MPEG-2.

put16 versions:
      mmx  mmx2  sse2
x2:  1888  1185   552
y2:  1778  1092   510

avg16 xy2: 3509(mmx2) -> 2169(sse2)

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-24 03:29:48 +02:00
James Almer
7538ad2248 x86/hevc_deblock: improve chroma functions register allocation
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-24 01:16:26 +02:00
James Almer
584327f22f x86/dsputil: fix argument declaration in vector_clipf
Should fix fate failures in msvc x86_64

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-23 23:10:17 +02:00
James Almer
518cbf9b4a x86/dsputil: fix VECTOR_CLIP_INT32 macro
The inline loop was incrementing and using the value of %%i
the wrong way.

Disassembly of ff_vector_clip_int32_sse2 before and after
this patch:

    movdqa (%rdx),%xmm0      |  movdqa (%rdx),%xmm0
    movdqa 0x10(%rdx),%xmm1  |  movdqa 0x10(%rdx),%xmm1
    movdqa 0x20(%rdx),%xmm2  |  movdqa 0x20(%rdx),%xmm2
    movdqa 0x30(%rdx),%xmm3  |  movdqa 0x30(%rdx),%xmm3
[...]                        |
    movdqa %xmm0,(%rcx)      |  movdqa %xmm0,(%rcx)
    movdqa %xmm1,0x10(%rcx)  |  movdqa %xmm1,0x10(%rcx)
    movdqa %xmm2,0x20(%rcx)  |  movdqa %xmm2,0x20(%rcx)
    movdqa %xmm3,0x30(%rcx)  |  movdqa %xmm3,0x30(%rcx)
    movdqa (%rdx),%xmm0      |  movdqa 0x40(%rdx),%xmm0
    movdqa 0x20(%rdx),%xmm1  |  movdqa 0x50(%rdx),%xmm1
    movdqa 0x40(%rdx),%xmm2  |  movdqa 0x60(%rdx),%xmm2
    movdqa 0x60(%rdx),%xmm3  |  movdqa 0x70(%rdx),%xmm3
[...]                        |
    movdqa %xmm0,(%rcx)      |  movdqa %xmm0,0x40(%rcx)
    movdqa %xmm1,0x20(%rcx)  |  movdqa %xmm1,0x50(%rcx)
    movdqa %xmm2,0x40(%rcx)  |  movdqa %xmm2,0x60(%rcx)
    movdqa %xmm3,0x60(%rcx)  |  movdqa %xmm3,0x70(%rcx)
    add    $0x80,%rdx        |  add    $0x80,%rdx
    add    $0x80,%rcx        |  add    $0x80,%rcx

Other versions were unaffected.

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-23 22:59:55 +02:00
James Almer
6a4832caae x86/diracdsp: mark all functions as yasm
No inline asm dirac code remains in the tree, so replace every relevant check.
This also moves all the dirac functions from dsputil_mmx.c to diracdsp_mmx.c

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-23 15:02:42 +02:00
James Almer
1d36defe94 x86/dsputil: port ff_vector_clipf_sse to yasm
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-23 00:08:21 +02:00
Christophe Gisquet
c081ca851c x86: hpeldsp: avg_pixels_xy2 for mmx2&3dnow
This is a port of the inline assembly of the mmx version to use the
pavg(us|)b instruction.

        8    16
mmx   1498  4355
mmx2  1242  3509

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-22 20:17:49 +02:00
Christophe Gisquet
17ac998055 x86: hpeldsp: mark _xy2 versions as approximate
Currently, only the mmx version is bitexact, the others (mmxext and
3dnow) are not, in spite of their naming.

Therefore, make their name more obvious. Also restore a comment that
was removed in 71155d7b.

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-22 20:17:45 +02:00
Christophe Gisquet
f8de35ebc4 x86: hpeldsp: kill hpeldsp_mmx.c
before:
1987 decicycles in 8_x2, 262121 runs, 23 skips

after:
1902 decicycles in 8_x2, 262112 runs, 32 skips

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-22 20:17:40 +02:00
James Almer
80ee2dfcf6 x86/dsputil: port ff_put_signed_pixels_clamped_mmx to yasm
Also add an SSE2 version

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-21 23:33:45 +02:00
James Almer
7b05267239 x86/dsputil: port clear_block functions to yasm
Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: Christophe Gisquet <christophe.gisquet@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-21 23:33:45 +02:00
Michael Niedermayer
3d4e365073 avcodec/x86/hpeldsp_init: remove redundant if()
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-21 13:38:27 +02:00
Hendrik Leppkes
cd9e08e110 hpeldsp: fix build without inline asm
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-21 13:37:38 +02:00
Christophe Gisquet
d1a32c3f49 x86: kill fpel_mmx.c
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-21 03:25:08 +02:00
James Almer
d43c303038 x86/hevc_deblock: use constants instead of generating values at runtime
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-19 23:09:33 +02:00
James Almer
057ebf1222 x86/hevc_deblock: remove some duplicated instructions
Also remove a couple unnecessary cmps

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-18 23:28:17 +02:00
Christophe Gisquet
f1793fe9cd x86: hevc_mc: specify coefficients registers
By default, macro EPEL_FILTER loads the coefficients inconditionally
into m14/m15. This forces an unneeded higher register count.

Reduce that count by making them parameters of EPEL_FILTER.

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-18 16:23:58 +02:00
Carl Eugen Hoyos
ef2713747f Fix compilation of libavcodec/x86/hevc_deblock.asm with nasm.
Suggested-by: Reimar
2014-05-17 12:50:55 +02:00
James Almer
be1fbc02b8 x86/hevc_deblock: use movhps instead of shuffling values
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-17 05:40:14 +02:00
James Almer
8aac77fede x86/hevc_deblock: fix label names
Also remove some unnecessary jmps

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-17 05:40:08 +02:00
James Almer
521eaea63a x86/hevc_deblock: fix usage of ABS1
The second argument is a temp register for non-SSSE3 cases

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-17 05:39:55 +02:00
James Almer
45110d2290 x86/hevc_deblock: merge movs with other instructions
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-17 05:39:34 +02:00
plepere
ef7c4cd001 avcodec/x86/hevc: updated to use x86util macros
Reviewed-by: James Almer <jamrial@gmail.com>
Reviewed-by: Ronald S. Bultje
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-16 21:11:07 +02:00
plepere
de7b89fd43 avcodec/x86/hevc: added DBF assembly functions
Reviewed-by: James Almer <jamrial@gmail.com>
Reviewed-by: Ronald S. Bultje
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-16 21:11:03 +02:00
Michael Niedermayer
bebce653e5 avcodec/x86/dsputil_mmx: Fix build with clang-usan
Found-by: Katerina Barone-Adesi
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-15 23:56:39 +02:00
Christophe Gisquet
d1310c591e x86: sbrdsp: implement SSE qmf_deint_neg
From 133 (unrolled av_intfloat32 C) to 59 cycles on Arrandale/Win64.

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-15 23:11:18 +02:00
Hendrik Leppkes
87f2d8079a hevcdsp: correctly indicate that hevc_put_hevc_bi_epel_h uses 9 GPRs
Fixes FATE on Windows.

Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-12 17:00:48 +02:00
James Almer
8e07800001 hevcdsp: include stddef.h for ptrdiff_t definition
Including stdint.h was enough for systems like Mingw, but apparently not for Linux.
This should fix make checkheaders failures on every platform

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-10 18:23:30 +02:00
James Almer
fa23190a7a hevcdsp: add missing header include
Fixes make checkheaders

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-10 14:55:03 +02:00
Michael Niedermayer
341cacb9ac avcodec/x86/hevcdsp_init: fix build failure with --disable-mmx
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-09 05:16:27 +02:00
plepere
63832e01c3 hvcodec/x86/hevcdsp: make macros more modular to support functions that are not sse4
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-09 00:14:50 +02:00
Matt Oliver
1898c2f49d inline asm: fix arrays as named constraints.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-07 15:02:45 +02:00
Michael Niedermayer
fc7d0d8201 avcodec/x86/hevcdsp_init: fix SSE4 checks
Found-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-06 18:27:49 +02:00
Michael Niedermayer
7be230b5fa avcodec/x86/Makefile: remove duplicate line
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-06 18:23:42 +02:00
Michael Niedermayer
3b3db02f2e avcodec/x86/hevcdsp_init: fix build on 32bit
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-06 18:23:42 +02:00
plepere
7a2491c436 HEVC : added assembly MC functions
pretty print x86

Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-06 18:23:36 +02:00
Matt Oliver
ac9869ffb0 x86/mpegaudiodsp.c: msvc compilation error without sse/avx_external
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-06 14:15:02 +02:00
Michael Niedermayer
ebf2c2c3a8 avcodec/lossless_videodsp: fix incompatible pointer type warning
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-05-05 05:49:18 +02:00
Matt Oliver
3c3e02b8d1 x86/cavdsp: prevent named constraints appearing twice. 2014-05-03 17:47:55 +02:00
James Almer
5ac10d40fb x86/mpegaudiodsp: define apply_window_mp3 as SSE
None of the handwritten asm in this function seems to be SSE2

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-25 00:38:01 +02:00
Hendrik Leppkes
5809c2a99d vc1dsp: fix build without inline asm
Reviewed-by: Christophe Gisquet <christophe.gisquet@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-22 14:01:53 +02:00
Clément Bœsch
62d31307c1 avcodec/x86/vp9lpf: add a comment above a bunch of SWAP. 2014-04-20 21:33:58 +02:00
Clément Bœsch
f0d368d758 avcodec/x86/vp9lpf: merge a few movs with other instructions. 2014-04-20 21:29:11 +02:00
Christophe Gisquet
319235c67c vc1dsp: introduce cases for 8x8 and 16x16
This allows further unrolling the DSP implementation where possible.

x86 and ARM DSP modified by simply moving the multiple calls from vc1dec
to the DSP code. Decoding improvements should only occurs because of the
compiler actually able to unroll more.

Decoding time: ~8.80s -> 8.64s (ie around 2%)

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-20 18:25:36 +02:00
Clément Bœsch
010732b73a vp9/x86: simplify FILTER_INIT.
In the 2 FILTER_INIT usages, the source is already preloaded so that
extra complexity taken from FILTER_UPDATE is not necessary.

Also add forgotten "mask" argument in FILTER_{INIT,UPDATE} comments.
2014-04-19 17:30:33 +02:00
Clément Bœsch
b8d002dc95 vp9/x86: clarify mixed splatb. 2014-04-19 17:00:51 +02:00
Carl Eugen Hoyos
b38910c979 Fix compilation with !HAVE_6REGS.
Can be tested with:
$ ./configure --cc='cc -m32' --disable-optimizations --enable-pic
2014-04-19 09:56:01 +02:00
Carl Eugen Hoyos
72c93abaad Use MANGLE in cavsdsp.c to save two registers using gcc.
Fixes compilation with !HAVE_6REGS.
2014-04-19 09:54:26 +02:00
James Almer
197fe392db x86/dsputil: use HADDD where applicable
Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-17 14:15:35 +02:00
James Almer
76ed71a72b x86: move horizontal add macros to x86util
Also port relevant AVX2/XOP optimizations from x264 with permission
to relicense to LGPL from the corresponding authors

Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-17 14:15:09 +02:00
Michael Niedermayer
46d5625f44 avcodec/x86/idct_sse2_xvid: fix non C99 inline function
Found-by: Matt Oliver <protogonoi@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-14 18:04:57 +02:00
James Almer
0f524b6c69 x86/synth_filter: remove the fma3 version ifdefs
This fixes compilation failures with --disable-fma3

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
2014-04-13 11:29:28 +02:00
Timothy Gu
71c32ed533 DNxHD: convert inline asm to yasm 2014-04-11 12:09:09 +02:00
Timothy Gu
676856204b DNxHD: make get_pixel_8x4_sym accept ptrdiff_t as stride 2014-04-11 12:09:09 +02:00
Matt Oliver
d1e6e5c887 avcodec/x86: Exclude broken get_cabac under icl.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-10 17:47:22 +02:00
Matt Oliver
158a80cc0b Remove leal op to fix icl inline asm.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-07 13:02:54 +02:00
Hendrik Leppkes
fc7e02f0ff dcadsp: fix SSE code to not use SSE2 instructions.
movq from SSE register to memory is an SSE2 instruction.
Instead, use SSE movlps, which does the same thing.

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-06 18:31:22 +02:00
Michael Niedermayer
e6f69b324e Merge commit '57b5b84e208ad61ffdd74ad849bed212deb92bc5'
* commit '57b5b84e208ad61ffdd74ad849bed212deb92bc5':
  x86: dsputil: Move ff_apply_window_int16_* bits to ac3dsp, where they belong

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-05 19:36:21 +02:00
Michael Niedermayer
e3c3f277a9 Merge commit 'c2c5be57494e6117086771bca34c8cd4c72c8e99'
* commit 'c2c5be57494e6117086771bca34c8cd4c72c8e99':
  x86: h264_qpel: Simplify an #if conditional

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-05 19:30:44 +02:00
Michael Niedermayer
ebb21887b8 Merge commit '01c5779f56cf708e6cb88b11cfdc248cae7e2ee8'
* commit '01c5779f56cf708e6cb88b11cfdc248cae7e2ee8':
  x86: Drop some unnecessary YASM ifdefs

Conflicts:
	libavfilter/x86/vf_yadif_init.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-05 19:16:39 +02:00
Michael Niedermayer
874f27a8f7 Merge commit 'b42f49e42f8cde25a788b2d13d03e99ca2956647'
* commit 'b42f49e42f8cde25a788b2d13d03e99ca2956647':
  x86: dsputil: Eliminate some unnecessary dsputil_x86.h #includes

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-05 19:05:00 +02:00
Michael Niedermayer
5440151fa4 Merge commit '3dc6272bed7890a49080e18eacf3c7a4a6594b0d'
* commit '3dc6272bed7890a49080e18eacf3c7a4a6594b0d':
  Remove a number of unnecessary dsputil.h #includes

Conflicts:
	libavcodec/h264pred.c
	libavcodec/vc1dsp.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-05 18:54:15 +02:00
James Almer
a1ac12bddd x86/dcadsp: add ff_dca_lfe_fir0_fma3
~10% faster than the SSE version.

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-05 13:55:59 +02:00
James Almer
7d2116dd09 x86/synth_filter: compile avx and fma3 functions unconditionally
Fixes compilation failures with "--disable-{avx,fma3} --disable-optimizations"

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-05 05:15:27 +02:00
Michael Niedermayer
490d53e335 avcodec/x86/dcadsp_init: fix compilation failure without FMA3
alternatively the call could be put under #if or the #if
over the function removed

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-05 00:11:48 +02:00
Michael Niedermayer
51fd962c0b Merge commit 'c74b86699c86bdf62e8570f41d8a38be5710baa3'
* commit 'c74b86699c86bdf62e8570f41d8a38be5710baa3':
  x86/synth_filter: add synth_filter_fma3
  x86/synth_filter: add synth_filter_avx
  x86/synth_filter: add synth_filter_sse

Conflicts:
	libavcodec/x86/dcadsp.asm
	libavcodec/x86/dcadsp_init.c

See: 6467209836
See: 68c3ed936a
See: 7fd64e3e36
See: aa1f38015c
See: dfd865e51b
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-04 23:40:08 +02:00
Christophe Gisquet
dfd865e51b x86/synth_filter: remove the main loop when it's not needed
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-04 22:35:45 +02:00
Diego Biurrun
57b5b84e20 x86: dsputil: Move ff_apply_window_int16_* bits to ac3dsp, where they belong 2014-04-04 19:08:05 +02:00
Diego Biurrun
c2c5be5749 x86: h264_qpel: Simplify an #if conditional
The extra conditions are covered by previous #ifs and conditional compilation.
2014-04-04 19:08:05 +02:00
Diego Biurrun
01c5779f56 x86: Drop some unnecessary YASM ifdefs
Dead code elimination is enough to avoid undefined references in these cases.
2014-04-04 19:08:05 +02:00
Diego Biurrun
b42f49e42f x86: dsputil: Eliminate some unnecessary dsputil_x86.h #includes 2014-04-04 19:08:05 +02:00
Diego Biurrun
3dc6272bed Remove a number of unnecessary dsputil.h #includes 2014-04-04 19:08:05 +02:00
James Almer
c74b86699c x86/synth_filter: add synth_filter_fma3
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
2014-04-04 17:40:51 +02:00
James Almer
81e02fae6e x86/synth_filter: add synth_filter_avx
Sandy Bridge Win64:
180 cycles in ff_synth_filter_inner_sse2
150 cycles in ff_synth_filter_inner_avx

Also switch some instructions to a three operand format to avoid
assembly errors with Yasm 1.1.0 or older.

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
2014-04-04 17:40:51 +02:00
James Almer
2025d8026f x86/synth_filter: add synth_filter_sse
Build only on x86_32 targets.

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
2014-04-04 17:40:51 +02:00
Michael Niedermayer
fb61ed1e9f Merge commit 'ac4b32df71bd932838043a4838b86d11e169707f'
* commit 'ac4b32df71bd932838043a4838b86d11e169707f':
  On2 VP7 decoder

Conflicts:
	Changelog
	libavcodec/arm/h264pred_init_arm.c
	libavcodec/arm/vp8dsp.h
	libavcodec/arm/vp8dsp_init_arm.c
	libavcodec/arm/vp8dsp_init_armv6.c
	libavcodec/arm/vp8dsp_init_neon.c
	libavcodec/avcodec.h
	libavcodec/h264pred.c
	libavcodec/version.h
	libavcodec/vp8.c
	libavcodec/vp8.h
	libavcodec/vp8data.h
	libavcodec/vp8dsp.c
	libavcodec/vp8dsp.h
	libavcodec/x86/h264_intrapred_init.c
	libavcodec/x86/vp8dsp_init.c

See: 89f2f5dbd7 and others
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-04 14:46:10 +02:00
Peter Ross
ac4b32df71 On2 VP7 decoder
Further performance improvements and security fixes by
Vittorio Giovara, Luca Barbato and Diego Biurrun.

Signed-off-by: Vittorio Giovara <vittorio.giovara@gmail.com>
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
Signed-off-by: Diego Biurrun <diego@biurrun.de>
2014-04-04 04:00:11 +02:00
Matt Oliver
0f2588d7e5 Use intel compliant CDQ instead of CLTD in inline asm.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-30 23:14:36 +02:00
Clément Bœsch
c4148a6668 x86/vp9mc: add vp9 namespace. 2014-03-29 18:13:15 +01:00
Timothy Gu
9d34dce05b x86: convert DNxHDenc inline asm to yasm
Signed-off-by: Timothy Gu <timothygu99@gmail.com>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-27 23:16:17 +01:00
Timothy Gu
cb11b9e89e dnxhdenc: make get_pixel_8x4_sym accept ptrdiff_t as stride
Signed-off-by: Timothy Gu <timothygu99@gmail.com>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-27 23:09:10 +01:00
Michael Niedermayer
4998a72b49 Merge remote-tracking branch 'qatar/master'
* qatar/master:
  x86: hpeldsp: Keep all rnd_template instantiations in hpeldsp_init

Conflicts:
	libavcodec/x86/rnd_mmx.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-26 16:55:46 +01:00
Michael Niedermayer
0371eaebcd Merge commit 'aba70bb5387f12dfa5e6cd8cb861c9c7e668151f'
* commit 'aba70bb5387f12dfa5e6cd8cb861c9c7e668151f':
  Add missing headers to make template files compile (more) standalone

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-26 14:50:55 +01:00
Diego Biurrun
efc7290eb6 x86: hpeldsp: Keep all rnd_template instantiations in hpeldsp_init
There is no point in having a separate file just for the instantiation
that provides the public functions.
2014-03-26 04:31:27 -07:00
Diego Biurrun
aba70bb538 Add missing headers to make template files compile (more) standalone 2014-03-26 04:31:27 -07:00
Diego Biurrun
d0aabeab23 x86: h264_qpel: Fix typo in CALL_2X_PIXELS macro invocation
This fixes FATE with mmxext CPUFLAGS set.
2014-03-26 12:00:01 +01:00
Peter Ross
a490970af2 libavcodec/*/vp8dsp_init: indent
Signed-off-by: Peter Ross <pross@xvid.org>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-25 13:29:29 +01:00
Peter Ross
89f2f5dbd7 On2 VP7 decoder
Signed-off-by: Peter Ross <pross@xvid.org>
Reviewed-by: BBB
previous patch reviewed by jason
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-25 13:29:05 +01:00
Michael Niedermayer
c25d2cd20b avcodec/x86/mpegvideoenc_template: fix integer overflow
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-25 00:15:52 +01:00
Michael Niedermayer
c8246d3766 avcodec/x86/h264_qpel: Fix typo introduced by 322a1dda97
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-23 15:04:53 +01:00
Michael Niedermayer
74fed968d1 Merge commit '82dd1026cfc1d72b04019185bea4c1c9621ace3f'
* commit '82dd1026cfc1d72b04019185bea4c1c9621ace3f':
  x86: dsputil: Move hpeldsp-related declarations to a separate header

Conflicts:
	libavcodec/x86/dsputil_x86.h

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-22 23:21:54 +01:00
Michael Niedermayer
9333bba6ed Merge commit '6655c933a887a2d20707fff657b614aa1d86a25b'
* commit '6655c933a887a2d20707fff657b614aa1d86a25b':
  x86: dsputil: Move fpel declarations to a separate header

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-22 23:08:22 +01:00
Michael Niedermayer
77bc342975 Merge commit '322a1dda973e802db7b57f2007fad3efcd5bab81'
* commit '322a1dda973e802db7b57f2007fad3efcd5bab81':
  dsputil: Refactor duplicated CALL_2X_PIXELS / PIXELS16 macros

Conflicts:
	libavcodec/arm/hpeldsp_init_arm.c
	libavcodec/x86/dsputil_x86.h

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-22 22:53:33 +01:00
Michael Niedermayer
d6d3cfb0aa Merge commit '600b854ad8173995518bd917e7f86120b5505088'
* commit '600b854ad8173995518bd917e7f86120b5505088':
  imgconvert: Move ff_deinterlace_line_*_mmx declarations out of dsputil

Conflicts:
	libavcodec/imgconvert.c
	libavcodec/x86/dsputil_x86.h

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-22 22:17:54 +01:00
Michael Niedermayer
8fbc6e5911 Merge commit '1a8d0cf77ed2611e542ae98f341d4c43a04467bd'
* commit '1a8d0cf77ed2611e542ae98f341d4c43a04467bd':
  x86: dsputil: Move inline assembly macros to a separate header

Conflicts:
	libavcodec/x86/dsputil_mmx.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-22 22:11:27 +01:00
Diego Biurrun
82dd1026cf x86: dsputil: Move hpeldsp-related declarations to a separate header 2014-03-22 06:17:29 -07:00
Diego Biurrun
6655c933a8 x86: dsputil: Move fpel declarations to a separate header 2014-03-22 06:17:29 -07:00
Diego Biurrun
322a1dda97 dsputil: Refactor duplicated CALL_2X_PIXELS / PIXELS16 macros 2014-03-22 06:17:29 -07:00
Diego Biurrun
600b854ad8 imgconvert: Move ff_deinterlace_line_*_mmx declarations out of dsputil 2014-03-22 06:17:29 -07:00
Diego Biurrun
1a8d0cf77e x86: dsputil: Move inline assembly macros to a separate header 2014-03-22 06:17:29 -07:00
Matt Oliver
cd5cf395f6 Additional icl inline asm fix.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-22 14:07:03 +01:00
Michael Niedermayer
1cd107f637 avcodec/x86/snowdsp: add missing clobbers to inner_add_yblock_bw_8_obmc_16_bh_even_sse2() and inner_add_yblock_bw_16_obmc_32_sse2()
Note, these functions are currently disabled

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-21 03:38:48 +01:00
Michael Niedermayer
e98bac82e5 Merge commit '82bb3048013201c0095d2853d4623633d912252f'
* commit '82bb3048013201c0095d2853d4623633d912252f':
  dsputil: Use correct type in me_cmp_func function pointer

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-20 22:36:40 +01:00
Michael Niedermayer
011d83de48 Merge commit '0e083d7e43805db1a978cb57bfa25fda62e8ff18'
* commit '0e083d7e43805db1a978cb57bfa25fda62e8ff18':
  build: Group general components separate from de/encoders in arch Makefiles

Conflicts:
	libavcodec/arm/Makefile
	libavcodec/x86/Makefile

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-20 22:26:31 +01:00
Michael Niedermayer
ba85bfabf3 Merge commit '5169e688956be3378adb3b16a93962fe0048f1c9'
* commit '5169e688956be3378adb3b16a93962fe0048f1c9':
  dsputil: Propagate bit depth information to all (sub)init functions

Conflicts:
	libavcodec/arm/dsputil_init_arm.c
	libavcodec/arm/dsputil_init_armv5te.c
	libavcodec/arm/dsputil_init_armv6.c
	libavcodec/arm/dsputil_init_neon.c
	libavcodec/dsputil.c
	libavcodec/dsputil.h
	libavcodec/ppc/dsputil_ppc.c
	libavcodec/x86/dsputil_init.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-20 22:06:01 +01:00
Diego Biurrun
82bb304801 dsputil: Use correct type in me_cmp_func function pointer 2014-03-20 05:03:23 -07:00
Diego Biurrun
0e083d7e43 build: Group general components separate from de/encoders in arch Makefiles
This is in line with how the top-level libavcodec Makefile is structured.
2014-03-20 05:03:23 -07:00
Diego Biurrun
5169e68895 dsputil: Propagate bit depth information to all (sub)init functions
This avoids recalculating the value over and over again.
2014-03-20 05:03:23 -07:00
Carl Eugen Hoyos
57fdc74c34 Add one forgotten named inline asm operand in libavcodec/x86/motion_est.c. 2014-03-19 03:00:19 +01:00
Matt Oliver
8236747511 Automatically change MANGLE() into named inline asm operands when direct symbol reference in inline asm are not supported.
This is part of the patch-set for intel C inline asm on windows support

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-18 23:39:30 +01:00
Matt Oliver
b2d3a45598 avcodec/x86/mlpdsp: Only use asm when non-local inline asm lables are supported
This is part of the patch-set for intel C inline asm on windows support

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-18 23:37:50 +01:00
James Almer
aa1f38015c x86/synth_filter: improve FMA version
Replace mulps+subps with fnmaddps, resulting in two less instructions inside the
inner loops.
About 1% faster FMA3 performance.

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-17 21:04:15 +01:00
Matt Oliver
b73aae6fe9 avcodec/x86/idct_sse2_xvid: move offsets out of MANGLE()
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-17 04:19:59 +01:00
Matt Oliver
9eb3f11c55 Add missing external declarations.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-17 00:48:09 +01:00
Matt Oliver
590805b7c3 Fixed 64bit conformance with mvzbl.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-17 00:13:50 +01:00
Michael Niedermayer
5dd97d5809 Merge commit 'db3f61a04f1f66746660f921bb2780ddf1141f3b'
* commit 'db3f61a04f1f66746660f921bb2780ddf1141f3b':
  x86: dsputil_init: Drop some unnecessary parentheses

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-14 01:25:57 +01:00
Michael Niedermayer
27cab16ce7 Merge commit '441b093915717afa7d24be34bdab2a4911b30a57'
* commit '441b093915717afa7d24be34bdab2a4911b30a57':
  x86: dsputil_init: K&R formatting cosmetics

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-14 01:25:36 +01:00
Michael Niedermayer
236874a571 Merge commit '4cb4680c1087a2cd13d4b0c9167a2eb3147f99d8'
* commit '4cb4680c1087a2cd13d4b0c9167a2eb3147f99d8':
  x86: dsputil_x86.h: K&R formatting cosmetics

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-14 01:25:19 +01:00
Michael Niedermayer
925ce6faf4 Merge commit 'f8bbebecfd7ea3dceb7c96f931beca33f80a3490'
* commit 'f8bbebecfd7ea3dceb7c96f931beca33f80a3490':
  x86: motion_est: K&R formatting cosmetics

Conflicts:
	libavcodec/x86/motion_est.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-14 01:20:43 +01:00
Michael Niedermayer
b7a5f5dc66 Merge commit 'a36947c167d7278b891453083b57dc56b7a7f5c5'
* commit 'a36947c167d7278b891453083b57dc56b7a7f5c5':
  dsputilenc_mmx: K&R formatting cosmetics

Conflicts:
	libavcodec/x86/dsputilenc_mmx.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-14 01:09:57 +01:00
Michael Niedermayer
d926c4b240 Merge commit '38675229a879aa5258a8c71891fc8cbf74cf139f'
* commit '38675229a879aa5258a8c71891fc8cbf74cf139f':
  dsputil_mmx: K&R formatting cosmetics

Conflicts:
	libavcodec/x86/dsputil_mmx.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-14 01:01:37 +01:00
Michael Niedermayer
55f53f6c29 Merge commit '6a8b35dc88b4a1a452f192fbbf53ae7f59bc3f23'
* commit '6a8b35dc88b4a1a452f192fbbf53ae7f59bc3f23':
  dsputilenc_mmx: Merge two assignment blocks with identical conditions

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-14 00:57:25 +01:00
Michael Niedermayer
4104eb44e6 Merge commit '55519926ef855c671d084ccc151056de9e3d3a77'
* commit '55519926ef855c671d084ccc151056de9e3d3a77':
  x86: Make function prototype comments in assembly code consistent

Conflicts:
	libavcodec/x86/sbrdsp.asm

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-14 00:01:30 +01:00
Michael Niedermayer
a9b1936a4e Merge commit 'edd1f833fa145eb9c5026877c699ebe6efca00a0'
* commit 'edd1f833fa145eb9c5026877c699ebe6efca00a0':
  x86: h264_idct_10_bit: Use proper type in function prototype comments

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-14 00:00:16 +01:00
Michael Niedermayer
1c788eaca9 Merge commit '831a1180785a786272cdcefb71566a770bfb879e'
* commit '831a1180785a786272cdcefb71566a770bfb879e':
  Update dsputil- and SIMD-related comments to match reality more closely

Conflicts:
	libavcodec/x86/hpeldsp.asm
	libavutil/arm/float_dsp_init_arm.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-13 23:59:56 +01:00
Michael Niedermayer
d61e1156be Merge commit '17608f6ee3d2088cdb8d1e704276d8b34f01160d'
* commit '17608f6ee3d2088cdb8d1e704276d8b34f01160d':
  x86: Add some more missing headers

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-13 23:41:17 +01:00
Diego Biurrun
db3f61a04f x86: dsputil_init: Drop some unnecessary parentheses 2014-03-13 08:15:51 -07:00
Diego Biurrun
441b093915 x86: dsputil_init: K&R formatting cosmetics 2014-03-13 08:15:51 -07:00
Diego Biurrun
4cb4680c10 x86: dsputil_x86.h: K&R formatting cosmetics 2014-03-13 08:15:51 -07:00
Diego Biurrun
f8bbebecfd x86: motion_est: K&R formatting cosmetics 2014-03-13 08:15:51 -07:00
Diego Biurrun
a36947c167 dsputilenc_mmx: K&R formatting cosmetics 2014-03-13 08:15:51 -07:00
Diego Biurrun
38675229a8 dsputil_mmx: K&R formatting cosmetics 2014-03-13 08:15:51 -07:00
Diego Biurrun
6a8b35dc88 dsputilenc_mmx: Merge two assignment blocks with identical conditions 2014-03-13 08:15:51 -07:00
Diego Biurrun
55519926ef x86: Make function prototype comments in assembly code consistent
This helps grepping for functions, among other things.
2014-03-13 05:50:29 -07:00
Diego Biurrun
edd1f833fa x86: h264_idct_10_bit: Use proper type in function prototype comments 2014-03-13 05:50:29 -07:00
Diego Biurrun
831a118078 Update dsputil- and SIMD-related comments to match reality more closely 2014-03-13 05:50:29 -07:00
Diego Biurrun
17608f6ee3 x86: Add some more missing headers 2014-03-13 05:50:28 -07:00
Diego Biurrun
08dba0e1c3 x86: mpegvideoenc: Remove some remnants of the long-gone libmpeg2 IDCT 2014-03-13 05:50:28 -07:00
James Almer
9e0e1f9067 x86/dsputil: add emms to ff_scalarproduct_int16_mmxext()
Also undo the changes to ra144enc.c from previous commits.
Should fix ticket #3429

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-06 18:23:55 +01:00
Michael Niedermayer
2d99de66b7 Merge commit '3bfdee00cd92ff07c364d4901c4aefda32780756'
* commit '3bfdee00cd92ff07c364d4901c4aefda32780756':
  x86: dcadsp: Fix linking with yasm and optimizations disabled

Conflicts:
	libavcodec/x86/dcadsp_init.c

See: 206167a295
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-06 14:10:27 +01:00
Diego Biurrun
3bfdee00cd x86: dcadsp: Fix linking with yasm and optimizations disabled
Some optimized functions reference optimized symbols, so the functions
must be explicitly disabled when those symbols are unavailable.
2014-03-05 23:16:21 +01:00
Michael Niedermayer
146b476ba0 Merge commit '3741aa37c2a0d0717faff74a5c4cc357d16f6d1d'
* commit '3741aa37c2a0d0717faff74a5c4cc357d16f6d1d':
  x86: cabac: Use correct #includes to make header compile standalone

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-05 21:33:44 +01:00
Diego Biurrun
3741aa37c2 x86: cabac: Use correct #includes to make header compile standalone 2014-03-05 13:32:25 +01:00
James Almer
7fd64e3e36 x86/synth_filter: add synth_filter_fma3
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-05 01:58:16 +01:00
James Almer
206167a295 x86/synth_filter: add missing HAVE_YASM guard
Should fix compilation failures with --disable-yasm on some compilers

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-04 22:47:28 +01:00
James Almer
884e085d1e x86/synth_filter: Revert the switch to float ops with SSE2
This reverts the changes 6467209836
and 68c3ed936a did to the SSE2 version,
which generated a hit of about 5 cycles.

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-02 11:58:10 +01:00
James Almer
68c3ed936a x86/synth_filter: add synth_filter_avx
Sandy Bridge Win64:
180 cycles on ff_synth_filter_inner_sse2
150 cycles on ff_synth_filter_inner_avx

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-02 01:00:55 +01:00
James Almer
6467209836 x86/synth_filter: add synth_filter_sse
Build only on x86_32 targets.

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-01 15:32:40 +01:00
Michael Niedermayer
fb3c33f3cd Merge commit '4cb6964244fd6c099383d8b7e99731e72cc844b9'
* commit '4cb6964244fd6c099383d8b7e99731e72cc844b9':
  dcadec: simplify decoding of VQ high frequencies

Conflicts:
	configure
	libavcodec/dcadec.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-28 21:41:19 +01:00
Michael Niedermayer
baf3adc621 Merge commit '08e3ea60ff4059341b74be04a428a38f7c3630b0'
* commit '08e3ea60ff4059341b74be04a428a38f7c3630b0':
  x86: synth filter float: implement SSE2 version

Conflicts:
	libavcodec/x86/dcadsp.asm
	libavcodec/x86/dcadsp_init.c

See: 2cdbcc0048
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-28 20:38:39 +01:00
Christophe Gisquet
2cdbcc0048 x86: synth filter float: implement SSE2 version
Timings for Arrandale:
          C    SSE
win32:  2108   334
win64:  1152   322

Factorizing the inner loop with a call/jmp is a >15 cycles cost, even with
the jmp destination being aligned.

Unrolling for ARCH_X86_64 is a 20 cycles gain.

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-28 20:34:40 +01:00
Michael Niedermayer
e346a59383 Merge commit 'ad507d7907457e678900bac132122ba7be4644cb'
* commit 'ad507d7907457e678900bac132122ba7be4644cb':
  x86: dcadsp: implement SSE lfe_dir

Conflicts:
	libavcodec/x86/dcadsp.asm

See: 169243112c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-28 19:22:00 +01:00
Christophe Gisquet
169243112c x86: dcadsp: implement SSE lfe_dir
Results for Arrandale/Windows:
32: 1670 -> 316
64:  728 -> 298

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-28 19:20:03 +01:00
Michael Niedermayer
5ba1648318 Merge commit 'b23650491fbd579a4365f42bd42575afb7b53f7e'
* commit 'b23650491fbd579a4365f42bd42575afb7b53f7e':
  prores: Use consistent names for DSP arch initialization functions

Conflicts:
	libavcodec/proresdsp.c
	libavcodec/proresdsp.h
	libavcodec/x86/proresdsp_init.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-28 17:13:00 +01:00
Christophe Gisquet
4cb6964244 dcadec: simplify decoding of VQ high frequencies
The vector dequantization has a test in a loop preventing effective SIMD
implementation. By moving it out of the loop, this loop can be DSPized.

Therefore, modify the current DSP implementation. In particular, the
DSP implementation no longer has to handle null loop sizes.

The decode_hf implementations have following timings:

For x86 Arrandale:
        C  SSE SSE2 SSE4
win32: 260 162  119  104
win64: 242 N/A   89   72

The arm NEON optimizations follow in a later patch as external asm. The
now unused check for the y modifier in arm inline asm is removed from
configure.
2014-02-28 13:03:22 +01:00
Christophe Gisquet
08e3ea60ff x86: synth filter float: implement SSE2 version
Timings for Arrandale:
          C    SSE
win32:  2108   334
win64:  1152   322

Factorizing the inner loop with a call/jmp is a >15 cycles cost, even with
the jmp destination being aligned.

Unrolling for ARCH_X86_64 is a 20 cycles gain.

Signed-off-by: Janne Grunau <janne-libav@jannau.net>
2014-02-28 13:00:48 +01:00
Christophe Gisquet
ad507d7907 x86: dcadsp: implement SSE lfe_dir
Results for Arrandale/Windows:
32: 1670 -> 316
64:  728 -> 298

Signed-off-by: Janne Grunau <janne-libav@jannau.net>
2014-02-28 13:00:47 +01:00
Diego Biurrun
b23650491f prores: Use consistent names for DSP arch initialization functions 2014-02-28 10:34:55 +01:00
James Almer
2163a40a46 x86/imdct36: use sse3 instructions in the last BUTTERF step when possible
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-27 23:28:15 +01:00
James Almer
fbf98375e4 x86/imdct36: don't build imdct36_float_sse on x86_64 targets
There's an SSE2 version as well, and x86_64 guarantees that
instruction set is present.

Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-27 22:54:03 +01:00
James Almer
3f3d748cab x86: Move XOP emulation to x86util
We need the emulation to support the cases where the first
argument is the same as the fourth. To achieve this a fifth
argument working as a temporary may be needed.
Emulation that doesn't obey the original instruction semantics
can't be in x86inc.

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-24 08:30:19 +01:00
Michael Niedermayer
8372aaf721 Merge commit '017a06a9ee86b047079166c2694c9c655ff03356'
* commit '017a06a9ee86b047079166c2694c9c655ff03356':
  x86: dsputil: Use correct file name as multiple inclusion guard

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-20 14:58:04 +01:00
Diego Biurrun
017a06a9ee x86: dsputil: Use correct file name as multiple inclusion guard 2014-02-20 04:16:15 -08:00
Michael Niedermayer
130c33af35 Merge commit 'b23bc95920e2f10b9621857e829c45b064f356c0'
* commit 'b23bc95920e2f10b9621857e829c45b064f356c0':
  x86: dca: Add missing multiple inclusion guards

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-19 15:44:48 +01:00
Diego Biurrun
b23bc95920 x86: dca: Add missing multiple inclusion guards 2014-02-19 10:19:15 +01:00
Hendrik Leppkes
7716eda0aa vp9/x86: set correct number of registers used in intra pred asm
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-18 17:20:14 +01:00
James Almer
07b4b0ca62 tta/x86: add ff_ttafilter_process_dec_{ssse3, sse4}
Results are from a Win64 build running on an AMD FX 6300

1121 decicycles in ttafilter_process_dec_c, 16777112 runs, 104 skips
522 decicycles in ff_ttafilter_process_dec_ssse3, 16777149 runs, 67 skips
477 decicycles in ff_ttafilter_process_dec_sse4, 16777156 runs, 60 skips

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-17 13:51:19 +01:00
Ronald S. Bultje
fdb093c4e4 vp9/x86: intra prediction SIMD.
Partially based on h264_intrapred. (I hope to eventually merge these
two intrapred implementations back together.)
2014-02-17 13:39:00 +01:00
James Almer
ec482e738d x86/fladsp: add missing check to ff_flacdsp_init_x86()
Fixes compilation with flac decoder disabled and encoder enabled

Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-16 12:06:04 +01:00
Michael Niedermayer
d601106ab1 avcodec/x86/lossless_videodsp: fix w type
Fixes fate issues on mingw64

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-15 06:41:38 +01:00
Peter Ross
b8664c9294 avcodec/vp8dsp: add VP7 idct and loop filter
Signed-off-by: Peter Ross <pross@xvid.org>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-15 02:15:35 +01:00
James Almer
e87974bc00 flac/x86: add ff_flac_lpc_32_xop()
Tested on an AMD FX 6300

679081 decicycles in ff_flac_lpc_32_xop, 32768 runs
774425 decicycles in ff_flac_lpc_32_sse4, 32768 runs

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-13 22:14:59 +01:00
James Darnley
623f380a18 lavc: fix flac encoder and decoder dependencies
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-13 21:00:32 +01:00
Michael Niedermayer
df98b36aa6 Merge commit '5c1c6e82261b856214499b9fef3a08bf3ff6e0ae'
* commit '5c1c6e82261b856214499b9fef3a08bf3ff6e0ae':
  dca: include dcadsp.h in {arm,x86}/dca.h for checkheaders

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-08 17:25:31 +01:00
Michael Niedermayer
dd2b330347 Merge commit '0cffd6fff59f192120dc93aa6c3cb8180f5506e3'
* commit '0cffd6fff59f192120dc93aa6c3cb8180f5506e3':
  x86: use the inline int8x8_fmul_int32 only if inline SSE2 is availbale

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-08 17:06:57 +01:00
Janne Grunau
5c1c6e8226 dca: include dcadsp.h in {arm,x86}/dca.h for checkheaders 2014-02-08 13:38:36 +01:00
Janne Grunau
0cffd6fff5 x86: use the inline int8x8_fmul_int32 only if inline SSE2 is availbale
Fixes compilation with MSVC. Also does not rely on on earlier config.h
include but include it directly.
2014-02-08 12:10:56 +01:00
Clément Bœsch
669d4f9053 x86/vp9lpf: simplify 2nd transpose in 44/48/88/84.
For non-avx optims, this saves 8 movs.

before:
  1785 decicycles in ff_vp9_loop_filter_h_44_16_ssse3, 524129 runs, 159 skips
  3327 decicycles in ff_vp9_loop_filter_h_48_16_ssse3, 262116 runs, 28 skips
  2712 decicycles in ff_vp9_loop_filter_h_88_16_ssse3, 4193729 runs, 575 skips
  3237 decicycles in ff_vp9_loop_filter_h_84_16_ssse3, 524061 runs, 227 skips

after:
  1768 decicycles in ff_vp9_loop_filter_h_44_16_ssse3, 524062 runs, 226 skips
  3310 decicycles in ff_vp9_loop_filter_h_48_16_ssse3, 262107 runs, 37 skips
  2719 decicycles in ff_vp9_loop_filter_h_88_16_ssse3, 4193954 runs, 350 skips
  3184 decicycles in ff_vp9_loop_filter_h_84_16_ssse3, 524236 runs, 52 skips
2014-02-08 11:10:23 +01:00
Michael Niedermayer
82ae8a44e6 Merge commit '5b59a9fc6152169599561f04b4f66370edda5c9c'
* commit '5b59a9fc6152169599561f04b4f66370edda5c9c':
  x86: dcadsp: implement int8x8_fmul_int32

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-08 01:20:33 +01:00
Christophe Gisquet
5b59a9fc61 x86: dcadsp: implement int8x8_fmul_int32
For the callable function (as opposed to the inline one):
         C  SSE  SSE2  SSE4
Win32:  47   42   29    26
Win64:  30   33   25    23
The SSE version is neither compiled nor set for ARCH_X86_64, as the
inlinable function takes over.

Signed-off-by: Janne Grunau <janne-libav@jannau.net>
2014-02-07 22:52:40 +01:00
Loren Merritt
9c978f243a flac/x86: add ff_flac_lpc_32_sse4()
benchmarked on sandybridge x86_64:
1358232 decicycles in flac_lpc_32_c
1244575 decicycles in flac_lpc_32_sse4, James Almer's patch
 650045 decicycles in flac_lpc_32_sse4, this patch

I haven't tested the edgecases such as odd block lengths

odd block length tested-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-06 02:51:19 +01:00
Clément Bœsch
d92a725329 x86/vp9lpf: remove 8 SWAPs in 84/48 transpose. 2014-02-05 07:21:13 +01:00
Clément Bœsch
97dde561de x86/vp9lpf: remove braindead double pxor. 2014-02-05 07:21:11 +01:00
Clément Bœsch
9a3b05b0a9 x86/vp9lpf: save a few mov in flat8in/hev masks calc. 2014-02-05 07:21:09 +01:00
Clément Bœsch
91d85bb167 x86/vp9lpf: add ff_vp9_loop_filter_[vh]_44_16_{sse2,ssse3,avx}. 2014-02-05 07:21:06 +01:00
Michael Niedermayer
de17ccc774 Merge commit '51daafb02eaf96e0743a37ce95a7f5d02c1fa3c2'
* commit '51daafb02eaf96e0743a37ce95a7f5d02c1fa3c2':
  x86: videodsp: Properly mark sse2 instructions in emulated_edge_mc as such.

Conflicts:
	libavcodec/x86/videodsp_init.c

See: 1b3a7e1f42
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-31 14:30:30 +01:00
Clément Bœsch
c5dd73b890 x86/vp9lpf: add ff_vp9_loop_filter_h_{48,84}_16_{sse2,ssse3,avx}().
5.40s → 5.30s overall decode time with -threads 1 on ped1080p.webm
(i7 920, ssse3)
2014-01-30 19:34:13 +01:00
Ronald S. Bultje
9ee9c679a7 x86: videodsp: Fix a bug in a %if statement where we used '%%' instead of '&&'.
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
2014-01-30 15:33:23 +01:00
Ronald S. Bultje
51daafb02e x86: videodsp: Properly mark sse2 instructions in emulated_edge_mc as such.
Should fix crashes or corrupt output on pre-SSE2 CPUs when they were
using SSE2-code (e.g. AMD Athlon XP 2400+ or Intel Pentium III) in
hfix or hvar single-edge (left/right) extension functions.

Signed-off-by: Janne Grunau <janne-libav@jannau.net>
2014-01-30 15:30:01 +01:00
James Almer
644c32ea4b x86/vp9lpf: add ff_vp9_loop_filter_[vh]_88_16_sse2()
Similar gains as the ssse3 version once again

Signed-off-by: James Almer <jamrial@gmail.com>
2014-01-28 09:30:55 +01:00
Clément Bœsch
222c46c531 x86/vp9lpf: add ff_vp9_loop_filter_[vh]_88_16_{ssse3,avx}.
9680 decicycles in loop_filter_v_88_16_c, 4193765 runs, 539 skips
9233 decicycles in loop_filter_h_88_16_c, 4193751 runs, 553 skips

1929 decicycles in ff_vp9_loop_filter_v_88_16_ssse3, 4194118 runs, 186 skips
2738 decicycles in ff_vp9_loop_filter_h_88_16_ssse3, 4193861 runs, 443 skips

5.978 → 5.417 overall decode time on ped1080p.webm (-threads 1)

Adding SSE2 support should be relatively trivial (just a matter of
changing the pshufb [mask_mix] with something else), patch welcome.
2014-01-28 07:36:38 +01:00
Clément Bœsch
822385d775 x86/vp9lpf: add a preload system in FILTER_UPDATE.
Allow some macro refactoring in filter14().
2014-01-27 22:39:26 +01:00
Clément Bœsch
315b4775ad x86/vp9lpf: refactor v/h using common macros for P7 to Q7. 2014-01-27 22:39:26 +01:00
Clément Bœsch
5d144086cc x86/vp9lpf: faster P7..Q7 accesses.
Introduce 2 additional registers for stride3 and mstride3 to allow
direct accesses (lea drops).

3931 → 3827 decicycles in ff_vp9_loop_filter_v_16_16_ssse3

Also uses defines to clarify the code.
2014-01-27 22:37:42 +01:00
Clément Bœsch
5f4d04d084 x86/lossless_videodsp: silly one-line cosmetic. 2014-01-25 16:24:50 +01:00
Clément Bœsch
5267e85056 x86/lossless_videodsp: use common macro for add and diff int16 loop. 2014-01-25 14:27:37 +01:00
Clément Bœsch
cddbfd2a95 x86/lossless_videodsp: simplify and explicit aligned/unaligned flags 2014-01-25 11:59:43 +01:00
Ronald S. Bultje
c9e6325ed9 vp9/x86: use explicit register for relative stack references.
Before this patch, we explicitly modify rsp, which isn't necessarily
universally acceptable, since the space under the stack pointer might
be modified in things like signal handlers. Therefore, use an explicit
register to hold the stack pointer relative to the bottom of the stack
(i.e. rsp). This will also clear out valgrind errors about the use of
uninitialized data that started occurring after the idct16x16/ssse3
optimizations were first merged.
2014-01-24 19:25:25 -05:00
Ronald S. Bultje
97474d527f vp9/x86: iwht4x4 (lossless) mmx. 2014-01-24 19:25:25 -05:00
Ronald S. Bultje
d43efa68bd vp9/x86: 4x4 iadst SIMD (ssse3) variants.
Cycle measurements for intra itxfm_4x4_add on ped1080p.webm:
idct_idct:    66 -> 67 cycles (noise measurement)
idct_iadst:  199 -> 79 cycles
iadst_idct:  165 -> 70 cycles
iadst_iadst: 183 -> 82 cycles
2014-01-24 19:25:25 -05:00
Ronald S. Bultje
baf47020cd vp9/x86: 8x8 iadst SIMD (ssse3/avx) variants.
Cycle measurements for intra itxfm_8x8_add on ped1080p.webm:
idct_idct:   133 -> 135 cycles (noise measurement)
idct_iadst:  900 -> 241 cycles
iadst_idct:  864 -> 215 cycles
iadst_iadst: 973 -> 310 cycles
2014-01-24 19:25:25 -05:00
Michael Niedermayer
e6d1c66d74 avcodec/x86/lossless_videodsp: disable median optimizations for 16bps
They only support upto 15bps

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-23 01:51:24 +01:00
Michael Niedermayer
eaacfc7dd1 avcodec/lossless_videodsp: Pass AVCodecContext to init
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-23 01:43:00 +01:00
Michael Niedermayer
ef00ef7553 avcodec/x86/lossless_videodsp: port sub_hfyu_median_prediction_int16 to yasm
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-22 23:27:27 +01:00
Michael Niedermayer
fad49aae28 avcodec/x86/lossless_videodsp: Port sub_hfyu_median_prediction_mmxext to int16
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-22 22:55:49 +01:00
Michael Niedermayer
fee97f25fa avcodec/x86/lossless_videodsp: port add_hfyu_median_prediction_mmxext to 16bit
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-22 21:11:40 +01:00
Michael Niedermayer
631939bde6 avcodec/x86/lossless_videodsp: add diff_int16_mmx/sse2
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-22 19:41:21 +01:00
Reimar Döffinger
76421982d0 lossless_videodsp.asm: fix compilation.
Fixes these errors with nasm:
libavcodec/x86/lossless_videodsp.asm:86: error: invalid combination of opcode and operands
libavcodec/x86/lossless_videodsp.asm:88: error: invalid combination of opcode and operands
I don't know whether movd or movq was meant, but either way
maskq vs. maskd must match the mov size.

Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
2014-01-21 19:46:02 +01:00
Michael Niedermayer
83b67ca056 avcodec/x86/lossless_videodsp: Port lorens add_hfyu_left_prediction_ssse3/sse4 to 16bit
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-21 02:55:41 +01:00
Michael Niedermayer
63d2be7533 avcodec/x86/lossless_videodsp: use SPLATW in add_int16
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-21 02:33:20 +01:00
Michael Niedermayer
f70d7eb20c Move add/diff_int16 to lossless_videodsp
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-20 21:32:47 +01:00
Michael Niedermayer
a493f8541d avcodec/x86/dsp: add_int16_mmx / add_int16_sse2
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-20 04:06:46 +01:00
James Almer
26800e3864 vp9/x86: rename ff_avg[48]_sse to ff_avg[48]_mmxext
pavgb is an sse integer instruction, so the mmxext flag is enough

Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-18 17:08:25 +01:00
James Almer
d2a7314f1e vp9/x86: add ff_vp9_loop_filter_[vh]_16_16_sse2().
Similar gains in performance as the SSSE3 version

Signed-off-by: James Almer <jamrial@gmail.com>
2014-01-17 14:16:38 +01:00
Ronald S. Bultje
8173d1ffc0 vp9/x86: 16x16 iadst_idct, idct_iadst and iadst_iadst (ssse3+avx).
Sample timings on ped1080p.webm (of the ssse3 functions):
iadst_idct:  4672 -> 1175 cycles
idct_iadst:  4736 -> 1263 cycles
iadst_iadst: 4924 -> 1438 cycles
Total decoding time changed from 6.565s to 6.413s.
2014-01-16 13:49:31 +01:00
Clément Bœsch
9cc8fa63dd vp9/x86: simplify a few mc inits. 2014-01-16 07:48:27 +01:00
Michael Niedermayer
6391dec82a Merge remote-tracking branch 'qatar/master'
* qatar/master:
  x86: dsputil: Simplify xvmc deprecation conditional

Conflicts:
	libavcodec/x86/dsputil_init.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-15 20:41:08 +01:00
Diego Biurrun
aab40bbfd5 x86: dsputil: Simplify xvmc deprecation conditional 2014-01-15 15:23:46 +01:00
Clément Bœsch
8b4190da93 vp9/x86: add AVX for itxfm and lpf.
4412 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 4193462 runs, 842 skips
3600 decicycles in ff_vp9_loop_filter_h_16_16_avx, 4193621 runs, 683 skips

3010 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 4193528 runs, 776 skips
2678 decicycles in ff_vp9_loop_filter_v_16_16_avx, 4193742 runs, 562 skips

23025 decicycles in ff_vp9_idct_idct_32x32_add_ssse3, 2096871 runs, 281 skips
19943 decicycles in ff_vp9_idct_idct_32x32_add_avx, 2096815 runs, 337 skips

4675 decicycles in ff_vp9_idct_idct_16x16_add_ssse3, 4194018 runs, 286 skips
3980 decicycles in ff_vp9_idct_idct_16x16_add_avx, 4194022 runs, 282 skips

967 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 16776972 runs, 244 skips
887 decicycles in ff_vp9_idct_idct_8x8_add_avx, 16777002 runs, 214 skips
2014-01-15 15:54:03 +01:00
Michael Niedermayer
cb613657ee avcodec/x86/proresdsp_init: x86 prores IDCT is bitexact again
reenable it for for bitexact mode

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-14 15:59:00 +01:00
Michael Niedermayer
b148a39d55 Merge commit '46bacb5cc6169ff5e8e982495c4925467c1d8bb7'
* commit '46bacb5cc6169ff5e8e982495c4925467c1d8bb7':
  x86: Consistently use cpu flag detection macros in places that still miss it

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-14 14:44:59 +01:00
Diego Biurrun
46bacb5cc6 x86: Consistently use cpu flag detection macros in places that still miss it 2014-01-14 00:04:58 +01:00
Clément Bœsch
af68bd1c06 vp9/x86: add ff_vp9_loop_filter_[vh]_16_16_ssse3().
16662 decicycles in loop_filter_h_16_16_c, 8387355 runs, 1253 skips
17510 decicycles in loop_filter_v_16_16_c, 8387516 runs, 1092 skips

4941 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 8387887 runs, 721 skips
3899 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 8387980 runs, 628 skips

Overall decode time goes from:
  ./ffmpeg -v 0 -nostats -threads 1 -i ~/samples/vp9/ped1080p.webm -f null -  8.10s user 0.02s system 99% cpu 8.126 total
to:
  ./ffmpeg -v 0 -nostats -threads 1 -i ~/samples/vp9/ped1080p.webm -f null -  6.15s user 0.04s system 99% cpu 6.199 total

(46 to 61 fps)
2014-01-12 20:20:24 +01:00
Clément Bœsch
e11ceea68f vp9/x86: factor out some code in VP9_UNPACK_MULSUB_2W_4X. 2014-01-12 20:19:00 +01:00
Clément Bœsch
c9aa0b8f70 vp9/x86: remove reg redundancy in VP9_MULSUB_2W_2X. 2014-01-12 20:18:55 +01:00
Clément Bœsch
7c55ee6168 vp9/x86: merge IDCT coef macros. 2014-01-12 20:18:44 +01:00
Michael Niedermayer
92b2404571 Merge commit '4c642d8d98703faf52983243098f35865e15b312'
* commit '4c642d8d98703faf52983243098f35865e15b312':
  x86: hpeldsp: Add missing av_cold attribute to init function

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-09 20:32:53 +01:00
Michael Niedermayer
390452bab6 Merge commit 'b0be1ae792ac8bbfb0fc7b9b9cb39eaf0feb489b'
* commit 'b0be1ae792ac8bbfb0fc7b9b9cb39eaf0feb489b':
  x86: avcodec: Add a bunch of missing #includes for av_cold

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-09 20:24:15 +01:00
Diego Biurrun
4c642d8d98 x86: hpeldsp: Add missing av_cold attribute to init function 2014-01-09 15:09:07 +01:00
Diego Biurrun
b0be1ae792 x86: avcodec: Add a bunch of missing #includes for av_cold 2014-01-09 15:09:07 +01:00
Ronald S. Bultje
c6fe984f2f vp9/x86: make STORE_2X2 macro local.
Prevents this assembler warning:
libavcodec/x86/vp9itxfm.asm:1208: warning: (VP9_IDCT32_1D:309)
redefining multi-line macro `STORE_2X2'

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-08 14:07:15 +01:00
Ronald S. Bultje
04a187fb2a vp9/x86: idct_32x32_add_ssse3 sub-8x8-idct.
Runtime of the full 32x32 idct goes from 2446 to 2441 cycles (intra) or
from 1425 to 1306 cycles (inter). Overall runtime is not significantly
affected.
2014-01-07 20:43:35 -05:00
Ronald S. Bultje
37b001d14d vp9/x86: idct_32x32_add_ssse3 sub-16x16-idct.
Runtime of all IDCTs together goes from 3327 to 2473 cycles (intra, i.e.
~35% faster) or from 2312 to 1448 cycles (inter, i.e. ~60% faster). Total
decode time of ped1080p.webm goes from 8.086sec to 7.974sec (1.4% faster).
2014-01-07 20:43:34 -05:00
Ronald S. Bultje
e84d14df10 vp9/x86: idct_32x32_add_ssse3.
Sub-IDCTs will follow later. ped1080.webm goes from 9.295s to 8.191s
(13.5% faster). The IDCT itself goes from 4372 (intra) or 4337 (inter)
to 403 (intra) or 329 (inter) cycles for the DC-only form, 23755 (intra)
or 23723 (inter) to 3497 (intra) or 3607 (inter) cycles for the no-DC
form, which averages from 23393 (intra) or 16612 (inter) to 3449 (intra)
or 2392 (inter) for all 32x32s together, i.e. about ~7x faster (all
tests done on ped1080p.webm).
2014-01-07 20:43:30 -05:00
Michael Niedermayer
30056fd0be Merge commit 'a03a642d5ceb5f2f7c6ebbf56ff365dfbcdb65eb'
* commit 'a03a642d5ceb5f2f7c6ebbf56ff365dfbcdb65eb':
  h264: do not use 422 functions for monochrome

See: 07abf13da4
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-06 16:51:23 +01:00
Anton Khirnov
a03a642d5c h264: do not use 422 functions for monochrome
Fixes invalid memory access.

Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
CC:libav-stable@libav.org
2014-01-06 08:25:36 +01:00
Ronald S. Bultje
18175baa54 vp9/x86: 16px MC functions (64bit only).
Cycle counts for large MCs (old -> new on ped1080p.webm, mx!=0&&my!=0):
16x8:    876 ->   870  (0.7%)
16x16:  1444 ->  1435  (0.7%)
16x32:  2784 ->  2748  (1.3%)
32x16:  2455 ->  2349  (4.5%)
32x32:  4641 ->  4084 (13.6%)
32x64:  9200 ->  7834 (17.4%)
64x32:  8980 ->  7197 (24.8%)
64x64: 17330 -> 13796 (25.6%)
Total decoding time goes from 9.326sec to 9.182sec.
2013-12-26 21:05:10 -05:00
Ronald S. Bultje
0d9375fc90 vp9/x86: 16x16 sub-IDCT for top-left 8x8 subblock (eob <= 38).
Sub8x8 speed (w/o dc-only case) goes from ~750 cycles (inter) or ~735
cycles (intra) to ~415 cycles (inter) or ~430 cycles (intra). Average
overall 16x16 idct speed goes from ~635 cycles (inter) or ~720 cycles
(intra) to ~415 cycles (inter) or ~545 (intra) - all measurements done
using ped1080p.webm.
2013-12-26 07:40:25 -05:00
Ivan Kalvachev
1c63aed232 Convert XvMC to hwaccel v3
Signed-off-by: Ivan Kalvachev <ikalvachev@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-12-22 22:03:47 +01:00
Michael Niedermayer
ce612fc186 Merge commit 'dfc50ac85e9d68a771b556297b7c411650206f3b'
* commit 'dfc50ac85e9d68a771b556297b7c411650206f3b':
  x86: mpegvideo: move denoise_dct asm to mpegvideoenc

Conflicts:
	libavcodec/x86/mpegvideo.c
	libavcodec/x86/mpegvideoenc.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-12-20 23:44:31 +01:00
Anton Khirnov
dfc50ac85e x86: mpegvideo: move denoise_dct asm to mpegvideoenc
This function is encoding-only.

Signed-off-by: Diego Biurrun <diego@biurrun.de>
2013-12-20 17:16:11 +01:00
Ronald S. Bultje
8d4c616fc0 vp9/x86: idct_add_16x16_ssse3.
Currently only dc-only and full 16x16. Other subforms will follow in the
near future. Total decoding time of ped1080p.webm goes from 9.7 to 9.3
seconds. DC-only goes from 957 -> 131 cycles, and the full IDCT goes
from ~4050 to ~745 cycles.
2013-12-14 12:13:26 -05:00
Michael Niedermayer
8e70fdab36 Merge commit '4958f35a2ebc307049ff2104ffb944f5f457feb3'
* commit '4958f35a2ebc307049ff2104ffb944f5f457feb3':
  dsputil: Move apply_window_int16 to ac3dsp

Conflicts:
	libavcodec/arm/ac3dsp_init_arm.c
	libavcodec/arm/ac3dsp_neon.S
	libavcodec/x86/ac3dsp_init.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-12-09 04:12:40 +01:00
Diego Biurrun
4958f35a2e dsputil: Move apply_window_int16 to ac3dsp
The (optimized) functions are used nowhere else.
2013-12-08 17:57:15 +01:00
Ronald S. Bultje
92436e8ad9 vp9: implement top/left half (4x4) sub-8x8-IDCT.
For that specific case (eob>3&&eob<=12), runtime of idct8x8 goes from
668 to 477 cycles. For all idct8x8, runtime goes from 521 to 490 cycles.
2013-12-07 12:39:36 -05:00
Ronald S. Bultje
b2045c44a9 vp9: split pre-load of 11585x2 out of 1d idct macro.
This allows us to load it only once, instead of twice, in this function.
2013-12-07 12:39:36 -05:00
Ronald S. Bultje
f9a0d4c6e0 vp9: minor refactorings in idct ssse3 assembly.
Make register usage in macros explicit; change mulsub_2w_4x to use 2
instead of 3 temp registers.
2013-12-07 12:39:35 -05:00
Ronald S. Bultje
8729964b99 vp9: split x86 assembly in two files.
(And in future, loopfilter or intra pred could be put in their own
respective files also.)
2013-12-07 12:39:35 -05:00
Michael Niedermayer
5b4d57455d Merge remote-tracking branch 'qatar/master'
* qatar/master:
  x86: Initialize mmxext after amd3dnow optimizations

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-12-05 11:55:41 +01:00
Diego Biurrun
3d7c84747d x86: Initialize mmxext after amd3dnow optimizations
The mmxext optimizations should be at least equally fast if available and
amd3dnow optimizations are being deprecated. Thus the former should
override the latter, not the other way around.
2013-12-04 18:52:48 +01:00
Michael Niedermayer
be2312aa8f Merge remote-tracking branch 'qatar/master'
* qatar/master:
  dsputil: x86: Move ff_inv_zigzag_direct16 table init to mpegvideo

If someone optimizes dct_quantize for non x86 SIMD, then this
probably needs to be reverted.

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-12-02 10:59:48 +01:00
Diego Biurrun
7ffaa19570 dsputil: x86: Move ff_inv_zigzag_direct16 table init to mpegvideo
The table is MMX-specific and used nowhere else.
2013-12-02 04:05:18 +01:00
Michael Niedermayer
3adb825650 Merge commit 'cf7860db608df7c76471d8b61f07abbd5aad8dd5'
* commit 'cf7860db608df7c76471d8b61f07abbd5aad8dd5':
  x86: dsputil: Suppress deprecation warnings for XvMC bits

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-11-28 22:47:37 +01:00
Diego Biurrun
cf7860db60 x86: dsputil: Suppress deprecation warnings for XvMC bits
These parts are scheduled for removal on the next version bump.

Signed-off-by: Vittorio Giovara <vittorio.giovara@gmail.com>
2013-11-28 16:04:30 +01:00
Clément Bœsch
616da59542 avcodec/x86/vp9dsp: merge a few SWAP together. 2013-11-21 23:06:21 +01:00
Clément Bœsch
e0434cfcfc avcodec/x86: remove 3 sub in pred4x4_tm_vp8_8.
before:
  411 decicycles in ff_pred4x4_tm_vp8_8_ssse3, 8388289 runs, 319 skips

after:
  389 decicycles in ff_pred4x4_tm_vp8_8_ssse3, 8388308 runs, 300 skips

Tested on i7 920.
2013-11-17 23:12:35 +01:00
Clément Bœsch
d28c79b003 avcodec/x86/vp9dsp: use EXTERNAL_* macros.
Original fix by one of these developers:
    Anton Khirnov <anton@khirnov.net>
    Diego Biurrun <diego@biurrun.de>
    Luca Barbato <lu_zero@gentoo.org>
    Martin Storsjö <martin@martin.st>

See 97962b2 / 72ca830

Personnal guess is Diego Biurrun.
2013-11-16 17:03:17 +01:00
Michael Niedermayer
91e00c4a78 Merge commit '458446acfa1441d283dacf9e6e545beb083b8bb0'
* commit '458446acfa1441d283dacf9e6e545beb083b8bb0':
  lavc: Edge emulation with dst/src linesize

Conflicts:
	libavcodec/cavs.c
	libavcodec/h264.c
	libavcodec/hevc.c
	libavcodec/mpegvideo_enc.c
	libavcodec/mpegvideo_motion.c
	libavcodec/rv34.c
	libavcodec/svq3.c
	libavcodec/vc1dec.c
	libavcodec/videodsp.h
	libavcodec/videodsp_template.c
	libavcodec/vp3.c
	libavcodec/vp8.c
	libavcodec/wmv2.c
	libavcodec/x86/videodsp.asm
	libavcodec/x86/videodsp_init.c

Changes to the asm are not merged, they are left for volunteers or
in their absence for later.
The changes this merge introduces are reordering of the function
arguments

See: face578d56
Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-11-15 15:07:10 +01:00
Ronald S. Bultje
72ca830f51 lavc: VP9 decoder
Originally written by Ronald S. Bultje <rsbultje@gmail.com> and
Clément Bœsch <u@pkh.me>

Further contributions by:
Anton Khirnov <anton@khirnov.net>
Diego Biurrun <diego@biurrun.de>
Luca Barbato <lu_zero@gentoo.org>
Martin Storsjö <martin@martin.st>

Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
2013-11-15 10:16:28 +01:00
Ronald S. Bultje
458446acfa lavc: Edge emulation with dst/src linesize
Allow supporting files for which the image stride is smaller than
the maximum block size + number of subpel mc taps, e.g. a 64x64 VP9
file or a 16x16 VP8 file with -fflags +emu_edge.
2013-11-15 10:16:27 +01:00
Michael Niedermayer
5231eecdaf Merge remote-tracking branch 'qatar/master'
* qatar/master:
  Deprecate obsolete XvMC hardware decoding support

Conflicts:
	libavcodec/mpeg12.c
	libavcodec/mpeg12dec.c
	libavcodec/mpegvideo.c
	libavcodec/options_table.h
	libavutil/pixdesc.c
	libavutil/version.h

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-11-14 03:26:35 +01:00
Diego Biurrun
19e30a58fc Deprecate obsolete XvMC hardware decoding support
XvMC has long ago been superseded by newer acceleration APIs, such as
VDPAU, and few downstreams still support it. Furthermore XvMC is not
implemented within the hwaccel framework, but requires its own specific
code in the MPEG-1/2 decoder, which is a maintenance burden.
2013-11-13 21:07:45 +01:00
Michael Niedermayer
a30f7918b5 Merge commit '0338c396987c82b41d322630ea9712fe5f9561d6'
* commit '0338c396987c82b41d322630ea9712fe5f9561d6':
  dsputil: Split off H.263 bits into their own H263DSPContext

Conflicts:
	configure
	libavcodec/mpegvideo.h
	libavcodec/mpegvideo_enc.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-11-08 17:42:56 +01:00
Diego Biurrun
0338c39698 dsputil: Split off H.263 bits into their own H263DSPContext 2013-11-08 12:40:47 +01:00
Clément Bœsch
87434cf373 avcodec/vp9: add ff_vp9_idct_idct_{4x4,8x8}_ssse3().
1789 decicycles in idct_idct_4x4_add_c, 262136 runs, 8 skips
1839 decicycles in idct_idct_4x4_add_c, 524270 runs, 18 skips
1864 decicycles in idct_idct_4x4_add_c, 1048548 runs, 28 skips

529 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 262138 runs, 6 skips
516 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 524282 runs, 6 skips
474 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 1048565 runs, 11 skips

(~3.9x faster)

7726 decicycles in idct_idct_8x8_add_c, 1048433 runs, 143 skips
7732 decicycles in idct_idct_8x8_add_c, 2096882 runs, 270 skips
7731 decicycles in idct_idct_8x8_add_c, 4193772 runs, 532 skips

1145 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 1048549 runs, 27 skips
1137 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 2097097 runs, 55 skips
1086 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 4194188 runs, 116 skips

(~7.1x faster)

Overall decode time before commit:
  16.48s user 0.03s system 99% cpu 16.526 total
  16.54s user 0.01s system 99% cpu 16.566 total
  16.46s user 0.03s system 99% cpu 16.511 total

Overall decode time after commit:
  16.34s user 0.02s system 99% cpu 16.378 total
  16.28s user 0.02s system 99% cpu 16.315 total
  16.32s user 0.03s system 99% cpu 16.366 total

Tested on i7 920 with 40s 1080p footage.
2013-11-05 19:25:40 +01:00
Michael Niedermayer
934e489ee8 Merge commit 'e2b5b097898c9155f4bdff4d83cdc54d5eef6930'
* commit 'e2b5b097898c9155f4bdff4d83cdc54d5eef6930':
  x86: rv40dsp: Use PAVGB instruction macro where appropriate

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-11-05 10:26:07 +01:00
Diego Biurrun
e2b5b09789 x86: rv40dsp: Use PAVGB instruction macro where appropriate 2013-11-04 21:14:39 +01:00
Mikulas Patocka
694d997afe x86: hpeldsp: Use PAVGB instruction macro where necessary
Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Diego Biurrun <diego@biurrun.de>
2013-11-04 01:29:23 +01:00
Mikulas Patocka
074155360d avcodec/x86/hpeldsp: fix crash on AMD K6-3+
There are instructions pavgb and pavgusb. Both instructions do the same
operation but they have different enconding. Pavgb exists in SSE (or
MMXEXT) instruction set and pavgusb exists in 3D-NOW instruction set.

livavcodec uses the macro PAVGB to select the proper instruction. However,
the function avg_pixels8_xy2 doesn't use this macro, it uses pavgb
directly.

As a consequence, the function avg_pixels8_xy2 crashes on AMD K6-2 and
K6-3 processors, because they have pavgusb, but not pavgb.

This bug seems to be introduced by commit
71155d7b41, "dsputil: x86: Convert mpeg4
qpel and dsputil avg to yasm"

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-11-03 19:49:11 +01:00
Michael Niedermayer
7146eacfc5 Merge commit '1700b4e678ed329611a16b20d11e64b7abda4839'
* commit '1700b4e678ed329611a16b20d11e64b7abda4839':
  x86: vp8dsp: Split loopfilter code into a separate file

Conflicts:
	libavcodec/x86/Makefile

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-11-02 10:13:14 +01:00
Diego Biurrun
1700b4e678 x86: vp8dsp: Split loopfilter code into a separate file 2013-11-01 22:05:20 +01:00
Michael Niedermayer
fa6fa2162b avcodec/cabac: support UNCHECKED_BITSTREAM_READER = 0
Fixes overreads in HEVC
Fixes Ticket3070
Also fixed remaining issues from Ticket3075 and Ticket3076

Some lines of code taken from  0c5f839693da2276c2da23400f67a67be4ea0af1:libavcodec/x86/cabac.h
and                            0c5f839693da2276c2da23400f67a67be4ea0af1:libavcodec/cabac_functions.h

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-10-31 11:13:27 +01:00
Ronald S. Bultje
960490c0b2 avcodec/x86/videodsp: Small speedups in ff_emulated_edge_mc x86 SIMD.
Don't use word-size multiplications if size == 2, and if we're using
SIMD instructions (size >= 8), complete leftover 4byte sets using movd,
not mov. Both of these changes lead to minor speedups.

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-10-27 15:02:48 +01:00
Ronald S. Bultje
cd86eb265f avcodec/x86/videodsp: fix a bug in a %if statement where we used '%%' instead of '&&'.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-10-27 15:02:48 +01:00
Michael Niedermayer
41efb8d9a7 avcodec/x86/cabac: include get_cabac_bypass_sign_x86() under #if !BROKEN_COMPILER
this might fix Ticket2999 as well as some fate clients
untested as the original patch submitter no longer has the environment to test
this should be reverted if it does not fix the issues

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-10-26 15:06:55 +02:00
Ronald S. Bultje
1b3a7e1f42 avcodec/x86/videodsp: Properly mark sse2 instructions in emulated_edge_mc x86 simd as such.
Should fix crashes or corrupt output on pre-SSE2 CPUs when they were
using SSE2-code (e.g. AMD Athlon XP 2400+ or Intel Pentium III) in
hfix or hvar single-edge (left/right) extension functions.

Tested-by: Ingo Brückl <ib@wupperonline.de>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-10-24 13:36:55 +02:00
Michael Niedermayer
c35d29a9c8 avcodec/x86/dsputil_init: move ff_idct_xvid_mmxext init
This decreases the diff to libav

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-10-15 02:06:12 +02:00
Michael Niedermayer
ab8cbfe0dd avcodec/x86/dsputil_init: remove duplicated sse2 idct init
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-10-15 01:59:36 +02:00
Michael Niedermayer
1bf8fa75ee avcodec/x86/dsputil_init: fix cpu flag checks
Fixes linking failure with --disable-sse2

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-10-15 01:46:21 +02:00
Ronald S. Bultje
20d78a8606 libavcodec/x86: Fix emulated_edge_mc SSE code to not contain SSE2 instructions on x86-32.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-10-10 13:36:06 +02:00
Ronald S. Bultje
ad75d2b590 x86: Fix compilation with nasm on PPC & OS/2
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-10-08 12:36:19 +02:00
Michael Niedermayer
deb5addcff Merge remote-tracking branch 'qatar/master'
* qatar/master:
  x86: h264_idct: Update comments to match 8/10-bit depth optimization split

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-10-08 12:10:02 +02:00