578 Commits

Author SHA1 Message Date
Ronald S. Bultje
b4188f0d46 vp8: convert simple loopfilter x86 assembly to use named arguments. 2012-03-03 20:40:00 -08:00
Ronald S. Bultje
8476ca3b4e vp8: convert idct x86 assembly to use named arguments. 2012-03-03 20:40:00 -08:00
Ronald S. Bultje
21ffc78fd7 vp8: convert mc x86 assembly to use named arguments. 2012-03-03 20:40:00 -08:00
Ronald S. Bultje
28170f1a39 vp8: convert loopfilter x86 assembly to use cpuflags(). 2012-03-03 20:40:00 -08:00
Ronald S. Bultje
e25be47154 vp8: convert idct/mc x86 assembly to use cpuflags(). 2012-03-03 20:39:59 -08:00
Ronald S. Bultje
291c9b6285 h264: change underread for 10bit QPEL to overread.
This prevents us from reading before the start of the buffer, and thus
prevents crashes resulting from this behaviour. Fixes bug 237.
2012-03-02 10:33:05 -08:00
Ronald S. Bultje
45549339bc vp8: disable mmx functions with sse/sse2 counterparts on x86-64.
x86-64 is guaranteed to have at least SSE2, therefore the MMX/MMX2
functions will never be used in practice.
2012-03-02 10:32:05 -08:00
Ronald S. Bultje
bd66f073fe vp8: change int stride to ptrdiff_t stride.
On 64bit platforms with 32bit int, this means we won't have to sign-
extend the integer anymore.
2012-03-02 10:31:50 -08:00
Ronald S. Bultje
b0c4f04338 h264: fix mmxext chroma deblock to use correct TC values. 2012-02-27 09:38:44 -08:00
Christophe GISQUET
2784d18791 SBR DSP x86: implement SSE sbr_hf_g_filt
Unrolling the main loop to process, instead of 4 elements:
- 8: minor gain of 2 cycles (not worth the extra object size)
- 2: loss of 8 cycles.

Assigning STEP to a register is a loss. Output address (Y) is almost always
unaligned.

Timings:
- C (32/64 bits): 117/109 cycles
- SSE: 57 cycles

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-02-23 15:50:09 -08:00
Christophe GISQUET
34454c761f SBR DSP x86: implement SSE sbr_sum_square_sse
The 32bits targets have been compiled with -mfpmath=sse for proper reference.
sbr_sum_square C  /32bits: 82c (unrolled)/102c
               C  /64bits: 69c (unrolled)/82c
               SSE/32bits: 42c
               SSE/64bits: 31c

Use of SSE4.1 dpps to perform the final sum is slower.
Not unrolling to perform 8 operations in a loop yields 10 more cycles.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-02-23 15:50:06 -08:00
Ronald S. Bultje
3ab9a2a557 rv34: change most "int stride" into "ptrdiff_t stride".
This prevents having to sign-extend on 64-bit systems with 32-bit ints,
such as x86-64. Also fixes crashes on systems where we don't do it and
arguments are not in registers, such as Win64 for all weight functions.
2012-02-20 14:58:25 -08:00
Ronald S. Bultje
8fb26950ed h264: don't use redzone in loopfilter on win64.
Red zone usage is not allowed in the Win64 ABI.
2012-02-19 15:31:03 -08:00
Christophe GISQUET
f3e084909b mpegaudio: replace memcpy by SIMD code
By replacing memcpy with an unrolled loop using the alignment knowledge
it has, some speedup can be obtained.

Before (gcc 4.6.1): ~400 cycles
After: ~370 cycles

Overall, around 2% speed increase when decoding a 2400s mp3 to f32le.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-02-15 20:11:54 -08:00
Martin Storsjö
efd29844eb mpegvideo: Add ff_ prefix to nonstatic functions
Signed-off-by: Martin Storsjö <martin@martin.st>
2012-02-15 22:07:23 +02:00
Martin Storsjö
873c89e2a6 dsputil: Add ff_ prefix to inv_zigzag_direct16
Signed-off-by: Martin Storsjö <martin@martin.st>
2012-02-15 22:06:42 +02:00
Martin Storsjö
9cf0841ef3 dsputil: Add ff_ prefix to the dsputil*_init* functions
Signed-off-by: Martin Storsjö <martin@martin.st>
2012-02-15 22:06:34 +02:00
Justin Ruggles
d483bb58c3 ac3dsp: do not use pshufb in ac3_extract_exponents_ssse3()
We need to do unsigned saturation in order to cover the corner case when the
absolute coefficient value is 16777215 (the maximum value).

Fixes Bug #216
2012-02-09 21:04:44 -05:00
Diego Biurrun
0bba26466f cosmetics: Delete empty lines at end of file. 2012-02-09 12:26:45 +01:00
Ronald S. Bultje
ce1e250ee9 h264: manually save/restore XMM registers for functions using INIT_MMX.
On Win64, these registers are callee-save, so not saving/restoring them
correctly is a violation of ABI and can lead to crashes or corrupt data.
2012-02-08 10:31:14 -08:00
Ronald S. Bultje
4ff6dea390 pngdsp: swap argument inversion. 2012-02-07 14:32:26 -08:00
Michael Kostylev
3206cccc0e h264: mark h264_idct_add8_10 with number of XMM registers.
This fixes XMM register clobber problems on Win64.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-02-07 11:37:13 -08:00
Ronald S. Bultje
7e4d9d5d45 win64: add a XMM clobber test configure option.
This will be useful to test more aggressively for failures to mark XMM
registers as clobbered in Win64 builds, and prevent regressions thereof.

Based on a patch by Ramiro Polla <ramiro.polla@gmail.com>
2012-02-02 12:00:48 -08:00
Justin Ruggles
236a550c3f Fix a typo in the x86 asm version of ff_vector_clip_int32()
Specifies the correct number of xmm registers used so that they can be saved
and restored on Win64 if necessary.
2012-02-01 19:02:32 -05:00
Christophe Gisquet
e5c9de2ab7 rv40: x86 SIMD for biweight
Provide MMX, SSE2 and SSSE3 versions, with a fast-path when the weights are
multiples of 512 (which is often the case when the values round up nicely).

*_TIMER report for the 16x16 and 8x8 cases:
C:
9015 decicycles in 16, 524257 runs, 31 skips
2656 decicycles in 8, 524271 runs, 17 skips
MMX:
4156 decicycles in 16, 262090 runs, 54 skips
1206 decicycles in 8, 262131 runs, 13 skips
MMX on fast-path:
2760 decicycles in 16, 524222 runs, 66 skips
995 decicycles in 8, 524252 runs, 36 skips
SSE2:
2163 decicycles in 16, 262131 runs, 13 skips
832 decicycles in 8, 262137 runs, 7 skips
SSE2 with fast path:
1783 decicycles in 16, 524276 runs, 12 skips
711 decicycles in 8, 524283 runs, 5 skips
SSSE3:
2117 decicycles in 16, 262136 runs, 8 skips
814 decicycles in 8, 262143 runs, 1 skips
SSSE3 with fast path:
1315 decicycles in 16, 524285 runs, 3 skips
578 decicycles in 8, 524286 runs, 2 skips

This means around a 4% speedup for some sequences.

Signed-off-by: Diego Biurrun <diego@biurrun.de>
2012-01-30 23:58:25 +01:00
Diego Biurrun
91bafb52ae x86: Give RV40 init file a more suitable name. 2012-01-30 23:58:24 +01:00
Diego Biurrun
c30b198381 x86: Place mm_flags variable declaration below the appropriate #ifdef.
This fixes some unused variable warnings with YASM disabled.
2012-01-30 23:58:23 +01:00
Christophe Gisquet
6b03900382 x86 dsputil: provide SSE2/SSSE3 versions of bswap_buf
While pshufb allows emulating bswap on XMM registers for SSSE3, more
shuffling is needed for SSE2. Alignment is critical, so specific codepaths
are provided for this case.

For the huffyuv sequence "angels_480-huffyuvcompress.avi":
C (using bswap instruction): ~ 55k cycles
SSE2:                        ~ 40k cycles
SSSE3 using unaligned loads: ~ 35k cycles
SSSE3 using aligned loads:   ~ 30k cycles

Signed-off-by: Diego Biurrun <diego@biurrun.de>
2012-01-30 10:19:55 +01:00
Ronald S. Bultje
af79a0c48a png: add support for bpp>4 to paeth x86 SIMD code.
This fixes playback of e.g. RGB48 (bpp=6) content on x86 CPUs. Fixes
bug 214.
2012-01-29 21:22:50 -08:00
Ronald S. Bultje
f91c4b7824 png: add SSE2 version for add_bytes_l2. 2012-01-29 18:52:17 -08:00
Ronald S. Bultje
59f474b49d png: convert DSP functions to yasm. 2012-01-29 18:47:50 -08:00
Ronald S. Bultje
20a7d3178f png: add missing #if HAVE_SSSE3 around function pointer assignment. 2012-01-29 12:31:59 -08:00
Ronald S. Bultje
331e7c4cb3 imdct36: mark SSE functions as using all 16 XMM registers.
On x86-64, it indeed uses all 16 registers (and on x86-32, this gets
clipped to 8). Not marking it properly causes callers of this function
to fail randomly because of XMM register clobbering.
2012-01-29 08:14:05 -08:00
Ronald S. Bultje
e92003514d png: move DSP functions to their own DSP context. 2012-01-29 08:11:18 -08:00
Ronald S. Bultje
3b15a6d742 config.asm: change %ifdef directives to %if directives.
This allows combining multiple conditionals in a single statement.
2012-01-27 10:19:57 +08:00
Ronald S. Bultje
c3af52fa8b dsputil: use vertical component for drawing bottom edge.
Current code only writes 8 pixels of vertical edge for YUV422, which
causes MC artifacts when subsequent frames use data from that edge.
2012-01-25 18:06:36 +08:00
Christophe GISQUET
9ba9c34024 rv34: 1-pass inter MB reconstruction
Implement 1-pass inverse transform and reconstruction for inter blocks.
2012-01-16 19:26:41 +01:00
Christophe GISQUET
d78062386e rv34: Intra 16x16 handling
Extract processing of intra 16x16 blocks from intra macroblock
processing.
Also implement a function performing inverse transform and block
reconstruction for DC-only blocks in 1 pass instead of 2.
2012-01-16 00:41:51 +01:00
Christophe GISQUET
3faa303a47 rv34: DC-only inverse transform
When decoding coefficients, detect whether the block is DC-only, and take
advantage of this knowledge to perform DC-only inverse transform.

This is achieved by:
- first, changing the 108x4 element modulo_three_table into a 108 element
  table (kind of base4), and accessing each value using mask and shifts.
- then, checking low bits for 0 (as they represent the presence of higher
  frequency coefficients)

Also provide x86 SIMD code for the DC-only inverse transform.

Signed-off-by: Kostya Shishkov <kostya.shishkov@gmail.com>
2012-01-12 09:52:33 +01:00
Henrik Gramner
e7d02b04dc fft: init functions with INIT_XMM/YMM.
This is required to handle clobbering of XMM registers on Win64
correctly. Fixes FFT and all tests depending on FFT on Win64.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
2012-01-11 20:12:26 +01:00
Vitor Sessak
39df0c434c mpegaudiodec: optimized iMDCT transform
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-01-08 17:40:55 -08:00
Martin Storsjö
676a9ee1d2 x86: Fix constraints for decode_significance*_x86
Originally, prior to 8742a4ff8, the caller code was compiled
within this condition:

ARCH_X86 && HAVE_7REGS && HAVE_EBX_AVAILABLE && !defined(BROKEN_RELOCATIONS)

Since HAVE_7REGS is defined as
(ARCH_X86_64 || (HAVE_EBX_AVAILABLE && HAVE_EBP_AVAILABLE))
the subcondition HAVE_7REGS && HAVE_EBX_AVAILABLE is equal
to HAVE_7REGS (for 32 bit at least). The correct simplification
of the original condition thus is HAVE_7REGS, not
HAVE_EBX_AVAILABLE.

This fixes compilation in some cases where HAVE_EBP_AVAILABLE = 0
and HAVE_EBX_AVAILABLE = 1.

Signed-off-by: Martin Storsjö <martin@martin.st>
2011-12-27 09:05:14 +02:00
Diego Biurrun
6fdb2ce34a x86: Tighten register constraints for decode_significance*_x86.
On 32-bit OS X with gcc 4.0/4.2 and shared libraries enabled, the ebx register
is not available, but required to assemble the functions.

This reverts commit 8742a4f to a simplified version of the original constraints.
2011-12-21 12:06:37 +01:00
Diego Biurrun
30bbd5cbc0 x86: conditionally compile dnxhd encoder optimizations 2011-12-19 13:54:10 +01:00
Diego Biurrun
88b9735753 build: conditionally compile x86 H.264 chroma optimizations 2011-12-14 11:58:45 +01:00
Martin Storsjö
f1dba9e498 x86: Require 7 registers for the cabac asm
The change in 599b4c6ef didn't turn out to work properly on
i386 on OS X, where it broke building with PIC enabled.

Signed-off-by: Martin Storsjö <martin@martin.st>
2011-12-12 15:36:20 +02:00
Mans Rullgard
599b4c6efd x86: cabac: replace explicit memory references with "m" operands
This replaces the explicit offset(reg) memory references with
"m" operands for the same locations.  As a result, one fewer
register operand is needed for these inline asm statements.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2011-12-11 22:29:22 +00:00
Diego Biurrun
da9cea77e3 Fix a bunch of common typos. 2011-12-11 00:32:25 +01:00
Justin Ruggles
0e8fdd41c2 dsputil: use cpuflags in x86 emu_edge_core
avoids passing around the extra argument among all the macros it uses
2011-11-22 15:40:51 -05:00
Justin Ruggles
395f2e70dd dsputil: use movups instead of movdqu in ff_emu_edge_core_sse()
This allows emulated_edge_mc_sse() and gmc_sse() to be used under
AV_CPU_FLAG_SSE.
2011-11-22 15:40:51 -05:00