This removes all references to AVCodecContext.dsp_mask and marks
it for eviction at the next version bump. It has been superseded
by av_set_cpu_flag_mask() which, unlike this field, works everywhere.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Recent register allocation changes (x86inc.asm update) changed the
register order and thus opcodes for the inner loops. One of them became
>128bytes, which confuses other parts of this function where it jumps
to fixed-offset positions to extend the edge by fixed amounts. A simple
register change fixes this.
Add support for all x86-64 registers
Prefer caller-saved register over callee-saved on WIN64
Support up to 15 function arguments
Also (by Ronald S. Bultje)
Fix up our asm to work with new x86inc.asm.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Justin Ruggles <justin.ruggles@gmail.com>
Quite often, the original weights are multiple of 512. By prescaling them
by 1/512 when they are computed (once per frame), no intermediate shifting
is needed, and no prescaling on each call either.
The x86 code already used that trick.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Prevents a signflip in the counter, and a subsequent crash because of
overreads/overwrites.
Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
CC: libav-stable@libav.org
Since the values are floats, using the float operations
makes sense, improves performance on some CPUs and
makes the code SSE compatible instead of needing SSE2.
Based on suggestion by Jason.
Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
There is only one caller, which does not need the shifting. Other use cases
are situations where different roundings would be needed.
The x86 and neon versions are modified accordingly.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
movq from SSE register _to_ memory is an SSE2 instruction.
Use the SSE movlps function instead that does the same thing.
Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
This splits ff_dsputil_init_mmx() into multiple functions, one for
each MMX/SSE level, somewhat simplifying the nested conditions.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Signed-off-by: Diego Biurrun <diego@biurrun.de>
Unrolling the main loop to process, instead of 4 elements:
- 8: minor gain of 2 cycles (not worth the extra object size)
- 2: loss of 8 cycles.
Assigning STEP to a register is a loss. Output address (Y) is almost always
unaligned.
Timings:
- C (32/64 bits): 117/109 cycles
- SSE: 57 cycles
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
The 32bits targets have been compiled with -mfpmath=sse for proper reference.
sbr_sum_square C /32bits: 82c (unrolled)/102c
C /64bits: 69c (unrolled)/82c
SSE/32bits: 42c
SSE/64bits: 31c
Use of SSE4.1 dpps to perform the final sum is slower.
Not unrolling to perform 8 operations in a loop yields 10 more cycles.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
This prevents having to sign-extend on 64-bit systems with 32-bit ints,
such as x86-64. Also fixes crashes on systems where we don't do it and
arguments are not in registers, such as Win64 for all weight functions.
By replacing memcpy with an unrolled loop using the alignment knowledge
it has, some speedup can be obtained.
Before (gcc 4.6.1): ~400 cycles
After: ~370 cycles
Overall, around 2% speed increase when decoding a 2400s mp3 to f32le.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
We need to do unsigned saturation in order to cover the corner case when the
absolute coefficient value is 16777215 (the maximum value).
Fixes Bug #216
This will be useful to test more aggressively for failures to mark XMM
registers as clobbered in Win64 builds, and prevent regressions thereof.
Based on a patch by Ramiro Polla <ramiro.polla@gmail.com>
Provide MMX, SSE2 and SSSE3 versions, with a fast-path when the weights are
multiples of 512 (which is often the case when the values round up nicely).
*_TIMER report for the 16x16 and 8x8 cases:
C:
9015 decicycles in 16, 524257 runs, 31 skips
2656 decicycles in 8, 524271 runs, 17 skips
MMX:
4156 decicycles in 16, 262090 runs, 54 skips
1206 decicycles in 8, 262131 runs, 13 skips
MMX on fast-path:
2760 decicycles in 16, 524222 runs, 66 skips
995 decicycles in 8, 524252 runs, 36 skips
SSE2:
2163 decicycles in 16, 262131 runs, 13 skips
832 decicycles in 8, 262137 runs, 7 skips
SSE2 with fast path:
1783 decicycles in 16, 524276 runs, 12 skips
711 decicycles in 8, 524283 runs, 5 skips
SSSE3:
2117 decicycles in 16, 262136 runs, 8 skips
814 decicycles in 8, 262143 runs, 1 skips
SSSE3 with fast path:
1315 decicycles in 16, 524285 runs, 3 skips
578 decicycles in 8, 524286 runs, 2 skips
This means around a 4% speedup for some sequences.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
While pshufb allows emulating bswap on XMM registers for SSSE3, more
shuffling is needed for SSE2. Alignment is critical, so specific codepaths
are provided for this case.
For the huffyuv sequence "angels_480-huffyuvcompress.avi":
C (using bswap instruction): ~ 55k cycles
SSE2: ~ 40k cycles
SSSE3 using unaligned loads: ~ 35k cycles
SSSE3 using aligned loads: ~ 30k cycles
Signed-off-by: Diego Biurrun <diego@biurrun.de>
On x86-64, it indeed uses all 16 registers (and on x86-32, this gets
clipped to 8). Not marking it properly causes callers of this function
to fail randomly because of XMM register clobbering.
Extract processing of intra 16x16 blocks from intra macroblock
processing.
Also implement a function performing inverse transform and block
reconstruction for DC-only blocks in 1 pass instead of 2.
When decoding coefficients, detect whether the block is DC-only, and take
advantage of this knowledge to perform DC-only inverse transform.
This is achieved by:
- first, changing the 108x4 element modulo_three_table into a 108 element
table (kind of base4), and accessing each value using mask and shifts.
- then, checking low bits for 0 (as they represent the presence of higher
frequency coefficients)
Also provide x86 SIMD code for the DC-only inverse transform.
Signed-off-by: Kostya Shishkov <kostya.shishkov@gmail.com>
This is required to handle clobbering of XMM registers on Win64
correctly. Fixes FFT and all tests depending on FFT on Win64.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
Originally, prior to 8742a4ff8, the caller code was compiled
within this condition:
ARCH_X86 && HAVE_7REGS && HAVE_EBX_AVAILABLE && !defined(BROKEN_RELOCATIONS)
Since HAVE_7REGS is defined as
(ARCH_X86_64 || (HAVE_EBX_AVAILABLE && HAVE_EBP_AVAILABLE))
the subcondition HAVE_7REGS && HAVE_EBX_AVAILABLE is equal
to HAVE_7REGS (for 32 bit at least). The correct simplification
of the original condition thus is HAVE_7REGS, not
HAVE_EBX_AVAILABLE.
This fixes compilation in some cases where HAVE_EBP_AVAILABLE = 0
and HAVE_EBX_AVAILABLE = 1.
Signed-off-by: Martin Storsjö <martin@martin.st>
On 32-bit OS X with gcc 4.0/4.2 and shared libraries enabled, the ebx register
is not available, but required to assemble the functions.
This reverts commit 8742a4f to a simplified version of the original constraints.
The change in 599b4c6ef didn't turn out to work properly on
i386 on OS X, where it broke building with PIC enabled.
Signed-off-by: Martin Storsjö <martin@martin.st>
This replaces the explicit offset(reg) memory references with
"m" operands for the same locations. As a result, one fewer
register operand is needed for these inline asm statements.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Inspection of compiled code shows gcc handles these fine on its own.
Benchmarking also shows no measurable speed difference.
Removing the remaining cases in get_cabac_bypass_sign_x86() does
cause more substantial changes to the compiled code with uncertain
impact.
Signed-off-by: Mans Rullgard <mans@mansr.com>
The upcoming gcc 4.7 has more advanced constant propagation
resulting some inline asm operands becoming constants and thus
emitted as literals, sometimes in contexts where this results
in invalid instructions.
This patch changes the constraints of the relevant operands
to "rm" thus forcing a valid type. While obviously suboptimal,
this is what older gcc versions already did, and there is no
change to the code generated with these.
Signed-off-by: Mans Rullgard <mans@mansr.com>
This macro can cause problems in conjunction with the bitdepth
template expansion. It was presumably added to keep source
compatibility when high bitdepth support was added. However,
emulated_edge_mc is a dsputil pointer and should not be called
directly, so there is little reason to keep such a macro.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Some operands need to be accessed in byte mode, which restricts the
available registers in 32-bit mode. Using the 'q' constraint selects
a suitable register.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Parts are inspired from the 8-bit H.264 predict code in Libav.
Other parts ported from x264 with relicensing permission from author.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
Fixes regression in 836f47d34b in ICC-10.x,
since ICC<=11.0 doesn't align stack upon function calls.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Ports the majority of IDCT functions for 10-bit H.264.
Parts are inspired from 8-bit IDCT code in Libav; other parts ported from x264 with relicensing permission from author.
Signed-off-by: Ronald S. Bultje <rbultje@google.com>
This separation allows these functions to be used in a cleaner
fashion from other codecs (e.g. qdm2) and simplifies creating
optimised versions of them.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Arguments for variable size instructions are added to many macros, along
with other various changes. The x86util.asm code was ported from x264.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
This patch lets e.g. dsputil_init chose dsp functions with respect to
the bit depth to decode. The naming scheme of bit depth dependent
functions is <base name>_<bit depth>[_<prefix>] (i.e. the old
clear_blocks_c is now named clear_blocks_8_c).
Note: Some of the functions for high bit depth is not dependent on the
bit depth, but only on the pixel size. This leaves some room for
optimizing binary size.
Preparatory patch for high bit depth h264 decoding support.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
If the function is not inlined, an immmediate cannot be used for the
shift parameter, so the %cl register must be used instead in that case.
This fixes compilation for x86-32 using gcc with --disable-optimizations.
This fixes unexpected name collisions that were occurring with variables
declared within the macros.
It also fixes the fate-acodec-ac3_fixed regression test on x86-32.
AC3DSPContext.ac3_max_msb_abs_int16() finds the maximum MSB of the absolute
value of each element in an array of int16_t.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Fix emu_edge_v_extend_15 to be <128 bytes on Win64, by being more strict
on the size of registers and which registers are being used for operations
where multiple are available. This fixes segfaults in emulated_edge()
function calls on Win64.
This will be beneficial for use with the audio conversion API without
requiring it to depend on all of dsputil.
Signed-off-by: Mans Rullgard <mans@mansr.com>
The original functions did not work correctly for edge pixels, e.g.
when CODEC_FLAG_EMU_EDGE is set, leading to corrupt output in e.g. VLC.
Based on a patch by Daniel Kang <daniel d kang gmail com>.
Signed-off-by: Ronald S. Bultje <rsbultje gmail com>
About 2.5x the speed.
NOTE: the way that the asm code handles large qmuls is a bit suboptimal.
If x264-style dequant was used (separate shift and qmul values), it might
be possible to get some extra speed.
Originally committed as revision 26336 to svn://svn.ffmpeg.org/ffmpeg/trunk
Jason, Loren, Holger) to FFmpeg. Patch by Daniel Kang <daniel dot d dot kang
at gmail com>, as part of Google's GCI 2010.
Originally committed as revision 26162 to svn://svn.ffmpeg.org/ffmpeg/trunk
Jason, Loren, Holger) to FFmpeg. Patch by Daniel Kang <daniel dot d dot kang
at gmail com>, as part of Google's GCI 2010.
Originally committed as revision 26159 to svn://svn.ffmpeg.org/ffmpeg/trunk
Jason, Loren, Holger) to FFmpeg. Patch by Daniel Kang <daniel dot d dot kang
at gmail com>, as part of Google's GCI 2010.
Originally committed as revision 26158 to svn://svn.ffmpeg.org/ffmpeg/trunk
(authors:Jason, Loren, Holger) to FFmpeg. Patch by Daniel Kang <daniel dot
d dot kang at gmail com>, as part of Google's GCI 2010.
Originally committed as revision 26157 to svn://svn.ffmpeg.org/ffmpeg/trunk
Jason, Loren, Holger) to FFmpeg. Patch by Daniel Kang <daniel dot d dot kang
at gmail com>, as part of Google's GCI 2010.
Originally committed as revision 26156 to svn://svn.ffmpeg.org/ffmpeg/trunk
Jason, Loren, Holger) to FFmpeg. Patch by Daniel Kang <daniel dot d dot kang
at gmail com>, as part of Google's GCI 2010.
Originally committed as revision 26155 to svn://svn.ffmpeg.org/ffmpeg/trunk
(authors: Jason, Loren, Holger) to FFmpeg. Patch by Daniel Kang <daniel dot
d dot kang at gmail com>, as part of Google's GCI 2010.
Originally committed as revision 26151 to svn://svn.ffmpeg.org/ffmpeg/trunk
(authors: Jason, Loren, Holger) to FFmpeg. Patch by Daniel Kang <daniel dot
d dot kang at gmail com>, as part of Google's GCI 2010.
Originally committed as revision 26150 to svn://svn.ffmpeg.org/ffmpeg/trunk
(authors: Jason, Loren, Holger) to FFmpeg. Patch by Daniel Kang <daniel dot
d dot kang at gmail com>, as part of Google's GCI 2010.
Originally committed as revision 26149 to svn://svn.ffmpeg.org/ffmpeg/trunk
(authors: Jason, Loren, Holger) to FFmpeg. Patch by Daniel Kang <daniel dot
d dot kang at gmail com>, as part of Google's GCI 2010.
Originally committed as revision 26148 to svn://svn.ffmpeg.org/ffmpeg/trunk
(authors: Jason, Loren, Holger) to FFmpeg. Patch by Daniel Kang <daniel dot
d dot kang at gmail com>, as part of Google's GCI 2010.
Originally committed as revision 26147 to svn://svn.ffmpeg.org/ffmpeg/trunk
(authors: Jason, Loren, Holger) to FFmpeg. Patch by Daniel Kang <daniel dot
d dot kang at gmail com>, as part of Google's GCI 2010.
Originally committed as revision 26146 to svn://svn.ffmpeg.org/ffmpeg/trunk
(authors: Jason, Loren, Holger) to FFmpeg. Patch by Daniel Kang <daniel dot
d dot kang at gmail com>, as part of Google's GCI 2010.
Originally committed as revision 26145 to svn://svn.ffmpeg.org/ffmpeg/trunk
Jason, Loren, Holger) to FFmpeg. Patch by Daniel Kang <daniel dot d dot kang
at gmail com>, as part of Google's GCI 2010.
Originally committed as revision 26143 to svn://svn.ffmpeg.org/ffmpeg/trunk
Jason, Loren, Holger) to FFmpeg. Patch by Daniel Kang <daniel dot d dot kang at
gmail com>, as part of Google's GCI 2010.
Originally committed as revision 26142 to svn://svn.ffmpeg.org/ffmpeg/trunk
FFmpeg. Original authors: Holger Lubitz <holger lubitz org>, Jason Garrett-
Glaser <darkshikari gmail com> (approves LGPL relicensing for this code) and
Loren Merritt <lorenm at u dot washington dot edu> (approves LGPL relicensing
for this code). Patch by Daniel Kang <daniel dot d dot kang at gmail com>, as
part of Google's GCI 2010.
Originally committed as revision 26140 to svn://svn.ffmpeg.org/ffmpeg/trunk
FFmpeg. Original authors: Holger Lubitz <holger lubitz org>, Jason Garrett-
Glaser <darkshikari gmail com> (approves LGPL relicensing for this code) and
Loren Merritt <lorenm at u dot washington dot edu> (approves LGPL relicensing
for this code). Patch by Daniel Kang <daniel dot d dot kang at gmail com>, as
part of Google's GCI 2010.
Originally committed as revision 26139 to svn://svn.ffmpeg.org/ffmpeg/trunk
Original authors: Holger Lubitz <holger lubitz org>, Jason Garrett-Glaser
<darkshikari gmail com> (approves LGPL relicensing for this code) and Loren
Merritt <lorenm at u dot washington dot edu> (approves LGPL relicensing for
this code). Patch by Daniel Kang <daniel dot d dot kang at gmail com>, as
part of Google's GCI 2010.
Originally committed as revision 26138 to svn://svn.ffmpeg.org/ffmpeg/trunk
Original authors: Holger Lubitz <holger lubitz org>, Jason Garrett-Glaser
<darkshikari gmail com> (approves LGPL relicensing for this code) and Loren
Merritt <lorenm at u dot washington dot edu> (approves LGPL relicensing for
this code). Patch by Daniel Kang <daniel dot d dot kang at gmail com>, as
part of Google's GCI 2010.
Originally committed as revision 26137 to svn://svn.ffmpeg.org/ffmpeg/trunk
authors: Holger Lubitz <holger lubitz org>, Jason Garrett-Glaser <darkshikari
gmail com> (approves LGPL relicensing for this code) and Loren Merritt <lorenm
at u dot washington dot edu> (approves LGPL relicensing for this code). Patch
by Daniel Kang <daniel dot d dot kang at gmail com>, as part of Google's GCI
2010.
Originally committed as revision 26135 to svn://svn.ffmpeg.org/ffmpeg/trunk
Original authors: Holger Lubitz <holger lubitz org>, Jason Garrett-Glaser
<darkshikari gmail com> (approves LGPL relicensing for this code) and Loren
Merritt <lorenm at u dot washington dot edu> (approves LGPL relicensing for
this code). Patch by Daniel Kang <daniel dot d dot kang at gmail com>, as
part of Google's GCI 2010.
Originally committed as revision 26132 to svn://svn.ffmpeg.org/ffmpeg/trunk
initially said he'd be OK with relicensing, he also said he wanted to have
another look at the patch, and then he went on vacation, so let's play it
safe for now. We can consider removing this again later.
Originally committed as revision 26131 to svn://svn.ffmpeg.org/ffmpeg/trunk
LGPL relicensing approved by original authors: Holger Lubitz <holger lubitz
org>, Jason Garrett-Glaser <darkshikari gmail com> and Loren Merritt <lorenm
at u dot washington dot edu>. Patch by Daniel Kang <daniel dot d dot kang at
gmail com>, as part of Google's GCI 2010.
Originally committed as revision 26087 to svn://svn.ffmpeg.org/ffmpeg/trunk
This fixes compilation with the latest clang trunk version.
Patch by İsmail Dönmez, ismail at namtrac dot org
Originally committed as revision 25628 to svn://svn.ffmpeg.org/ffmpeg/trunk
These blocks depended on the compiler keeping xmm registers untouched between
them.
Originally committed as revision 25619 to svn://svn.ffmpeg.org/ffmpeg/trunk
suncc does not like the leading commas inside the macro, but it has no problem
with trailing commas.
Originally committed as revision 25615 to svn://svn.ffmpeg.org/ffmpeg/trunk
Some code was initializing some xmm registers in one asm block and using them
in the following block, assuming they wouldn't be changed in between blocks.
Originally committed as revision 25568 to svn://svn.ffmpeg.org/ffmpeg/trunk
prediction (plus some with different rounding for svq3/rv40). Speedup (for
SSSE3) about ~6-fold, 3.6% faster overall with cathedral sample.
Originally committed as revision 25361 to svn://svn.ffmpeg.org/ffmpeg/trunk
Fixes compilation with clang's builtin assembler
Patch by İsmail Dönmez, ismail at namtrac dot org
Originally committed as revision 25331 to svn://svn.ffmpeg.org/ffmpeg/trunk
inline asm works for gcc-3.x also (hopefully). Should fix gcc-3.x FATE
breakage after r25254.
Originally committed as revision 25262 to svn://svn.ffmpeg.org/ffmpeg/trunk
increase to e.g. vc1, snow and mpeg decoding.
Patch by Eli Friedman <eli dot friedman gmail com>.
Originally committed as revision 25259 to svn://svn.ffmpeg.org/ffmpeg/trunk
from memory locations/offsets depending on b_idx plus constants, rather than
having gcc do this. This saves several lea calls and together saves about
10 cycles in h264_loop_filter_strength_mmx2().
Originally committed as revision 25256 to svn://svn.ffmpeg.org/ffmpeg/trunk
a pxor, or remove the instruction alltogether. Altogether, this saves 1
instruction.
Originally committed as revision 25255 to svn://svn.ffmpeg.org/ffmpeg/trunk
This has no measurable speed effect because the surrounding code doesn't
take advantage of this yet.
Originally committed as revision 25254 to svn://svn.ffmpeg.org/ffmpeg/trunk
of the d_idx variable and therefore allows for future optimizations. No speed
difference by this commit itself.
Originally committed as revision 25253 to svn://svn.ffmpeg.org/ffmpeg/trunk
inlining various constants within the loop code. 20 cycles faster on
cathedral sample.
Originally committed as revision 25252 to svn://svn.ffmpeg.org/ffmpeg/trunk
inlines scan8[] and removes loop setup. 15% faster, 0.4% overall.
See "[PATCH] unroll loop in h264_idct_add8_sse2()" thread on ML.
Originally committed as revision 25172 to svn://svn.ffmpeg.org/ffmpeg/trunk
code directly also and remove loop setup. 20% faster in function, 0.8% overall.
See "[PATCH] unroll loop in h264_idct_add8_sse2()" thread on ML.
Originally committed as revision 25171 to svn://svn.ffmpeg.org/ffmpeg/trunk
which will hopefully solve the Win64/FATE failures caused by these functions.
Originally committed as revision 25137 to svn://svn.ffmpeg.org/ffmpeg/trunk
h264dsp_mmx.c to h264_idct.asm (as yasm code). Because the loops are now
coded in asm instead of C, this is (depending on the function) up to 50%
faster for cases where gcc didn't do a great job at looping.
Since h264_idct_add8() is now faster than the manual loop setup in h264.c,
in-asm idct calling can now be enabled for chroma as well (see r16207). For
MMX, this is 5% faster. For SSE2 (which isn't done for chroma if h264.c does
the looping), this makes it up to 50% faster. Speed gain overall is ~0.5-1.0%.
Originally committed as revision 25119 to svn://svn.ffmpeg.org/ffmpeg/trunk
This increases compatibilty with nasm and is also more consistent,
e.g. with h264_intrapred.asm and h264_chromamc.asm that already
do it that way.
Originally committed as revision 25042 to svn://svn.ffmpeg.org/ffmpeg/trunk