Lets us do the zeroing in asm instead of C.
Also makes it consistent with the way the regular iDCT code does it.
Originally committed as revision 24668 to svn://svn.ffmpeg.org/ffmpeg/trunk
unchanged bytes) in the horizontal simple loopfilter. This makes the filter
quite a bit faster in itself (~30 cycles less on Core1), probably mostly
because we don't need a complex 4x4 transpose, but only a simple byte
interleave. Also allows using pextrw on SSE4, which speeds up even more
(e.g. 25% faster on Core i7).
Originally committed as revision 24638 to svn://svn.ffmpeg.org/ffmpeg/trunk
5-10% faster or more on Phenom, Athlon 64, and some others.
Helps some on pre-SSSE3 Intel chips as well, but not as much.
Originally committed as revision 24513 to svn://svn.ffmpeg.org/ffmpeg/trunk
mbedge loopfilter functions, by re-using space that holds a variable
that we no longer need.
Originally committed as revision 24510 to svn://svn.ffmpeg.org/ffmpeg/trunk
future new optimizations (imagine a sse5) much easier. Also fix a bug where
we used the direction (%2) rather than optimization (%1) to enable this, which
means it wasn't ever actually used...
Originally committed as revision 24507 to svn://svn.ffmpeg.org/ffmpeg/trunk
splits it into small optimization-specific macros which are selected for each
DSP function. The advantage of this approach is that the sse4 functions now
use the ssse3 codepath also without needing an explicit sse4 codepath.
Originally committed as revision 24487 to svn://svn.ffmpeg.org/ffmpeg/trunk
This is a lot more reliable to get cmov rather than trying to trick gcc into
generating it, useful since it's 2% faster overall.
Patch by Eli Friedman <eli.friedman at gmail>
Originally committed as revision 24471 to svn://svn.ffmpeg.org/ffmpeg/trunk
Take shortcuts based on statistically common situations.
Add 4-at-a-time idct_dc function (mmx and sse2) since rows of 4 DC-only DCT
blocks are common.
TODO: tie this more directly into the MB mode, since the DC-level transform is
only used for non-splitmv blocks?
Originally committed as revision 24452 to svn://svn.ffmpeg.org/ffmpeg/trunk
SSSE3 versions, improve SSE2 versions a bit.
SSE2/SSSE3 mbedge h functions are currently broken, so explicitly disable them.
Originally committed as revision 24403 to svn://svn.ffmpeg.org/ffmpeg/trunk
Avoid pextrw, since it's slow on many older CPUs.
Now it doesn't require mmxext either.
Originally committed as revision 24397 to svn://svn.ffmpeg.org/ffmpeg/trunk
Should fix compilation with icc and should help prevent any future duplicates
Originally committed as revision 24380 to svn://svn.ffmpeg.org/ffmpeg/trunk
regular MMX code. Examples of this are the Core1 CPU. Instead, set a new flag,
FF_MM_SSE2/3SLOW, which can be checked for particular SSE2/3 functions that
have been checked specifically on such CPUs and are actually faster than
their MMX counterparts.
In addition, use this flag to enable particular VP8 and LPC SSE2 functions
that are faster than their MMX counterparts.
Based on a patch by Loren Merritt <lorenm AT u washington edu>.
Originally committed as revision 24340 to svn://svn.ffmpeg.org/ffmpeg/trunk
so that it does both U and V planes at the same time. This will have speed
advantages when using SSE2 (or higher) optimizations, since we can do both
the U and V rows together in a single xmm register.
This also renames filter16 to filter16y and filter8 to filter8uv so that it's
more obvious what each function is used for.
Originally committed as revision 24337 to svn://svn.ffmpeg.org/ffmpeg/trunk
inner loopfilter, and it also allows us to save one register on x86-64/sse2.
Originally committed as revision 24269 to svn://svn.ffmpeg.org/ffmpeg/trunk
Also make some small changes to saturation order of 4-tap SSSE3 MC to fix a
non-bitexactness bug.
Patch mostly by Eli Friedman <eli.friedman AT gmail DOT com>.
Originally committed as revision 23965 to svn://svn.ffmpeg.org/ffmpeg/trunk
Whose idea was it to have a CPU that didn't SIGILL on an invalid instruction?
Originally committed as revision 23927 to svn://svn.ffmpeg.org/ffmpeg/trunk
- MMXEXT, SSE2 and SSSE3 MC functions
- MMX and SSE4 IDCT dc_add functions
Patch by Jason Garrett-Glaser <darkshikari gmail com> and myself.
Originally committed as revision 23815 to svn://svn.ffmpeg.org/ffmpeg/trunk
Modify the asm accordingly.
GLOBAL is now no longoer necessary for PIC-compliant loads.
Originally committed as revision 23739 to svn://svn.ffmpeg.org/ffmpeg/trunk
Passing an explicit filename to this command is only necessary if the
documentation in the @file block refers to a file different from the
one the block resides in.
Originally committed as revision 22921 to svn://svn.ffmpeg.org/ffmpeg/trunk
This moves the H264-specific functions from DSPContext to the new
H264DSPContext. The code is made conditional on CONFIG_H264DSP
which is set by the codecs requiring it.
The qpel and chroma MC functions are not moved as these are used by
non-h264 code.
Originally committed as revision 22565 to svn://svn.ffmpeg.org/ffmpeg/trunk
This moves the DWT functions from snow.c and dsputil.c to a file of
their own. A new struct, DWTContext, holds the function pointers
previously part of DSPContext.
Originally committed as revision 22522 to svn://svn.ffmpeg.org/ffmpeg/trunk
These macros are redundant. All uses are replaced with the generic
DECLARE_ALIGNED macro instead.
Originally committed as revision 22233 to svn://svn.ffmpeg.org/ffmpeg/trunk
SVQ1 chroma has been special-cased aligned to 16-bytes since at least r15466
Other architectures also assume 16-byte alignment here too but set STRIDE_ALIGN
to 16.
Originally committed as revision 21736 to svn://svn.ffmpeg.org/ffmpeg/trunk
This allows to get rid of the macho64 specific hack that moves them
to rodata (with worse cache behaviour) and avoids textrels which
e.g. Gentoo does not allow for x86_64 libraries.
Originally committed as revision 21551 to svn://svn.ffmpeg.org/ffmpeg/trunk