This removes a dependency on implementation details from generic
code and allows easy addition of the equivalent optimisation for
other architectures than x86.
Signed-off-by: Mans Rullgard <mans@mansr.com>
This adds a hand-optimized assembly version for get_cabac much like the
existing one, but it works if the table offsets are RIP-relative.
Compared to the non-RIP-relative version this adds 2 lea instructions
and it needs one extra register. get_cabac() gets about 40% faster, for
an overall speedup of about 5%.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
The reason is this is easier for PIC code (in particular on darwin...).
Keep the old names as pointers (static in cabac_functions.h so gcc
knows these are just immediate offsets) so the c code can nicely stay the same
(alternatively could use offsets directly in the functions needing the
tables). This should produce the same code as before with non-pic and better
code (confirmed) with pic.
The assembly uses the new table but still won't work for PIC case.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Conversion of the luma intra prediction mode to one of the constrained
("alzheimer") ones can happen by crafting special bitstreams, causing
a crash because we'll call a NULL function pointer for 16x16 block intra
prediction, since constrained intra prediction functions are only
implemented for chroma (8x8 blocks).
Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
CC: libav-stable@libav.org
This fixes standalone compilation of some decoders with --disable-optimizations.
cabac.h defines some inline functions that use symbols from cabac.c. Without
optimizations these inline functions are not eliminated and linking fails with
references to non-existing symbols.
Splitting the inline functions off into their own header and only #including
it in the places where the inline functions are used allows #including cabac.h
from anywhere without ill effects.
Originally, prior to 8742a4ff8, the caller code was compiled
within this condition:
ARCH_X86 && HAVE_7REGS && HAVE_EBX_AVAILABLE && !defined(BROKEN_RELOCATIONS)
Since HAVE_7REGS is defined as
(ARCH_X86_64 || (HAVE_EBX_AVAILABLE && HAVE_EBP_AVAILABLE))
the subcondition HAVE_7REGS && HAVE_EBX_AVAILABLE is equal
to HAVE_7REGS (for 32 bit at least). The correct simplification
of the original condition thus is HAVE_7REGS, not
HAVE_EBX_AVAILABLE.
This fixes compilation in some cases where HAVE_EBP_AVAILABLE = 0
and HAVE_EBX_AVAILABLE = 1.
Signed-off-by: Martin Storsjö <martin@martin.st>
On 32-bit OS X with gcc 4.0/4.2 and shared libraries enabled, the ebx register
is not available, but required to assemble the functions.
This reverts commit 8742a4f to a simplified version of the original constraints.
FF_COMMON_FRAME holds the contents of the AVFrame structure and is also copied
to struct Picture. Replace by an embedded AVFrame structure in struct Picture.
In high bit depth, the QP values may now be up to (51 + 6*(bit_depth-8)).
Preparatory patch for high bit depth h264 decoding support.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
In high bit depth the pixels will not be stored in uint8_t like in the
normal case, but in uint16_t. The pixel size is thus 1 in normal bit
depth and 2 in high bit depth.
Preparatory patch for high bit depth h264 decoding support.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Improves CABAC performance about ~1.2%.
Trick originates from x264 and has also been used in ffvp8. It's useful because
coded block flags are usually zero, so it helps to have the early termination
inlined into the main function.
Originally committed as revision 26375 to svn://svn.ffmpeg.org/ffmpeg/trunk
No speed improvement, but necessary for some future stuff.
Also opens up the possibility of asm chroma dc idct/dequant.
Originally committed as revision 26349 to svn://svn.ffmpeg.org/ffmpeg/trunk
Useful so that we don't have to run the hierarchical DC iDCT if there aren't
any coefficients. Opens up some future opportunities for optimization as well.
Originally committed as revision 26337 to svn://svn.ffmpeg.org/ffmpeg/trunk
About 2.5x the speed.
NOTE: the way that the asm code handles large qmuls is a bit suboptimal.
If x264-style dequant was used (separate shift and qmul values), it might
be possible to get some extra speed.
Originally committed as revision 26336 to svn://svn.ffmpeg.org/ffmpeg/trunk
Passing an explicit filename to this command is only necessary if the
documentation in the @file block refers to a file different from the
one the block resides in.
Originally committed as revision 22921 to svn://svn.ffmpeg.org/ffmpeg/trunk
same speed as far as i can meassure but it might have fewer branches on some
archs.
Idea from x264 / jason
Originally committed as revision 22027 to svn://svn.ffmpeg.org/ffmpeg/trunk
the row code. This function would only be needed on a MB basis for MBAFF+FMO
Originally committed as revision 21860 to svn://svn.ffmpeg.org/ffmpeg/trunk
Sadly this is just per slice so the speedup with normal files should be negligible.
Originally committed as revision 21859 to svn://svn.ffmpeg.org/ffmpeg/trunk
Run loop filter per row instead of per MB, this also should make it
much easier to switch to per frame filtering and also doing so in a
seperate thread in the future if some volunteer wants to try.
Overall decoding speedup of 1.7% (single thread on pentium dual / cathedral sample)
This change also allows some optimizations to be tried that would not have
been possible before.
Originally committed as revision 21270 to svn://svn.ffmpeg.org/ffmpeg/trunk
No benchmark because its just replacing variables with litteral constants
(so no risk for slowdown outside gcc silliness) and i need sleep.
Originally committed as revision 21237 to svn://svn.ffmpeg.org/ffmpeg/trunk