Under some circumstances, suncc will use a single register for the
address of all memory operands, inserting lea instructions loading
the correct address prior to each memory operand being used in the
code. In the yadif code, the branch in the asm block bypasses such
an lea instruction, causing an incorrect address to be used in the
following load.
This patch replaces the tmpX arrays with a single array and uses a
register operand to hold its address. Although this prevents using
offsets from the stack pointer to access these locations, the code
still builds as 32-bit PIC even with old compilers.
Signed-off-by: Mans Rullgard <mans@mansr.com>
This fixes two issues preventing suncc from building this code.
The undocumented 'a' operand modifier, causing gcc to omit a $ in
front of immediate operands (as required in addresses), is not
supported by suncc. Luckily, the also undocumented 'c' modifer
has the same effect and is supported.
On some asm statements with a large number of operands, suncc for no
obvious reason fails to correctly substitute some of the operands.
Fortunately, some of the operands in these statements are plain
numbers which can be inserted directly into the code block instead
of passed as operands.
With these changes, the code builds correctly with both gcc and
suncc.
Signed-off-by: Mans Rullgard <mans@mansr.com>
This code contains a C array of addresses of labels defined in
inline asm. To do this, the names must be declared as external
in C. The declared type does not matter since only the address is
used, and for some reason, the author of the code used the 'void'
type despite taking the address of a void expression being invalid.
Changing the type to char, a reasonable choice since the alignment
of the code labels cannot be known or guaranteed, eliminates gcc
warnings and allows building with suncc.
Signed-off-by: Mans Rullgard <mans@mansr.com>
many branches and cases of scale_vector are irrelevant for the case here
and by inlining they can be reliably removed.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
It appears someone thinks this special case can be reached
Well, it cannot, thus not only do we not need to optimize it
we dont need it at all
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* qatar/master: (22 commits)
g723.1: do not pass large structs by value
g723.1: do not bounce intermediate values via memory
g723.1: declare a variable in the block it is used
g723.1: avoid saving/restoring excitation
g723.1: avoid unnecessary memcpy() in residual_interp()
g723.1: make postfilter write directly to output buffer
g723.1: drop unnecessary variable buf_ptr in formant_postfilter()
g723.1: make scale_vector() output to a separate buffer
g723.1: make autocorr_max() work on an arbitrary buffer
g723.1: do not needlessly use int64_t
g723.1: use saturating addition functions
g723.1: optimise scale_vector()
g723.1: remove useless uses of MUL64()
g723.1: remove unnecessary argument 'shift' from dot_product()
g723.1: deobfuscate "(x << 4) - x" to "15 * x"
celp: optimise ff_celp_lp_synthesis_filter()
libavutil: add saturating addition functions
cllc: Implement ARGB support
cllc: Add support for QRGB
cllc: Rename some funcs to represent what they actually do
...
Conflicts:
LICENSE
libavcodec/g723_1.c
libavcodec/x86/Makefile
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Allow to select specific documentation components, and reliably check for
component dependencies.
In particular, check for perl presence on the system.
This is a port of the MPlayer hue filter (libmpcodecs/vf_hue.c) by
Michael Niedermayer.
Signed-off-by: Jérémy Tran <tran.jeremy.av@gmail.com>
Signed-off-by: Stefano Sabatini <stefasab@gmail.com>
Although a reasonable compiler will probably optimise out the
actual store and load, this operation still implies a truncation
to 16 bits which the compiler will probably not realise is not
necessary here.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Writing the scaled excitation to a scratch buffer (borrowing the
'audio' array) instead of modifying it in place avoids the need
to save and restore the unscaled values.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Use saturating addition functions instead of 64-bit intermediates
and separate clipping. This is much faster when dedicated
instructions are available.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Firstly, nothing in this function can overflow 32 bits so the use
of a 64-bit type is completely unnecessary. Secondly, the scale
is either a power of two or 0x7fff. Doing separate loops for these
cases avoids using multiplications. Finally, since only the number
of bits, not the actual value, of the maximum value is needed, the
bitwise or of all the values serves the purpose while being faster.
It is worth noting that even if overflow could happen, it was not
handled correctly anyway.
Signed-off-by: Mans Rullgard <mans@mansr.com>
The operands in both cases are 16-bit so cannot overflow a 32-bit
destination. In gain_scale() the inputs are reduced to 14-bit,
so even the shift cannot overflow.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Adding instead of subtracting the products in the loop allows the
compiler to generate more efficient multiply-accumulate instructions
when 16-bit multiply-subtract is not available. ARM has only
multiply-accumulate for 16-bit operands. In general, if only one
variant exists, it is usually accumulate rather than subtract.
In the same spirit, using the dedicated saturation function enables
use of any special optimised versions of this.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Fixed-point audio codecs often use saturating arithmetic, and
special instructions for these operations are common.
Signed-off-by: Mans Rullgard <mans@mansr.com>
The script was previously run with perl -w through the shebang
command. Now that the script is executed through direct perl invocation
the -w in the shebang command is ignored. This patch re-enables "use
warnings" whatever way the script is invoked.
Idea-By: jamal <jamrial@gmail.com>
About 30% faster on 32 bit Atom, 120% faster on 64 bit Phenom2.
This is interesting because supporting P16 is easier in e.g.
OpenGL (can misuse support for any 2-component 8 bit format),
whereas supporting p9/p10 without conversion needs a texture
format with at least 14 bits actual precision.
The shiftonly == 0 case is not optimized since the code is more
complex and the speed gain less obvious.
Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
It appears this function is not available everywhere
Should fix Ticket1525
Reviewed-by: Marton Balint <cus@passwd.hu>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>