SSE-optimized hScale() scales up to 4 pixels at once, so we need to
allocate up to 3 padding pixels to prevent overreads. This fixes
valgrind errors in various swscale-tests on fate.
Speed: from 3.9x to 9.6x speed improvement over C, and some small
(up to 15%) speed improvements over existing MMX code (particularly
for bigger filters).
This allows using more specific implementations for chroma/luma, e.g.
we can make assumptions on filterSize being constant, thus avoiding
that test at runtime.
It just does that part in scalar form, I doubt using a vector store
over 2 array would speed it up particularly.
The function should be written to not use a scratch buffer.
The logged information is possibly false, and it tends to be outdated
after each change since the logging code needs to be manually updated.
Simplify and prevent confusing wrong debug messages.
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
Also remove the unnecessary isSupportedIn/Out macros.
Make the code more compact/readable, and simplify the access to
lsws-specific pixel format information.
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
When converting RGB format to RGB format with the same bits per sample,
unscaled path performs conversion on the whole buffer at once. For
non-multiple-of-16 BGR24 to RGB24 conversion it means that padding at the
end of line will be converted too. Since it may be of arbitrary length
(e.g. 8 bytes), operating on the whole buffer produces obviously wrong
results.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
ptrdiff_t can be 4 bytes, which leads to the next element being 4-byte
aligned and thus at a different offset than intended. Forcing 8-byte
alignment forces equal offset of dither16/32 on x86-32 and x86-64.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
We operated on 31-bits, but with e.g. lanczos scaling, values can
add up to beyond 0x80000000, thus leading to output of zeroes. Drop
one bit of precision fixes this.
Remove unused variables "flags" and "dstFormat" in yuv2packed1,
merge source rows per plane for yuv2packed[12], and make every
source argument int16_t (some where invalidly set to uint16_t).
This prevents stack pollution and is part of the Great Evil Plan
to simplify swscale.
This will likely lead to a considerable performance boost,
since it removes a branch from the inner loop. Part of the
Great Evil Plan to simplify swscale.
On architectures such as x86 (both 32 bit and 64bit), the stack element
size is fixed, which maintains alignment. Here, this change does not
break anything. However, we also support also other architectures where
this property is not maintained and therefore, applications will crash
horribly.
This change effectively forces all applications to be recompiled against
libswscale.
This is part of the Great Evil Plan to simplify swscale. Note that
you'll see some code duplication between the output functions for
different RGB variants, and even between packed-YUV and RGB
variants. This is intentional because it improves readability.
Inline functions are easier to read, maintain, modify and test,
which justifies the slightly increased source size. This patch
also adds support for non-native endianness RGB15/16 and fixes
isSupportedOutput() to no longer claim that we support writing
non-native RGB565/555/444.
Remove inline keyword from functions that are never inlined.
Use av_always_inline for functions that should be force-inlined
for performance reasons. Use av_cold for init functions.
Remove inline keyword for functions that are only called through
their function pointers (and thus cannot be inlined); add av_cold
keyword to init function, and use av_always_inline instead of
inline for functions that must be inlined for performance reasons.
This prevents the following compiler warnings: "warning:
initialization from incompatible pointer type". Since the
variables are only ever used in inline assembly, their type
is actually irrelevant (so the part where it was wrong did
not invoke any buggy behaviour).
They are hacks added to reuse the same scaling function for
different formats and they may cause problems when SIMD
implementation of the same functions are used along with pure
C functions.
Remove duplicate "inC" and "_c" functions that do the same thing;
give each function that handles data and acts as a function pointer
a "_c" suffix; remove "_c" suffix from functions that are inherently
not optimizable. Remove inline keyword from functions that are only
used through function pointers.
Many functions have such a prefix, but do not actually use any
instructions or features from that set, thus giving the false
impression that swscale is highly optimized for a particular
system, whereas in reality it is not.
Interleave macros and code so that it's easier to find the
actual code that belongs to a function. Also reindent where
appropriate and remove dead code.
Instead, only set the function pointers if bitexact flag is
not set during initialization. Since a change in flags triggers
a re-init anyway, this doesn't situations where flag values
change during runtime.
The functions are identical to their MMX counterparts. Thus,
pretending that swscale is highly optimized for AMD3DNOW
extensions is a poorly executed practical joke at best.
Also remove code that overwrites the C versions of functions in
sws_init_swScale_altivec(), so that it uses the C functions of files
if no altivec-optimized version exists.
Adding _POSIX_C_SOURCE to CPPFLAGS globally produces all sorts of problems
since it causes certain system functions to be hidden on some (BSD) systems.
The solution is to only add the flag on systems that really require it, i.e.
glibc-based ones.
This change makes BSD systems compile out-of-the-box without the need for
adding specific flags manually. It also allows dropping a number of flags
set manually on a file-per-file basis, but were only present to work around
breakage introduced by the presence of _POSIX_C_SOURCE.
Also add _XOPEN_SOURCE to CPPFLAGS for glibc systems. We use XSI extensions
in several places already, so it is preferable to define it globally instead
of littering source files with individual #defines only needed for glibc.
Fix handling of input if not in native endianness, and add support for
9/10-bit output. This allows us to force endianness of YUV420P 9/10bit
in the H264/10bit fate tests, which should fix them on big-endian
systems.
PPC and x86 code is split off from swscale_template.c. Lots of code is
still duplicated and should be removed later.
Again uniformize the init system to be more similar to the dsputil one.
Unset h*scale_fast in the x86 init in order to make the output
consistent with the previous status. Thanks to Josh for spotting it.
Keep only the plain C code in the main rgb2rgb.c and move the x86
specific optimizations to x86/rgb2rgb.c
Change the initialization pattern a little so some of it can be
factorized to behave more like dsputils.
When HAVE_7REGS was not defined these functions had an empty body
causing the following warnings during compilation.
In file included from libswscale/x86/yuv2rgb_mmx.c:58:
libswscale/x86/yuv2rgb_template.c: In function ‘yuva420_rgb32_MMX’:
libswscale/x86/yuv2rgb_template.c:412: warning: no return statement in function returning non-void
libswscale/x86/yuv2rgb_template.c: In function ‘yuva420_bgr32_MMX’:
libswscale/x86/yuv2rgb_template.c:457: warning: no return statement in function returning non-void
Signed-off-by: Diego Biurrun <diego@biurrun.de>
It is pretty hopeless that other considerable projects will adopt
libavutil alone in other projects. Projects that need small footprint
are better off with more specialized libraries such as gnulib or rather
just copy the necessary parts that they need. With this in mind, nobody
is helped by having libavutil and libavcore split. In order to ease
maintenance inside and around FFmpeg and to reduce confusion where to
put common code, avcore's functionality is merged (back) to avutil.
Signed-off-by: Reinhard Tartler <siretart@tauware.de>