Previously the assembly sources had mixed indentation consisting
of both spaces and tabs, making it quite hard to read unless
the right tab size was used in the editor.
Tabs have been interpreted as 4 spaces in most cases, matching
the surrounding code.
Instead of loading the registers one lane at a time, load full
registers and then transpose them.
This is faster, reducing the runtime for the function from about
506 cycles to 434 cycles (tested on a Cortex A8).
This also avoids an issue which seems like a cpu bug, present
on Sony Xperia T (cpu implementer 0x51 architecture 7 variant 0x1
part 0x04d). On such a device, it seemed like the "vswp q9, q10"
could start executing before the previous
vld4.u8 {d20[x],d21[x],d22[x],d23[x]}, [r3], r1
had finished and written back their result. Changing the
"vswp q9, q10" into "vswp q10, q9", or into separate
"vswp d18, d20; vswp d19, d21" (or the other way around) seemed to
avoid the issue. This happened occasionally (a couple times per
100000 invocations or so).
This fixes building with gnu binutils, which don't support this
nonstandard form of the instructions.
Once Apple's tools support the proper standard form of the
instructions, the code should be updated to use that everywhere
instead, and these macros should be removed.
This makes sure the windows version of these functions behave
more like the posix version. The posix *snprintf returns how
much would have been written if the buffer had been large
enough, which we don't know easily in the windows versions.
This basically means that we can assume that the return value is
>= 0 now, which can simplify the calling code.
Windows Phone lacks the old CreateThread/beginthreadex APIs for
creating threads. (Technically, the functions still do exist,
but they aren't officially supported and aren't visible in the
headers when targeting Windows Phone.)
Building code that uses the Windows Runtime language extensions
requires building with the -ZW option.
According to the Win64 ABI, these registers need to be preserved,
and compilers are allowed to rely on their content to stay
available - not only for float usage but for any usage, anywhere,
in the calling C++ code.
This adds a macro which pushes the clobbered registers onto the
stack if targeting win64 (and a matching one which restores them).
The parameter to the macro is the number of xmm registers used
(e.g. if using xmm0 - xmm7, the parameter is 8), or in other
words, the number of the highest xmm register used plus one.
This is similar to how the same issue is handled for the NEON
registers q4-q7 with the vpush instruction, except that they needed
to be preserved on all platforms, not only on one particular platform.
This allows removing the XMMREG_PROTECT_* hacks, which can
easily fail if the compiler chooses to use the callee saved
xmm registers in an unexpected spot.
This is what nasm ended up assembling movsx with 32 bit input to
anyway.
Keep using plain movsx for 16 bit input.
This fixes building with yasm in 64 bit mode.
Remap q5 to q8, q6 to q9, q7 to q10 and q8 to q11, and push
q4 to the stack.
This was missed previously since the codec unittest doesn't
test encoding with loop filter enabled yet.
According to the calling convention, the registers q4-q7 should be
preserved by functions. The caller (generated by the compiler) could
be using those registers anywhere for any intermediate data.
Functions that use more than 12 of the qX registers must push
the clobbered registers on the stack in order to be able to restore them
afterwards.
In functions that don't use all 16 registers, but clobber some of
the callee saved registers q4-q7, one or more of them are remapped
to reduce the number of registers that have to be saved/restored.
This incurs a very small (around 0.5%) slowdown in the decoder and
encoder.
According to the calling convention, the registers q4-q7 should be
preserved by functions. The caller (generated by the compiler) could
be using those registers anywhere for any intermediate data.
Functions that use 12 or less of the qX registers can avoid
violating the calling convention by simply using other registers instead
of the callee saved registers q4-q7.
This change only remaps the registers used within functions - therefore
this does not affect performance at all. E.g. in functions using
registers q0-q7, we now use q0-q3 and q8-q11 instead.
Now calling WelsThreadJoin is enough to finish and clean up
the thread on all platforms.
This unifies the thread cleanup code between windows and unix.
Now all of the threading code should use the exact same codepaths
between windows and unix.
Arm assembly has got two variants of the syntax, the old legacy
syntax, and the new modern UAL (unified assembly language) syntax.
Most arm assembly is the same in the both syntaxes, but some
uncommon cases change the order of suffixes - the "subscs"
instruction would be written "subcss" in the old syntax.
The apple tools default to UAL, while the GNU tools (e.g. in
android) require you to specify ".syntax unified" to enable the
new syntax. When enabling the new syntax with the GNU tools, some
cases of "sub r0, r1, lsl #1" needs to be written explicitly as
"sub r0, r0, r1, lsl #1", handled in the previous commit.
This allows using the same, modern syntax for things like subscs,
without needing to have two alternate forms of writing it.
On arm, the exact same detection is done in WelsCPUFeatureDetect,
but in the x86 version of that function we use x86 cpuid for getting
the core count, and this is not available on all processors. For the
case when cpuid can't tell the core count, use the NDK function as
higher level API.
The thread lib itself doesn't build properly on android yet, but will
do so soon.
This allows making the WelsMultipleEventsWaitSingleBlocking
function work properly in unix, without polling. If a master
event is provided, the function first waits for a signal on
that event - once such a signal is received, it is assumed that
one of the individual events in the list have been signalled as
well. Then the function can proceed to check each of the semaphores
in the list using sem_trywait to find the first one of them that
has been signalled. Assuming that the master event is signalled
in pair with the other events, one of the sem_trywait calls
should succeed.
The same master event is also used in
WelsMultipleEventsWaitAllBlocking, to keep the semaphore values
in sync across calls to the both functions.
All users of the function passed the value corresponding to
"infinite", and the (currently unused) unix implementation of it
only supported infinite wait as well.
This avoids having to hardcode the names of devices that don't
support neon.
The devices that don't support neon don't run the armv7 variants
of iOS binaries at all - they would need to be built for the armv6
architecture. (Building for armv6 isn't supported at all in
modern iOS SDKs.)
Therefore we can simply use the __ARM_NEON__ built-in compiler
define to check if NEON code is allowed in the current build,
and have the WelsCPUFeatureDetect function return flags accordingly.
The only thing this disallows is doing an armv6 build which would
optionally enable neon code at runtime if run on an armv7 capable
device, but since Apple allows you to build the same binary for
armv7 separately in the same app bundle, and since armv6 building
isn't even possible in the current iOS SDKs, this isn't really a loss.
This is in contrast to the android builds where the armv7 baseline
does not include NEON.
This avoids the risk of namespace collisions for named semaphores
(where the names are global for the whole machine), on platforms
where we strictly don't need to use the named semaphores.
This unifies the event creation interface, even if the event
name itself is unused on windows, allowing use the exact same
code to initialize events regardless of the actual platform.
Some ifdefs still remain in the event initialization code, since
some events are only used on windows.
Typedeffing WELS_EVENT as sem_t* makes the typedef behave similarly
to the windows version (typedeffed as HANDLE), unifying the code
that allocates and uses these event objects (getting rid of
most of the need for separate codepaths and ifdefs).
The caller of the function should not need to know exactly which
implementation of it is being used.
For the variants that don't support detecting the number of cores,
the pNumberOfLogicProcessors parameter can be left untouched
and the caller will use a higher level API for finding it out.
This simplifies all the calling code, and simplifies adding
more implementations of cpu feature detection.
The two different variants of the threadlib basically are
win32 and unix - use _WIN32 to check for this consistently,
instead of occasionally using __GNUC__ to enable the unix
codepath. (__GNUC__ is also defined on mingw, which still is
a windows platform and should use the _WIN32 code.)
When adding the (dwMilliseconds % 1000) * 1000000 part
to ts.tv_nsec, the ts.tv_nsec field can grow larger than one
whole second. Therefore first add all of dwMilliseconds to
the tv_nsec field and add all whole seconds to the tv_sec
field instead - this way we make sure that the tv_nsec field
actually is less than a second.
On processors without HTT, WelsCPUFeatureDetect can't return
a number of cores but might still return a nonzero set of
CPU feature flags. Previously the nonzero cpu feature flag
indicated that cpuid worked and the encoder wouldn't use the
higher level API for getting the number of cores, even though the
number of cores was left at 1.
TRUE/FALSE has intentionally been left in use for the few
platform specific APIs that define these constants themselves
and expect them to be used, for consistency.
Instead of byteswapping a 32 bit word and writing it out as a
whole (which could even possibly lead to crashes due to
incorrect alignment on some platforms), write it out explicitly
in the intended byte order.
This avoids having to set a define indicating the endianness.
The code interprets an array of 4 uint8_t values as one uint32_t
and does shifts on the value. The same optimization can be
kept in big endian as well, but the shift has to be done in the
other direction.
This code could be made truly independent of endianness, but
that could cause some minimal performance degradaion, at least
in theory.
This makes "make test" pass on big endian, assuming that
WORDS_BIGENDIAN is defined while building.
This is a more convenient behaviour (truncating on overflow and
always null terminating the buffer) compared to the MSVC
safe strcat_s which aborts the process if the string doesn't fit
into the target buffer.
Also mark the source buffer as const in the function prototype.
Make the MSVC "safe" version truncate instead of aborting the
process if the buffer is too small.
Update all the other functions to use the right parameter
(iSizeInBytes, not iCount) as 'n' parameter to strncpy.
(By passing iCount as parameter to the normal strncpy functions,
it meant that the resulting buffer actually never was null
terminated.)
Additionally make sure that the other implementations of WelsStrncpy
always null terminate the resulting buffer, just as the MSVC safe
version does when passed the _TRUNCATE parameter.
These were essentially useless - if strlen() ever was used as
fallback, it either indicated that those ports of the library
were insecure, or that strnlen never was required at all.
In this case it turned out to be the latter (at least after
the preceding cleanups) - all uses of it were with known null
terminated strings.
As long as WelsFileHandle* is equal to FILE* this doesn't matter,
but for consistency use the WelsF* functions for all handles
opened by WelsFopen, and use WelsFileHandle* as type for it
instead of FILE*.
Both encoder and decoder versions were functionally equivalent,
but I picked the decoder version (but added the static inline
keywords to it) since the encoder one was quite messy with a lot
of commented out code.
If the buffer is too small, there's no guarantee that it is
null terminated. The docs (on both unix and MSVC) say explicitly
that the function returns 0 and the buffer contents are
indeterminate in this case.
Otherwise builds on platforms other than MSVC might be
insecure.
Use vsnprintf_s with the _TRUNCATE flag instead of vsprintf_s
when using MSVC - this truncates the buffer instead of aborting
the whole process in case it's too small.
The decoder used WelsMedian while the encoder used WELS_MEDIAN.
The former has two different implementations, WELS_MEDIAN was
identical to the disabled version of WelsMedian.
Settle on using the same implementation for both decoder and
encoder - whichever version of the implementations is faster
should be used for both.
Also use the __APPLE__ predefined define instead of MACOS for enabling
these code paths.
This also avoids having to link to the CoreServices framework in
order to get the Gestalt function.
This gets rid of the code that parses /proc/cpuinfo, and avoids
forking within the library.
The previous code also failed build on modern glibc versions
due to ignoring the return value of the system, read and write
system calls.
While building succeeds in MSVC 2008, it currently fails in 2012
due to error C1189 "The C++ standard library forbids macroizing
keywords", which is caused by doing "#define inline __inline" in
the macros.h header.
This could have been missed before since it only was triggered
if macros.h was included before some other system header was
included that contained checks against these inline defines.
Using esp works by coincidence as long as the stack pointer is
within the first 4 GB of the address space - which seems to work
as long as the test binary is built with /dynamicbase:no, but breaks
if this option is removed.
Commit f38111d76b updated these files
manually (based on older versions of them) to something not generated
by the current mktargets.py/sh, losing the compact pattern rules.