Now calling WelsThreadJoin is enough to finish and clean up
the thread on all platforms.
This unifies the thread cleanup code between windows and unix;
all of the threading code should now use the exact same codepaths
on both platforms.
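As a rough sketch (not the actual source), the unix side only needs
pthread_join for this, since joining a joinable pthread both waits for
it to finish and releases its resources; the WELS_THREAD_HANDLE typedef
below is an assumption made for the sketch:

    #include <pthread.h>

    typedef pthread_t WELS_THREAD_HANDLE;  // assumed shape for this sketch

    long WelsThreadJoin (WELS_THREAD_HANDLE thread) {
      // pthread_join both waits for the thread to finish and frees its
      // resources, so no separate destroy/cleanup step is needed after it.
      return pthread_join (thread, NULL);
    }
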
Arm assembly has got two variants of the syntax: the legacy syntax,
and the modern UAL (unified assembly language) syntax.
Most arm assembly is the same in both syntaxes, but some
uncommon cases change the order of suffixes - the "subscs"
instruction would be written "subcss" in the legacy syntax.
The apple tools default to UAL, while the GNU tools (e.g. in
android) require you to specify ".syntax unified" to enable the
new syntax. When enabling the new syntax with the GNU tools, some
cases of "sub r0, r1, lsl #1" needs to be written explicitly as
"sub r0, r0, r1, lsl #1", handled in the previous commit.
This allows using the same, modern syntax for things like subscs,
without needing to have two alternate forms of writing it.
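As a purely illustrative sketch (not project code), the directive and
spellings in question look like this when written as GCC/Clang inline
assembly for 32-bit arm; the function itself is made up just to show
the syntax:

    __attribute__((naked)) int UalSyntaxDemo (int, int) {
      __asm__ volatile (
        ".syntax unified           \n" // needed for GNU as, default on apple
        "subscs r0, r0, r1         \n" // UAL order: S before the CS condition;
                                       // the legacy syntax spells this "subcss"
        "sub    r0, r0, r1, lsl #1 \n" // explicit form required by GNU as
                                       // in unified mode
        "bx     lr                 \n"
      );
    }
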
There's a different version of the same function in the encoder,
but they're not identical - the encoder version has got stricter
alignment requirements.
If someone can confirm that it is ok to use the function from the
encoder, pixel_sad_neon.S in processing could be deleted, and the
encoder version moved to codec/common instead.
This avoids using a separate thread for handling pUpdateMbListEvent
events, and later allows using the encode exit event on unix instead
of pthread cancellation.
This allows using the same codepath for both unix and windows
for distributing new slices to encode to the worker threads.
This also improves the performance on unix - instead of waiting
for all the current threads to finish their current slice
before handing out a new slice to each of them (where the threads
that finish first will just wait instead of immediately getting
a new slice to work on), we now use the same logic as on windows.
In one setup, it improves the performance of encoding from ~920 fps
to ~950 fps, and in another setup it goes from ~390 fps to ~660 fps.
(These tests were done with the SM_ROWMB_SLICE mode, which
heavily exercises the code for distributing new slices to the
worker threads.)
The extra WelsEventSignal call on windows where it isn't strictly
necessary doesn't incur any measurable slowdown, so it is kept
without any extra ifdefs to keep the code more readable and unified.
On arm, the exact same detection is done in WelsCPUFeatureDetect,
but in the x86 version of that function we use x86 cpuid for getting
the core count, and this is not available on all processors. For the
case when cpuid can't tell the core count, use the NDK function as a
higher level API.
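A hedged sketch of that fallback, assuming the NDK function referred
to is android_getCpuCount() from the cpufeatures helper library:

    #include <cpu-features.h>  // NDK sources/android/cpufeatures

    // Illustrative only: prefer the cpuid-derived count when it is known,
    // otherwise ask the OS through the NDK helper.
    static int GetCoreCount (int iCoresFromCpuid) {
      if (iCoresFromCpuid > 0)
        return iCoresFromCpuid;
      return android_getCpuCount ();
    }
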
The thread lib itself doesn't build properly on android yet, but will
do so soon.
This allows making the WelsMultipleEventsWaitSingleBlocking
function work properly in unix, without polling. If a master
event is provided, the function first waits for a signal on
that event - once such a signal is received, it is assumed that
one of the individual events in the list has been signalled as
well. Then the function can proceed to check each of the semaphores
in the list using sem_trywait to find the first one of them that
has been signalled. Assuming that the master event is signalled
in pair with the other events, one of the sem_trywait calls
should succeed.
The same master event is also used in
WelsMultipleEventsWaitAllBlocking, to keep the semaphore values
in sync across calls to both functions.
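A minimal sketch of the scheme (assuming WELS_EVENT is a sem_t*, as
described further down, and that the master event gets exactly one
signal per signal on any individual event):

    #include <semaphore.h>

    typedef sem_t* WELS_EVENT;

    // Returns the index of the first event found signalled, or -1 on error.
    static int WaitSingleBlockingSketch (int iCount, WELS_EVENT* pEventList,
                                         WELS_EVENT pMasterEvent) {
      // Block on the master event - no polling or timed waits needed.
      if (sem_wait (pMasterEvent) != 0)
        return -1;
      // One of the individual events is expected to be signalled as well,
      // so a non-blocking scan with sem_trywait should find it.
      for (int i = 0; i < iCount; i++) {
        if (sem_trywait (pEventList[i]) == 0)
          return i;
      }
      return -1;  // master and individual events got out of sync
    }
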
All users of the function passed the value corresponding to
"infinite", and the (currently unused) unix implementation of it
only supported infinite wait as well.
This avoids having to hardcode the names of devices that don't
support neon.
The devices that don't support neon don't run the armv7 variants
of iOS binaries at all - they would need to be built for the armv6
architecture. (Building for armv6 isn't supported at all in
modern iOS SDKs.)
Therefore we can simply use the __ARM_NEON__ built-in compiler
define to check if NEON code is allowed in the current build,
and have the WelsCPUFeatureDetect function return flags accordingly.
The only thing this disallows is doing an armv6 build that would
optionally enable neon code at runtime when run on an armv7 capable
device. But since Apple lets you include a separate armv7 build of
the same binary in the same app bundle, and since armv6 building
isn't even possible in the current iOS SDKs, this isn't really a loss.
This is in contrast to the android builds where the armv7 baseline
does not include NEON.
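A sketch of that check (the flag name and value are placeholders here,
not necessarily what the real WelsCPUFeatureDetect returns):

    #ifndef WELS_CPU_NEON
    #define WELS_CPU_NEON 0x000004  // placeholder bit for this sketch
    #endif

    static unsigned int DetectIosArmFeatures (void) {
      unsigned int uiFeatures = 0;
    #ifdef __ARM_NEON__
      // An armv7 iOS build implies NEON support, so the flag can be set
      // unconditionally at compile time.
      uiFeatures |= WELS_CPU_NEON;
    #endif
      return uiFeatures;
    }
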
This avoids the risk of namespace collisions for named semaphores
(where the names are global for the whole machine), on platforms
where we strictly don't need to use the named semaphores.
This unifies the event creation interface, even if the event
name itself is unused on windows, allowing the exact same code
to be used to initialize events regardless of the actual platform.
Some ifdefs still remain in the event initialization code, since
some events are only used on windows.
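A hedged sketch of what such a unified creation call can look like,
assuming that platforms without unnamed semaphore support (OS X, for
instance) are the ones that still need sem_open, while the name is
simply ignored everywhere else:

    #include <fcntl.h>
    #include <semaphore.h>
    #include <stdlib.h>

    typedef sem_t* WELS_EVENT;

    static int WelsEventOpenSketch (WELS_EVENT* pEvent, const char* kpName) {
    #ifdef __APPLE__
      // Only named semaphores are available, so the name is actually used.
      *pEvent = sem_open (kpName, O_CREAT, 0644, 0);
      return (*pEvent == SEM_FAILED) ? -1 : 0;
    #else
      // An unnamed semaphore avoids machine-global name collisions entirely.
      (void) kpName;
      *pEvent = (sem_t*) malloc (sizeof (sem_t));
      if (*pEvent == NULL)
        return -1;
      return sem_init (*pEvent, 0, 0);
    #endif
    }
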
There is no point in doing a timed wait here - there's no work
that we can do if the wait timed out, and sleeping for 1 ms
in between doesn't help; it only adds potential extra latency
to reacting to threads that need more work to do.
Typedeffing WELS_EVENT as sem_t* makes the typedef behave similarly
to the windows version (typedeffed as HANDLE), unifying the code
that allocates and uses these event objects (getting rid of
most of the need for separate codepaths and ifdefs).
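In other words, roughly (a sketch of the arrangement, not a quote of
the actual header):

    #ifdef _WIN32
    #include <windows.h>
    typedef HANDLE WELS_EVENT;
    #else
    #include <semaphore.h>
    typedef sem_t* WELS_EVENT;  // a pointer, so it is allocated and passed
                                // around much like a HANDLE
    #endif
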
The caller of the function should not need to know exactly which
implementation of it is being used.
For the variants that don't support detecting the number of cores,
the pNumberOfLogicProcessors parameter can be left untouched
and the caller will use a higher level API for finding it out.
This simplifies all the calling code, and simplifies adding
more implementations of cpu feature detection.
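A sketch of the resulting calling pattern (the signature is assumed
from the description above, and sysconf is just one example of a
higher level fallback):

    #include <stdint.h>
    #include <unistd.h>

    uint32_t WelsCPUFeatureDetect (int32_t* pNumberOfLogicProcessors);

    static int32_t GetUsableCoreCount (void) {
      int32_t iCores = 0;
      WelsCPUFeatureDetect (&iCores);
      if (iCores <= 0) {
        // This detection variant left the value untouched, so fall back
        // to asking the OS.
        iCores = (int32_t) sysconf (_SC_NPROCESSORS_ONLN);
      }
      return iCores > 0 ? iCores : 1;
    }
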
The two different variants of the threadlib are basically
win32 and unix - use _WIN32 to check for this consistently,
instead of occasionally using __GNUC__ to enable the unix
codepath. (__GNUC__ is also defined on mingw, which still is
a windows platform and should use the _WIN32 code.)
The iFrameWidth/iFrameHeight fields are already aligned by the
SetActualPicResolution() function. Previously when iFrameWidth was
aligned directly in ParamBaseTranscode, this aligned value was used
to set iActualWidth/iActualHeight - losing the original, cropped
size.
This makes sure the output bitstream from the test of encoding
res/Static_152_100.yuv actually is cropped as it should.
When adding the (dwMilliseconds % 1000) * 1000000 part
to ts.tv_nsec, the ts.tv_nsec field can grow larger than one
whole second. Therefore first add all of dwMilliseconds to
the tv_nsec field, then move all whole seconds over to the
tv_sec field - this way we make sure that the tv_nsec field
actually ends up less than a second.
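A sketch of that normalization (an illustrative helper, not the actual
source):

    #include <stdint.h>
    #include <time.h>

    static void AddMillisecondsSketch (struct timespec* ts,
                                       uint32_t dwMilliseconds) {
      // Accumulate everything into nanoseconds first, then move the whole
      // seconds over to tv_sec, so tv_nsec always ends up below one second.
      int64_t iNsec = (int64_t) ts->tv_nsec
                    + (int64_t) dwMilliseconds * 1000000;
      ts->tv_sec  += (time_t) (iNsec / 1000000000);
      ts->tv_nsec  = (long) (iNsec % 1000000000);
    }
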
This simplifies forward source compatibility - when new fields are
added to SEncParamExt, this method makes sure those fields are
initialized to their default values. Otherwise all API users would
have to manually check SEncParamExt every time it is updated, to make
sure there are no new fields that should be set to a nonzero value by
default (e.g. bEnableFrameSkip).
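A sketch of the intended usage pattern, assuming the method in
question is exposed through the public encoder API as GetDefaultParams
(the overridden fields below are just examples):

    #include "codec_api.h"  // openh264 public API header

    static int InitEncoderSketch (ISVCEncoder* pEncoder) {
      SEncParamExt sParam;
      // Every field - including ones added to the struct later - gets its
      // default value (e.g. bEnableFrameSkip).
      pEncoder->GetDefaultParams (&sParam);
      sParam.iPicWidth      = 640;  // override only what we care about
      sParam.iPicHeight     = 480;
      sParam.iTargetBitrate = 500000;
      return pEncoder->InitializeExt (&sParam);
    }
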
On processors without HTT, WelsCPUFeatureDetect can't return
a number of cores but might still return a nonzero set of
CPU feature flags. Previously, a nonzero set of cpu feature flags
was taken to mean that cpuid had worked, so the encoder wouldn't use
the higher level API for getting the number of cores, even though
the number of cores had been left at 1.
This reduces the huge number of near-useless small extra files
scattered around for the sake of the platform demo projects.
This requires explicitly listing all the necessary include paths.
A similar parameter already exists in the other version of
the ParseCommandLine function.
WelsStderrSetTraceLevel isn't one of the functions that is
exported from wels.dll (or welsenc.dll) though, so this wouldn't
work if we tried to link to the encoder library dynamically -
but linking dynamically doesn't work currently anyway, since the
function is already referenced.
Previously the loop filter was unconditionally enabled
regardless of what encoder parameter was set. If using
SEncParamBase instead, the loop filter was always disabled.
Previously, these fields kept whatever value was set by
FillDefault. The corresponding fields were set properly within
sSpatialLayers, but the fields within the main struct were left
with the default values.
This doesn't change the hashes in the unit test, since these
fields don't seem to be used in the produced bitstream at all.