[/
          Copyright Oliver Kowalke 2017.
 Distributed under the Boost Software License, Version 1.0.
    (See accompanying file LICENSE_1_0.txt or copy at
          http://www.boost.org/LICENSE_1_0.txt
]

[#tuning]
[section:tuning Tuning]

[heading Disable synchronization]

With [link cross_thread_sync `BOOST_FIBERS_NO_ATOMICS`] defined at the
compiler's command line, synchronization between fibers (running in different
threads) is disabled. This is acceptable if the application is single-threaded
and/or fibers are not synchronized between threads.

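For illustration, the macro could be passed to a build like this (a
hypothetical invocation; adjust toolset, paths, and library names to your
environment):

```shell
# Define BOOST_FIBERS_NO_ATOMICS for the whole build via b2:
b2 toolset=gcc define=BOOST_FIBERS_NO_ATOMICS

# Or directly at the compiler's command line:
g++ -DBOOST_FIBERS_NO_ATOMICS -I/path/to/boost app.cpp -lboost_fiber -lboost_context
```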
[heading Memory allocation]

The memory allocation algorithm is significant for performance in a
multithreaded environment, especially for __boost_fiber__, where fiber stacks
are allocated on the heap. The default user-level memory allocator (UMA) of
glibc is ptmalloc2, but it can be replaced by another UMA that fits the
concrete workload better. For instance, Google's
[@http://goog-perftools.sourceforge.net/doc/tcmalloc.html TCmalloc] yields
better performance in the ['skynet] microbenchmark than glibc's default memory
allocator.

[heading Scheduling strategies]

The fibers in a thread are coordinated by a fiber manager. Fibers trade control
cooperatively, rather than preemptively.
Depending on the work-load, several strategies of scheduling the fibers are
possible [footnote 1024cores.net:
[@http://www.1024cores.net/home/scalable-architecture/task-scheduling-strategies Task Scheduling Strategies]]
that can be implemented on behalf of __algo__.

* work-stealing: ready fibers are held in a local queue; when the
  fiber-scheduler's local queue runs out of ready fibers, it randomly
  selects another fiber-scheduler and tries to steal a ready fiber from the
  victim (implemented in __work_stealing__ and __numa_work_stealing__)

* work-requesting: ready fibers are held in a local queue; when the
  fiber-scheduler's local queue runs out of ready fibers, it randomly
  selects another fiber-scheduler and requests a ready fiber; the victim
  fiber-scheduler sends a ready fiber back

* work-sharing: ready fibers are held in a global queue; fiber-schedulers
  concurrently push and pop ready fibers to/from the global queue
  (implemented in __shared_work__)

* work-distribution: fibers that become ready are proactively distributed to
  idle fiber-schedulers or fiber-schedulers with low load

* work-balancing: a dedicated (helper) fiber-scheduler periodically collects
  information about all fiber-schedulers running in other threads and
  re-distributes ready fibers among them

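The work-stealing strategy above can be sketched as follows. This is a toy,
single-threaded model (not Boost.Fiber's actual implementation): each
scheduler owns a local queue of ready "fibers" (modelled here as plain
integers), and a scheduler whose queue ran empty picks a random victim and
steals from the opposite end of the victim's queue.

```cpp
#include <cstddef>
#include <deque>
#include <random>
#include <vector>

// Toy model: one Scheduler per thread, each with a local ready-queue.
struct Scheduler {
    std::deque<int> ready;  // local queue of ready fibers
};

int pick_next(std::vector<Scheduler>& scheds, std::size_t self,
              std::minstd_rand& rng) {
    Scheduler& me = scheds[self];
    if (!me.ready.empty()) {                 // fast path: own local queue
        int fiber = me.ready.front();
        me.ready.pop_front();
        return fiber;
    }
    // Local queue is empty: try a few randomly selected victims first ...
    std::uniform_int_distribution<std::size_t> dist(0, scheds.size() - 1);
    for (int attempt = 0; attempt < 16; ++attempt) {
        std::size_t victim = dist(rng);
        if (victim == self || scheds[victim].ready.empty()) continue;
        int fiber = scheds[victim].ready.back();  // steal from the other end
        scheds[victim].ready.pop_back();
        return fiber;
    }
    // ... then fall back to a deterministic sweep so the sketch always
    // finds work if any scheduler has some.
    for (std::size_t v = 0; v < scheds.size(); ++v) {
        if (v == self || scheds[v].ready.empty()) continue;
        int fiber = scheds[v].ready.back();
        scheds[v].ready.pop_back();
        return fiber;
    }
    return -1;  // no ready fiber anywhere
}
```

Stealing from the opposite end of the victim's queue reduces contention with
the victim, which pops from the front of its own queue.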
[heading TTAS locks]

Boost.Fiber internally uses spinlocks to protect critical regions if fibers
running on different threads interact.
Spinlocks are implemented as TTAS (test-test-and-set) locks, i.e. the spinlock
tests the lock before calling an atomic exchange. This strategy helps to reduce
the cache-line invalidations triggered by acquiring/releasing the lock.

[heading Spin-wait loop]

A lock is considered to be under contention if a thread repeatedly fails to
acquire the lock because some other thread was faster.
Waiting for a short time lets other threads finish before trying to enter the
critical section again. While busy waiting on the lock, relaxing the CPU (via
the pause/yield mnemonic) gives the CPU a hint that the code is in a spin-wait
loop:

* it prevents expensive pipeline flushes (speculatively executed load and
  compare instructions are not pushed into the pipeline)

* another hardware thread (simultaneous multithreading) can get a time slice

* it does delay a few CPU cycles, but this is necessary to prevent starvation

Obviously this strategy is useless on single-core systems, because the lock
can only be released if the thread gives up its time slice in order to let
other threads run. The macro BOOST_FIBERS_SPIN_SINGLE_CORE replaces the CPU
hints (pause/yield mnemonic) by informing the operating system
(via `std::this_thread::yield()`) that the thread gives up its time slice, so
that the operating system switches to another thread.

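The CPU hint can be sketched like this (an illustrative fragment, not
Boost.Fiber's exact code; `_mm_pause` is the x86 intrinsic for the `pause`
mnemonic, and the portable fallback yields the thread as
BOOST_FIBERS_SPIN_SINGLE_CORE would):

```cpp
#include <atomic>
#include <thread>
#if defined(__x86_64__) || defined(_M_X64)
#  include <immintrin.h>   // _mm_pause
#endif

// Relax the CPU inside a spin-wait loop.
inline void cpu_relax() {
#if defined(__x86_64__) || defined(_M_X64)
    _mm_pause();                 // hint: we are in a spin-wait loop
#else
    std::this_thread::yield();   // portable fallback: give up the time slice
#endif
}

// Busy wait until another thread sets the flag.
void spin_until(std::atomic<bool>& flag) {
    while (!flag.load(std::memory_order_acquire)) {
        cpu_relax();
    }
}
```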
[heading Exponential back-off]

The macro BOOST_FIBERS_RETRY_THRESHOLD determines how many times the CPU
iterates in the spin-wait loop before yielding the thread or blocking in
futex-wait.
The spinlock tracks how many times the thread failed to acquire the lock.
The higher the contention, the longer the thread should back off.
A ["Binary Exponential Backoff] algorithm together with a randomized contention
window is utilized for this purpose.
BOOST_FIBERS_CONTENTION_WINDOW_THRESHOLD determines the upper limit of the
contention window (expressed as the exponent for the basis of two).

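The scheme can be sketched as follows (illustrative, not Boost.Fiber's exact
code): after each failed acquisition the contention window doubles, capped at
2^max_exponent, and the thread then spins for a random number of iterations
drawn from that window so that competing threads do not retry in lock-step.

```cpp
#include <cstdint>
#include <random>

// Binary exponential back-off with a randomized contention window.
// collisions:   how many times this thread failed to acquire the lock
// max_exponent: upper limit of the window, as an exponent for the basis of two
std::uint64_t backoff_iterations(std::uint32_t collisions,
                                 std::uint32_t max_exponent,
                                 std::minstd_rand& rng) {
    std::uint32_t exponent =
        collisions < max_exponent ? collisions : max_exponent;
    std::uint64_t window = std::uint64_t{ 1 } << exponent;  // 2^exponent
    std::uniform_int_distribution<std::uint64_t> dist(0, window - 1);
    return dist(rng);  // randomization avoids lock-step retries
}
```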
[heading Speculative execution (hardware transactional memory)]

Boost.Fiber uses spinlocks to protect critical regions that can be used
together with transactional memory (see section [link speculation Speculative
execution]).

[note TSX is enabled if property `htm=tsx` is specified at the b2 command line
and `BOOST_USE_TSX` is applied to the compiler.]

[note A TSX transaction will be aborted if the floating-point state is modified
inside a critical region. As a consequence, floating-point operations, e.g.
store/load of floating-point related registers during a fiber (context) switch,
are disabled.]

[heading NUMA systems]

Modern multi-socket systems are usually designed as [link numa NUMA systems].
A suitable fiber scheduler like __numa_work_stealing__ reduces
remote memory access (latency).

[heading Parameters]

[table Parameters that might be defined at the compiler's command line
    [
        [Parameter]
        [Default value]
        [Effect on Boost.Fiber]
    ]
    [
        [BOOST_FIBERS_NO_ATOMICS]
        [-]
        [no multithreading support, all atomics removed, no synchronization
        between fibers running in different threads]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_STD_MUTEX]
        [-]
        [`std::mutex` used inside spinlock]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS]
        [+]
        [spinlock with test-test-and-set on a shared variable]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS_ADAPTIVE]
        [-]
        [spinlock with test-test-and-set on a shared variable, adaptive
        retries while busy waiting]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS_FUTEX]
        [-]
        [spinlock with test-test-and-set on a shared variable, suspend on
        futex after a certain number of retries]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS_ADAPTIVE_FUTEX]
        [-]
        [spinlock with test-test-and-set on a shared variable, adaptive
        retries while busy waiting, suspend on futex after a certain number
        of retries]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS + BOOST_USE_TSX]
        [-]
        [spinlock with test-test-and-set and speculative execution
        (Intel TSX required)]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS_ADAPTIVE + BOOST_USE_TSX]
        [-]
        [spinlock with test-test-and-set on a shared variable, adaptive
        retries while busy waiting and speculative execution
        (Intel TSX required)]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS_FUTEX + BOOST_USE_TSX]
        [-]
        [spinlock with test-test-and-set on a shared variable, suspend on
        futex after a certain number of retries and speculative execution
        (Intel TSX required)]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS_ADAPTIVE_FUTEX + BOOST_USE_TSX]
        [-]
        [spinlock with test-test-and-set on a shared variable, adaptive
        retries while busy waiting, suspend on futex after a certain number
        of retries and speculative execution (Intel TSX required)]
    ]
    [
        [BOOST_FIBERS_SPIN_SINGLE_CORE]
        [-]
        [on single-core machines with multiple threads, yield the thread
        (`std::this_thread::yield()`) after collisions]
    ]
    [
        [BOOST_FIBERS_RETRY_THRESHOLD]
        [64]
        [max number of retries while busy spinning, then the fallback is
        used]
    ]
    [
        [BOOST_FIBERS_CONTENTION_WINDOW_THRESHOLD]
        [16]
        [max size of the contention window, expressed as the exponent for
        the basis of two]
    ]
    [
        [BOOST_FIBERS_SPIN_BEFORE_SLEEP0]
        [32]
        [max number of retries that relax the processor before the thread
        sleeps for 0s]
    ]
    [
        [BOOST_FIBERS_SPIN_BEFORE_YIELD]
        [64]
        [max number of retries where the thread sleeps for 0s before
        yielding the thread (`std::this_thread::yield()`)]
    ]
]

[endsect]