June 22, 2016 by Ed Schouten
It goes without saying that CloudABI supports running multi-threaded code. Like most UNIX-like systems out there, CloudABI implements the POSIX threads API, or pthreads for short. In today’s article, let’s focus on a single, yet crucial aspect of this API, namely synchronisation primitives (e.g., mutexes).
In this article we will discuss how the synchronisation primitives provided by the POSIX threads API are implemented on operating systems like Linux and FreeBSD. Later on we’ll see how CloudABI implements these and how that differs from other systems, making use of something that I call hierarchical futexes. Let’s start off with an overview of the POSIX threads API.
The most commonly used synchronisation primitive provided by the POSIX
threads API is the mutex, having type
pthread_mutex_t. By default, mutexes are not recursive,
nor do they perform any priority inheritance.
These features can be enabled separately by using the
pthread_mutexattr_*() functions to alter the mutex’s initialisation
In essence, it is possible to perform two operations on these mutexes, namely to lock and unlock them. A mutex may only be unlocked by the thread that has it locked (its owner). The API for these two operations is as follows (disregarding features such as timeouts, try-locking, etc.):
In addition to plain mutexes, the POSIX threads API also provides
support for readers-writer locks,
pthread_rwlock_t. Where these locks differ from mutexes is
that they perform better for protecting shared datastructures that are
only modified infrequently. They either allow multiple readers or a
single writer to acquire the lock, hence the name.
There are some features that readers-writer locks lack in comparison to plain mutexes. Though they can be acquired for reading recursively, there is no support for write recursion. Implementations are also not required to implement any priority inheritance scheme, meaning that starvation may occur easily. For this reason, write locks should preferably only be acquired infrequently and as briefly as possible.
Readers-writer locks have three operations, namely read-locking, write-locking and unlocking. The unlocking function automatically determines whether the lock should be unlocked for reading or writing. The API for these three operations is as follows:
Another important primitive provided by the POSIX threads API is
pthread_cond_t, which implements a condition variable.
Condition variables may be used by a thread to temporarily release a
mutex to wait for an event to occur. Other threads can signal the
condition variable to wake up a single waiter. It is also possible to
wake up all waiters at once by broadcasting. POSIX requires that if a
condition variable has multiple waiters, all of them must have released
the same mutex. Its API is as follows:
The POSIX threads API also provides some additional synchronisation primitives which we will only describe briefly:
pthread_barrier_t: a barrier, which can be used to let threads in a thread pool wait for each other’s completion.
pthread_once_t: an initialisation guard, which can be used to execute a callback function exactly once.
pthread_spinlock_t: a spinlock. Locking a busy mutex will cause the thread to suspend its execution and switch to another thread. Spinlocks, however, will cause the calling thread to spin on the lock instead. Spinlocks should generally be avoided, as they can be quite wasteful in case the spinlock doesn’t become available quickly.
notably missing from the POSIX threads API. An implementation of these
is provided by a different header file,
though there is little use for it in practice. Combining mutexes and
condition variables provides the same functionality and is typically
easier to use and understand.
An interesting feature of the POSIX threads API is that it also provides
support for using these synchronisation primitives across multiple
processes by placing them in shared memory.
This feature has to be enabled explicitly by setting the
attribute of every shared instance to
shared instances may be more expensive to use than private instances
(which are accessed only by threads belonging to the same process).
UNIX-like operating systems traditionally didn’t provide any support for
creating multiple kernel-level threads that share the same address
space. Initial implementations of the POSIX threads API (e.g., FreeBSD’s
libc_r) were designed to work entirely in userspace, having the
limitation that threads in a process cannot be scheduled across multiple
CPUs concurrently. Such an approach, called
meant that the kernel was oblivious of userspace synchronisation
primitives, as scheduling threads within a process was a responsibility
of the threading library.
Over the years, both Linux and FreeBSD switched to threading libraries that create full kernel-level threads for every user thread, a model that is called 1:1 threading. As the kernel is now responsible for scheduling individual threads, it now needs to have some awareness of synchronisation primitives. It needs to know which threads are blocked on locks, so their execution may be suspended. Once a contended lock is released, it also needs to know which thread can be woken up to acquire the lock next.
To realise this, the Linux kernel implements a feature called futexes (‘fast userspace mutexes’). What is novel about futexes is that they only require the kernel to do bookkeeping in case of lock contention. In the uncontended case, they exist solely in userspace and can be locked and unlocked by programs without performing any system calls, making them very fast. FreeBSD implements a similar feature, called umtx.
The code below shows how the futex API can be used to implement a very
simple mutex. The mutex is modeled as a tri-state, either being unlocked
UNLOCKED), locked without any threads waiting (
LOCKED) or locked
with threads waiting (
CONTENDED). In the uncontended case, the
my_mutex_lock() function switches the lock’s value from
LOCKED. In the contended case, it switches the lock’s value from
CONTENDED and calls into the kernel to put the thread to
sleep, using futex operation
FUTEX_WAIT. When that happens, the next
my_mutex_unlock() will use futex operation
wake up any sleeping threads.
To prevent races that could cause threads to miss wakeups that happen
after setting the lock to
CONTENDED and calling into the kernel, the
kernel double-checks the value of the futex right before going to sleep.
If the mutex doesn’t have the value
CONTENDED, futex operation
FUTEX_WAIT will return immediately.
FUTEX_WAKE operations can also be used to
implement a condition variable. One way to implement it would be to
model it as a simple counter. When a thread wants to wait on the
condition variable, it increments the counter to the next odd value and
FUTEX_WAIT. The broadcast function tests whether the counter has
an odd value. If so, it resets it to the next even value and wakes up
any waiting threads using
Now that we’ve looked at a very basic way of using futexes to implement our own synchronisation primitives, let’s discuss a couple of improvements that have been made to Linux’s futex system over time.
In order to provide reliable support for priority inheritance, the
kernel must know which thread owns a locked mutex, so that that thread’s
priority can be temporarily increased to that of the waiting thread that
has the highest priority. This is of course impossible with the simple
tri-state that we used above. To solve this, Linux added a couple of new
futex operations, called
FUTEX_WAKE, these operations require that the
futex variable is used in a specified way. When unlocked, the variable
contains the value zero. When locked, it contains the thread ID of the
owner, allowing the kernel to extract it and increase that thread’s
priority if needed. The top bit of the futex variable indicates whether
there are threads waiting. It may only be set by using
Once set, the lock may only be released using
The condition variable implementation that we’ve seen above has the nasty side-effect that if a condition variable has multiple waiters, a broadcast will wake them up all at the same time, only to discover that only a single thread can progress, as they all need to reacquire the same mutex. This is often called the thundering herd problem.
To prevent this from happening, it makes more sense to wake up threads
one by one, only scheduling them after the previous thread has unlocked
the mutex. This can be accomplished by using operations
FUTEX_CMP_REQUEUE, which can be used to wake up a single thread
and move the remainder from one futex (the condition variable) to
another (the mutex).
Requeueing threads to a mutex that uses priority inheritance
is supported as of Linux 2.6.31, using yet another set of futex
While designing CloudABI’s threading library, I reviewed existing implementations and was amazed by how over-engineered these seem to be. For example, Linux’s concept of building all synchronisation primitives on top of abstract wait and wake operations may have seemed sensible at the time, but practice has shown that this model adds a lot more complexity to the userspace threading library. Glibc’s NPTL is approximately 20,000 lines of code in size, while still having a couple of glaring imperfections:
FUTEX_REQUEUEcan only requeue threads to futexes of which the caller knows the address. This means that
FUTEX_REQUEUEfor process-shared condition variables, as the mutex may not even be mapped in the caller’s address space.
pthread_rwlock_twith a plain mutex, meaning that all operations are effectively serialised. Write-locking and unlocking an uncontested readers-writer lock takes 50% more time than for a mutex. Read-locking an uncontested lock seems even slower than write-locking. This makes readers-writer locks very unattractive to use.
For CloudABI, I’ve therefore decided to follow a more pragmatic approach that doesn’t use the wait/wake model. Instead, we use futexes that closely match the style of the POSIX synchronisation primitives. CloudABI’s futexes are typed. Right now there are two different futex types:
a readers-writer lock, also perfectly usable as a mutex.
When unlocked, the lock has value
CLOUDABI_LOCK_UNLOCKED. The top
two bits store whether the lock has waiters
CLOUDABI_LOCK_KERNEL_MANAGED) and whether the lock is acquired for
reading or writing (
CLOUDABI_LOCK_WRLOCKED). When locked for
reading, the bottom 30 bits contain the number of readers. When locked
for writing, the bottom 30 bits contain the thread ID of the owner.
This futex type does not support write recursion, as this may easily be implemented in userspace, using a separate counter stored next to the futex.
a condition variable.
The condition variable has value
and only if there are no threads waiting to be woken up.
It is possible to perform three blocking kernel-level operations on them:
lock_rdlock(cloudabi_lock_t *lock): acquires a lock for reading.
lock_wrlock(cloudabi_lock_t *lock): acquires a lock for writing.
condvar(cloudabi_condvar_t *condvar, cloudabi_lock_t *lock): waits on a condition variable, while temporarily unlocking a lock that has been locked for writing.
There are two non-blocking operations:
lock_unlock(cloudabi_lock_t *lock): unlocks a lock that has been locked for writing.
condvar_signal(cloudabi_condvar_t *condvar, uint32_t nwaiters): wakes up a specifed number of threads waiting on a condition variable.
This API reveals why I’m calling this approach hierarchical. Instead of
treating futexes as individual queues of threads (like on other
systems), it allows the kernel to construct a directed graph of locks,
each having zero or more condition variables pointing to it. When
condvar_signal() to wake up threads waiting on a condition
variable, they can be requeued to the lock without needing to provide
its address. This approach works even when the condition variable is
You may wonder why there is no operation for unlocking a lock that has
been locked for reading. The reason for this is that this can always be
done from userspace, except when the lock has threads waiting and its
last read lock is to be released. In that case, userspace first upgrades
the lock to a write lock before calling into
not only simplifies the implementation of
lock_unlock(). It also
prevents spurious calls, as the lock’s reader count can no longer be
incremented after upgrading.
Below is a list of relevant source files that provide full insight in how CloudABI’s futexes work:
While working this article, I discovered that writing it in such a way that it is both accessible, informative and not too long is quite challenging. The reason for this is that the topic at hand is already pretty hard. Still, I hope that you enjoyed reading this article and that it provided you with some more insight in the mysterious world of locking.
Be sure to get in touch at email@example.com to let me know what you thought of this article. I’d love to hear your feedback!