Nuxi The CloudABI Development Blog

CloudABI's POSIX threads support: hierarchical futexes

June 22, 2016 by Ed Schouten

It goes without saying that CloudABI supports running multi-threaded code. Like most UNIX-like systems out there, CloudABI implements the POSIX threads API, or pthreads for short. In today’s article, let’s focus on a single, yet crucial aspect of this API, namely synchronisation primitives (e.g., mutexes).

In this article we will discuss how the synchronisation primitives provided by the POSIX threads API are implemented on operating systems like Linux and FreeBSD. Later on we’ll see how CloudABI implements these and how that differs from other systems, making use of something that I call hierarchical futexes. Let’s start off with an overview of the POSIX threads API.

An overview of the POSIX synchronisation primitives

The most commonly used synchronisation primitive provided by the POSIX threads API is the mutex, having type pthread_mutex_t. By default, mutexes are not recursive, nor do they perform any priority inheritance. These features can be enabled separately by using the pthread_mutexattr_*() functions to alter the mutex’s initialisation attributes.

In essence, it is possible to perform two operations on these mutexes, namely to lock and unlock them. A mutex may only be unlocked by the thread that has it locked (its owner). The API for these two operations is as follows (disregarding features such as timeouts, try-locking, etc.):

int pthread_mutex_lock(pthread_mutex_t *);
int pthread_mutex_unlock(pthread_mutex_t *);

In addition to plain mutexes, the POSIX threads API also provides support for readers-writer locks, having type pthread_rwlock_t. Where these locks differ from mutexes is that they perform better for protecting shared datastructures that are only modified infrequently. They either allow multiple readers or a single writer to acquire the lock, hence the name.

There are some features that readers-writer locks lack in comparison to plain mutexes. Though they can be acquired for reading recursively, there is no support for write recursion. Implementations are also not required to implement any priority inheritance scheme, meaning that starvation may occur easily. For this reason, write locks should preferably only be acquired infrequently and as briefly as possible.

Readers-writer locks have three operations, namely read-locking, write-locking and unlocking. The unlocking function automatically determines whether the lock should be unlocked for reading or writing. The API for these three operations is as follows:

int pthread_rwlock_rdlock(pthread_rwlock_t *);
int pthread_rwlock_unlock(pthread_rwlock_t *);
int pthread_rwlock_wrlock(pthread_rwlock_t *);

Another important primitive provided by the POSIX threads API is pthread_cond_t, which implements a condition variable. Condition variables may be used by a thread to temporarily release a mutex to wait for an event to occur. Other threads can signal the condition variable to wake up a single waiter. It is also possible to wake up all waiters at once by broadcasting. POSIX requires that if a condition variable has multiple waiters, all of them must have released the same mutex. Its API is as follows:

int pthread_cond_broadcast(pthread_cond_t *);
int pthread_cond_signal(pthread_cond_t *);
int pthread_cond_wait(pthread_cond_t *, pthread_mutex_t *);

The POSIX threads API also provides some additional synchronisation primitives which we will only describe briefly:

Semaphores are notably missing from the POSIX threads API. An implementation of these is provided by a different header file, <semaphore.h>, though there is little use for it in practice. Combining mutexes and condition variables provides the same functionality and is typically easier to use and understand.

An interesting feature of the POSIX threads API is that it also provides support for using these synchronisation primitives across multiple processes by placing them in shared memory. This feature has to be enabled explicitly by setting the pshared attribute of every shared instance to PTHREAD_PROCESS_SHARED, as shared instances may be more expensive to use than private instances (which are accessed only by threads belonging to the same process).

Implementing these primitives efficiently: futexes

UNIX-like operating systems traditionally didn’t provide any support for creating multiple kernel-level threads that share the same address space. Initial implementations of the POSIX threads API (e.g., FreeBSD’s libc_r) were designed to work entirely in userspace, having the limitation that threads in a process cannot be scheduled across multiple CPUs concurrently. Such an approach, called N:1 threading, meant that the kernel was oblivious of userspace synchronisation primitives, as scheduling threads within a process was a responsibility of the threading library.

Over the years, both Linux and FreeBSD switched to threading libraries that create full kernel-level threads for every user thread, a model that is called 1:1 threading. As the kernel is now responsible for scheduling individual threads, it now needs to have some awareness of synchronisation primitives. It needs to know which threads are blocked on locks, so their execution may be suspended. Once a contended lock is released, it also needs to know which thread can be woken up to acquire the lock next.

To realise this, the Linux kernel implements a feature called futexes (‘fast userspace mutexes’). What is novel about futexes is that they only require the kernel to do bookkeeping in case of lock contention. In the uncontended case, they exist solely in userspace and can be locked and unlocked by programs without performing any system calls, making them very fast. FreeBSD implements a similar feature, called umtx.

The code below shows how the futex API can be used to implement a very simple mutex. The mutex is modeled as a tri-state, either being unlocked (UNLOCKED), locked without any threads waiting (LOCKED) or locked with threads waiting (CONTENDED). In the uncontended case, the my_mutex_lock() function switches the lock’s value from UNLOCKED to LOCKED. In the contended case, it switches the lock’s value from LOCKED to CONTENDED and calls into the kernel to put the thread to sleep, using futex operation FUTEX_WAIT. When that happens, the next call to my_mutex_unlock() will use futex operation FUTEX_WAKE to wake up any sleeping threads.

To prevent races that could cause threads to miss wakeups that happen after setting the lock to CONTENDED and calling into the kernel, the kernel double-checks the value of the futex right before going to sleep. If the mutex doesn’t have the value CONTENDED, futex operation FUTEX_WAIT will return immediately.

#include <linux/futex.h>
#include <sys/syscall.h>
#include <limits.h>
#include <stdatomic.h>
#include <unistd.h>

typedef enum {
  UNLOCKED = 0,  // Lock is currently unlocked.
  LOCKED = 1,    // Lock has been acquired.
  CONTENDED = 2, // Lock has been acquired and has waiting threads.
} my_mutex_t;

void my_mutex_init(my_mutex_t *mutex) { atomic_init(mutex, UNLOCKED); }

void my_mutex_lock(my_mutex_t *mutex) {
  my_mutex_t old = UNLOCKED;
  for (;;) {
    if (old == UNLOCKED) {
      // Lock is unlocked. Attempt to acquire it.
      if (atomic_compare_exchange_weak(mutex, &old, LOCKED))
    } else if (old == CONTENDED ||
               atomic_compare_exchange_weak(mutex, &old, CONTENDED)) {
      // Lock is locked and set to contended. Put thread to sleep.
      syscall(SYS_futex, mutex, FUTEX_WAIT, CONTENDED, NULL, NULL, 0);

void my_mutex_unlock(my_mutex_t *mutex) {
  // Release the lock. Wake up waiting threads if contended.
  if (atomic_exchange(mutex, UNLOCKED) == CONTENDED)
    syscall(SYS_futex, mutex, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);

The FUTEX_WAIT and FUTEX_WAKE operations can also be used to implement a condition variable. One way to implement it would be to model it as a simple counter. When a thread wants to wait on the condition variable, it increments the counter to the next odd value and calls FUTEX_WAIT. The broadcast function tests whether the counter has an odd value. If so, it resets it to the next even value and wakes up any waiting threads using FUTEX_WAKE.

typedef atomic_int my_cond_t;

void my_cond_init(my_cond_t *cond) { atomic_init(cond, 0); }

void my_cond_wait(my_cond_t *cond, my_mutex_t *mutex) {
  // Bump the counter to an odd value to indicate there are waiters.
  int v = atomic_fetch_or(cond, 1) | 1;

  // Wait for broadcast while having the lock released.
  syscall(SYS_futex, cond, FUTEX_WAIT, v, NULL, NULL, 0);

void my_cond_broadcast(my_cond_t *cond) {
  int v = atomic_load(cond);
  while (v % 2 != 0) {
    // Condition variable has waiters. Bump the counter to an even value
    // again and wake up the waiters.
    if (atomic_compare_exchange_weak(cond, &v, v + 1)) {
      syscall(SYS_futex, cond, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);

Now that we’ve looked at a very basic way of using futexes to implement our own synchronisation primitives, let’s discuss a couple of improvements that have been made to Linux’s futex system over time.

Improvement #1: priority inheritance support

In order to provide reliable support for priority inheritance, the kernel must know which thread owns a locked mutex, so that that thread’s priority can be temporarily increased to that of the waiting thread that has the highest priority. This is of course impossible with the simple tri-state that we used above. To solve this, Linux added a couple of new futex operations, called FUTEX_LOCK_PI and FUTEX_UNLOCK_PI.

Unlike FUTEX_WAIT and FUTEX_WAKE, these operations require that the futex variable is used in a specified way. When unlocked, the variable contains the value zero. When locked, it contains the thread ID of the owner, allowing the kernel to extract it and increase that thread’s priority if needed. The top bit of the futex variable indicates whether there are threads waiting. It may only be set by using FUTEX_LOCK_PI. Once set, the lock may only be released using FUTEX_UNLOCK_PI.

Improvement #2: solving the thundering herd problem

The condition variable implementation that we’ve seen above has the nasty side-effect that if a condition variable has multiple waiters, a broadcast will wake them up all at the same time, only to discover that only a single thread can progress, as they all need to reacquire the same mutex. This is often called the thundering herd problem.

To prevent this from happening, it makes more sense to wake up threads one by one, only scheduling them after the previous thread has unlocked the mutex. This can be accomplished by using operations FUTEX_REQUEUE and FUTEX_CMP_REQUEUE, which can be used to wake up a single thread and move the remainder from one futex (the condition variable) to another (the mutex). Requeueing threads to a mutex that uses priority inheritance is supported as of Linux 2.6.31, using yet another set of futex operations.

CloudABI’s hierarchical futexes

While designing CloudABI’s threading library, I reviewed existing implementations and was amazed by how over-engineered these seem to be. For example, Linux’s concept of building all synchronisation primitives on top of abstract wait and wake operations may have seemed sensible at the time, but practice has shown that this model adds a lot more complexity to the userspace threading library. Glibc’s NPTL is approximately 20,000 lines of code in size, while still having a couple of glaring imperfections:

For CloudABI, I’ve therefore decided to follow a more pragmatic approach that doesn’t use the wait/wake model. Instead, we use futexes that closely match the style of the POSIX synchronisation primitives. CloudABI’s futexes are typed. Right now there are two different futex types:

It is possible to perform three blocking kernel-level operations on them:

There are two non-blocking operations:

This API reveals why I’m calling this approach hierarchical. Instead of treating futexes as individual queues of threads (like on other systems), it allows the kernel to construct a directed graph of locks, each having zero or more condition variables pointing to it. When calling condvar_signal() to wake up threads waiting on a condition variable, they can be requeued to the lock without needing to provide its address. This approach works even when the condition variable is process-shared.

You may wonder why there is no operation for unlocking a lock that has been locked for reading. The reason for this is that this can always be done from userspace, except when the lock has threads waiting and its last read lock is to be released. In that case, userspace first upgrades the lock to a write lock before calling into lock_unlock(). This not only simplifies the implementation of lock_unlock(). It also prevents spurious calls, as the lock’s reader count can no longer be incremented after upgrading.

Below is a list of relevant source files that provide full insight in how CloudABI’s futexes work:

Closing words

While working this article, I discovered that writing it in such a way that it is both accessible, informative and not too long is quite challenging. The reason for this is that the topic at hand is already pretty hard. Still, I hope that you enjoyed reading this article and that it provided you with some more insight in the mysterious world of locking.

Be sure to get in touch at to let me know what you thought of this article. I’d love to hear your feedback!