May 10, 2016 by Ed Schouten
Welcome back to the last article of our three-part series on how we managed to port CloudABI over to Mac OS X. Today I’m going to discuss how thread-local storage (TLS) works. I’ll describe how it works on ELF-based systems in general, followed by explaining how CloudABI executables and the emulator for Mac OS X implement this. But first, a short history lesson.
Back in the old days, the only way to execute multiple tasks at the
same time on UNIX was to fork your process.
Every task (process) would run in its own address space. Only in 1997,
with the release of the Single UNIX Specification v2 (SUSv2), did
UNIX gain support for multi-threading, allowing you to create multiple
threads of execution that share the same address space, using
the POSIX threads API.
What was problematic about adding multi-threading to an already existing
system was that many programming interfaces had been developed that
were not thread-safe. For example, functions like strtok()
were designed to keep track of state through global variables, instead
of storing it in space provided by the caller. This was corrected by
adding new variants of them to the standard, such as
strtok_r(). CloudABI’s C library only implements the latter.
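To illustrate what storing state in space provided by the caller means, here is a minimal sketch using strtok_r(). The helper name count_words() is hypothetical and only serves the example:

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <string.h>

/* Counts the space-separated words in a writable string.
 * Note that strtok_r() overwrites delimiters with NUL bytes,
 * so the input buffer is modified. */
size_t count_words(char *text) {
  char *state; /* Tokenizer state lives in the caller, not in a global. */
  size_t count = 0;
  for (char *word = strtok_r(text, " ", &state); word != NULL;
       word = strtok_r(NULL, " ", &state))
    ++count;
  return count;
}
```

Because every caller owns its own state pointer, two threads can tokenize different strings at the same time without interfering with each other.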
Another problematic instance of global state was the errno variable,
which was traditionally declared as
a global variable in <errno.h>.
To make access to
errno thread-safe, SUSv2 changed its description
to allow it to be declared as a macro. On FreeBSD,
you see that using
errno now generates a function call that fetches
the address of a per-thread instance. This way of having a
per-thread variable instance is in no way exclusive to
errno. You can
create your own per-thread variables using pthread_key_create(),
whose instance for the current thread can then be accessed through the resulting key.
GCC 3.3 added support for declaring per-thread variables directly, using
the __thread keyword. The advantage of __thread over
pthread_key_create() is that it’s a
lot less laborious to use. The disadvantage of
__thread is that every
per-thread variable needs to be allocated for every running thread,
regardless of whether it is actually used. This is because
assignments to them must not fail due to a lack of memory, which may
happen when storing a value under a key created with pthread_key_create().
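The key-based approach can be sketched as follows. This is only an illustration: the names counter_key, counter_key_init() and next_id() are hypothetical, and the error handling is deliberately minimal. It shows both why this interface is more laborious and where an out-of-memory failure can occur:

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t counter_key;
static pthread_once_t counter_once = PTHREAD_ONCE_INIT;

static void counter_key_init(void) {
  /* free() is run on a thread's value when that thread exits.
   * The return value is ignored here to keep the sketch short. */
  pthread_key_create(&counter_key, free);
}

/* Increments and returns the calling thread's private counter,
 * or returns -1 if the per-thread instance could not be set up. */
int next_id(void) {
  pthread_once(&counter_once, counter_key_init);
  int *counter = pthread_getspecific(counter_key);
  if (counter == NULL) {
    /* First use on this thread: the instance is allocated lazily. */
    counter = calloc(1, sizeof(*counter));
    if (counter == NULL)
      return -1;
    if (pthread_setspecific(counter_key, counter) != 0) {
      free(counter);
      return -1;
    }
  }
  return ++*counter;
}
```

With a compiler-supported per-thread variable, all of this collapses into a single declaration and a plain increment, at the cost of allocating the variable for every thread up front.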
In 2011, the C and C++ working groups released new versions of their
standards that provide support for TLS, similar to GCC’s __thread. In
C++11, the keyword is called thread_local.
In C11, it’s called _Thread_local,
but a macro in <threads.h>
allows you to use the C++ keyword in C code as well.
Let’s take a look at how a thread is capable of accessing its own TLS variables. We’ll first look into how this works on ARM64, as it is in my opinion slightly easier to understand than x86-64. Consider the following simple C program that does nothing more than assign a value to a variable stored in TLS:
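A minimal sketch of such a program, consistent with the symbol table and disassembly shown further down, might look as follows. The variable name i and the value 1234 are taken from those listings; declaring the variable with C11’s _Thread_local (rather than, say, GCC’s __thread) is an assumption:

```c
/* tls.c: assigns a value to a variable stored in TLS. */

/* Assumption: the variable is declared with C11's _Thread_local. */
_Thread_local int i;

int main(void) {
  i = 1234;
  return 0;
}
```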
When this piece of code is compiled and linked, the linker comes up with
a fixed, contiguous layout in which all of the thread-local variables in
the executable are stored. If we take a look at the symbol table of our
executable using the
readelf utility, we can see that the linker
decided to place our variable
i at offset 0x14 (decimal: 20) relative
to the starting address of the buffer containing the thread’s TLS
variables:
    $ aarch64-unknown-cloudabi-cc -O2 -o tls tls.c
    $ aarch64-unknown-cloudabi-readelf -s tls
       Num:    Value          Size Type    Bind   Vis      Ndx Name
    ...
       290: 0000000000000014     4 TLS     GLOBAL DEFAULT   11 i
    ...
Below is the machine code of our
main() function, to which I’ve added
some comments on the right:
    $ aarch64-unknown-cloudabi-objdump -d tls
    ...
    main:
        4e08: 48 d0 3b d5   mrs  x8, TPIDR_EL0   # Get TLS base address.
        4e0c: 08 01 00 91   add  x8, x8, #0      # Add offset(i) / 4096.
        4e10: 08 91 00 91   add  x8, x8, #36     # Add offset(i) % 4096.
        4e14: 49 9a 80 52   movz w9, #0x4d2      # Load value 1234.
        4e18: e0 03 1f 2a   mov  w0, wzr         # Set return value to zero.
        4e1c: 09 01 00 b9   str  w9, [x8]        # Store 1234 in i.
        4e20: c0 03 5f d6   ret                  # Return.
    ...
On ARM64, the base address of the TLS area is stored in a dedicated
register, TPIDR_EL0,
which is loaded into the 64-bit register
x8 by the first instruction. As
ARM only allows 12-bit integer immediate values in add instructions,
the load is followed by two
add instructions for adding the TLS
variable’s offset to
x8. The first instruction adds multiples of 2¹² =
4096 bytes to the base address, while the second instruction is used
to add the remainder. This means that using this construct, we can
declare up to 2²⁴ bytes = 16 MB of TLS variables.
If you’ve been paying attention, you may have noticed that the machine
code above adds a different offset to the TLS base address than what was
reported by readelf (36 vs. 20). The reason for this is that the
linker keeps the first 16 bytes of the TLS area, called the Thread
Control Block (TCB), reserved. The operating system can use the TCB to
store additional thread-specific information, such as a reference to its
pthread_t. This approach is typically called ‘TLS Variant I’.
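The arithmetic behind the two add instructions can be written out as a small sketch. The names TCB_SIZE, add_pair and split_tls_offset() are hypothetical; the 16-byte TCB and the 12-bit split come from the description above:

```c
#include <stdint.h>

#define TCB_SIZE 16u /* Bytes reserved at the start of the TLS area. */

struct add_pair {
  uint32_t high; /* Immediate of the first add, in units of 4096 bytes. */
  uint32_t low;  /* Immediate of the second add, in bytes. */
};

/* Turns the offset of a TLS symbol, as printed by readelf, into the two
 * 12-bit immediates that are added to the TLS base address. The total
 * offset is assumed to be smaller than 2^24 bytes. */
struct add_pair split_tls_offset(uint32_t symbol_value) {
  uint32_t offset = TCB_SIZE + symbol_value;
  struct add_pair pair = {offset >> 12, offset & 0xfff};
  return pair;
}
```

For our variable i, split_tls_offset(0x14) yields a high part of 0 and a low part of 36, matching the immediates in the disassembly.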
Figure 1: TLS Variant I on ARM64
On x86-64, we see that TLS works similarly to how it’s done on ARM64, with two noticeable differences. First of all, x86 predates the concept of thread-local storage, meaning it never gained a dedicated register for this purpose. You therefore see that most operating systems reuse one of the x86 segment registers for this purpose. ELF-based systems and 32-bit Windows use segment register FS, while Mac OS X and 64-bit Windows use GS.
A downside of using segment registers for this purpose is that the base
address is opaque to userspace programs. This means that if FS or GS
were to be used directly, you could do loads and stores on TLS
variables, but not create references to them (i.e.,
&i). To solve
this, most ELF-based systems add an extra level of indirection, where
index zero of the memory segment is a pointer to the TCB.
Windows has given a meaning to some of the other indices,
but most other systems have not. Modern CPUs now have
instructions to read the FS/GS base directly,
but they need to be enabled by the operating system explicitly, which is
typically not the case.
The second difference compared to ARM64 is that in order to remove the 16-byte size limitation on the TCB, TLS variables are placed before the TCB. They can be addressed by subtracting their offset from the base address. This approach, typically called ‘TLS Variant II’, allows an operating system to make the TCB as big as it wants.
Figure 2: TLS Variant II on x86-64
Here’s what the machine code of our
main() function looks like on
x86-64. The first instruction loads the address of the TCB stored in
%fs:0 into %rax. The second instruction subtracts the offset of
i from %rax and stores the desired value of 1234 at that location.
    $ x86_64-unknown-cloudabi-cc -O2 -fomit-frame-pointer -o tls tls.c
    $ x86_64-unknown-cloudabi-objdump -d tls
    ...
    main:
        7150: 64 48 8b ..   movq %fs:0, %rax          # Get the TLS base address.
        7159: c7 80 ec ..   movl $1234, -20(%rax)     # Store 1234 in i.
        7163: 31 c0         xorl %eax, %eax           # Set return value to zero.
        7165: c3            retq                      # Return.
    ...
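To make the Variant II addressing more tangible, the following sketch simulates it in plain C. An ordinary byte buffer stands in for the thread’s memory, and a pointer parameter stands in for the FS base. TLS_SIZE, thread_area_init() and store_i() are hypothetical names, and the 32-byte size of the TLS area is an arbitrary assumption:

```c
#include <stdint.h>
#include <string.h>

#define TLS_SIZE 32 /* Assumed size of the TLS variables before the TCB. */

/* Lays out TLS_SIZE bytes of TLS variables followed by a TCB whose first
 * word points to the TCB itself. The buffer must hold at least
 * TLS_SIZE + sizeof(void *) bytes. Returns the address of the TCB, which
 * is what the FS base would point to. */
unsigned char *thread_area_init(unsigned char *area) {
  unsigned char *tcb = area + TLS_SIZE;
  memset(area, 0, TLS_SIZE);
  memcpy(tcb, &tcb, sizeof(tcb));
  return tcb;
}

/* Mirrors the two instructions of main(). */
void store_i(const unsigned char *fs_base) {
  unsigned char *rax;
  int32_t value = 1234;
  memcpy(&rax, fs_base, sizeof(rax));      /* movq %fs:0, %rax */
  memcpy(rax - 20, &value, sizeof(value)); /* movl $1234, -20(%rax) */
}
```

After calling store_i() with the pointer returned by thread_area_init(), the value 1234 can be found 20 bytes before the TCB, which is where the linker placed i.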
Now that we’ve seen how programs can access TLS variables, let’s discuss how CloudABI sets up TLS in such a way that it’s friendly towards emulation, focussing on x86-64.
When a CloudABI executable is run in a userspace emulator, like on Mac OS X, we often need to switch back and forth between code that is part of the CloudABI executable and native code that’s part of the emulator. This happens when the CloudABI executable invokes a system call, for example. It’s quite important that we also switch between TLS areas in those cases, as we wouldn’t want the emulator to accidentally overwrite the TLS variables of the executable, or vice versa. As system calls tend to happen a lot, it is critical that this switch is done as efficiently as possible.
What you see on operating systems like Linux and FreeBSD is that
programs want to have full ownership over the FS base address. When a
program or additional thread starts up, it allocates and initializes a
TLS area and a memory segment to point to the TLS area. It then updates
the FS base to point to the memory segment, so that
%fs:0 points to
the TLS area. What is problematic about this approach is that because
the FS base address is not modifiable by userspace programs, the program must
invoke a special system call
to let the kernel do this on its behalf. If a CloudABI
executable would also want to have full control over the FS base
address, we would need to invoke this system call every time the
emulator switches execution between the CloudABI executable and the
emulator, which would be horrible from a performance point-of-view.
To solve this, CloudABI programs no longer try to adjust the FS base
address. Instead, they only rely on having an
%fs:0 to which they can
freely write. They simply assume that the kernel or emulator has
performed the necessary steps to set up the FS base properly. The
emulator now only needs to switch the value of
%fs:0, instead of the
FS base address.
The next problem to solve is where the emulator can store the original
%fs:0, so that it can restore the host system’s TLS area when
entering a system call. This has been solved by requiring that the
CloudABI program reserves a small part at the start of its TCB, which
the emulator can use as storage. The CloudABI program can safely replace
its own TLS area at any time, as long as it copies over this reserved
space to the new TLS area.
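The scheme can be modelled with a short sketch. It is not the emulator’s actual code: a global variable stands in for the memory word at %fs:0, and the structure layout and all function names are hypothetical:

```c
#include <stddef.h>

/* Hypothetical layout: the start of the guest's TCB is reserved for the
 * emulator. Guest-specific fields would follow. */
struct guest_tcb {
  void *saved_host_tcb;
};

/* Stand-in for the memory word at %fs:0. */
static void *fs0;

/* Emulator: hand control to the CloudABI program. */
void enter_guest(struct guest_tcb *guest) {
  guest->saved_host_tcb = fs0;
  fs0 = guest;
}

/* Emulator: the program made a system call, so switch back to the host's
 * TLS area. Returns the guest's TCB so that it can be re-entered later. */
struct guest_tcb *enter_syscall(void) {
  struct guest_tcb *guest = fs0;
  fs0 = guest->saved_host_tcb;
  return guest;
}

/* CloudABI program: install a new TCB, copying over the reserved space. */
void guest_replace_tcb(struct guest_tcb *replacement) {
  struct guest_tcb *current = fs0;
  replacement->saved_host_tcb = current->saved_host_tcb;
  fs0 = replacement;
}
```

Every switch is just a couple of loads and stores in userspace. No system call is needed, which is the point of the design.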
To make this a bit easier to understand, I’ve made a couple of illustrations that demonstrate the TLS layout both at program startup and after startup has finished:
Figure 3: TLS layout as set up by the emulator at process startup
Figure 4: TLS layout after being resized by the CloudABI program
While I was trying to get the CloudABI emulator to work on Mac OS X, I realized that the approach that I sketched above still had one flaw: it doesn’t work if the FS base address is set to an invalid address and cannot be modified. It turns out that this is the case on Mac OS X, which uses GS for thread-local storage. The XNU kernel doesn’t seem to expose any API for adjusting the FS base address.
To work around this without taking a large performance hit, we’ve implemented a hack that allows us to translate any access to FS to use GS on the fly. It works by setting up a signal handler for segmentation faults. Inside this signal handler, we test whether the instruction that caused the segmentation fault was trying to access FS. This is fairly easy to test, as instructions that use FS always start with a 0x64 byte. If a matching instruction is found, we simply alter the instruction to start with 0x65 instead and return from the signal handler, causing the execution of the instruction to be restarted. This time it will access GS instead of FS.
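The core of the patching step can be sketched as follows. The function name patch_fs_to_gs() is hypothetical. A real signal handler would first have to obtain the address of the faulting instruction from the signal context, which is left out here:

```c
#include <stdbool.h>
#include <stdint.h>

#define PREFIX_FS 0x64 /* FS segment override prefix. */
#define PREFIX_GS 0x65 /* GS segment override prefix. */

/* Rewrites an instruction that starts with an FS prefix to use GS instead.
 * Returns true if the instruction was patched and can be restarted, or
 * false if the fault was caused by something else. */
bool patch_fs_to_gs(uint8_t *instruction) {
  if (instruction[0] != PREFIX_FS)
    return false;
  instruction[0] = PREFIX_GS;
  return true;
}
```

Applied to the first instruction of our main() function, whose encoding starts with 64 48 8b, this turns the load from %fs:0 into a load from %gs:0.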
What is nice about this approach is that it only causes a signal to be
generated the first time a specific piece of code is executed. Once the
code has been patched up, it will access GS immediately and no longer
cause segmentation faults. The downside is that
all executable memory must now also be mapped for writing.
Our goal is to have this hack removed from the emulator as soon as Mac
OS X gains support for adjusting the FS base address.
This brings us to the end of this article, and the end of this series, in which I’ve tried to highlight some of the technicalities of getting CloudABI to work on Mac OS X. If you’re interested in knowing more about how thread-local storage works, especially in combination with shared libraries, I can strongly recommend Ulrich Drepper’s extensive paper on the matter.
In the meantime, stay tuned for my next article, in which I’ll likely talk about how multi-threading and locking primitives (mutexes, condition variables and semaphores) are implemented.