Nuxi The CloudABI Development Blog

CloudABI for Mac OS X, part three: thread-local storage

May 10, 2016 by Ed Schouten

Welcome back to the last article of our three-part series on how we managed to port CloudABI over to Mac OS X. Today I’m going to discuss how thread-local storage (TLS) works. I’ll describe how it works on ELF-based systems in general, followed by explaining how CloudABI executables and the emulator for Mac OS X implement this. But first, a short history lesson.

The history of thread-local storage in C and C++

Back in the old days, the only way on UNIX to execute multiple tasks at the same time was to fork your process. Every task (process) would run in its own address space. Only in 1997, with the release of the Single UNIX Specification v2 (SUSv2), did UNIX gain support for multi-threading, allowing you to create multiple threads of execution that share the same address space, using pthread_create() declared in <pthread.h>.

What was problematic about adding multi-threading to an already existing system was that many programming interfaces had been developed that were not thread-safe. For example, functions like localtime() and strtok() were designed to keep track of state through global variables, instead of storing it in space provided by the caller. This was corrected by adding new variants of them to the standard, called localtime_r() and strtok_r(). CloudABI’s C library only implements these newer, thread-safe variants.

Another problematic instance of global state was the errno variable, which was traditionally declared as a global variable in <errno.h>. To make access to errno thread-safe, SUSv2 changed its description to allow it to be declared as a macro. On FreeBSD, you see that using errno now generates a function call to __error(), fetching the address of a per-thread instance. This way of having a per-thread variable instance is in no way exclusive to errno. You can create your own per-thread variables using pthread_key_create(), whose instance of the current thread can be accessed using pthread_getspecific() and pthread_setspecific().
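To make the pthread_key_create() approach a bit more concrete, here is a minimal sketch of a per-thread counter. The names counter_key, counter_get() and worker() are made up for this example. Note how the instance is allocated lazily, and how both the allocation and pthread_setspecific() can fail.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

static pthread_key_t counter_key;
static pthread_once_t counter_once = PTHREAD_ONCE_INIT;

// Creates the key exactly once. free() is registered as the destructor,
// which runs at thread exit for every thread holding a non-NULL value.
// Error handling for pthread_key_create() is omitted in this sketch.
static void counter_key_init(void) {
  pthread_key_create(&counter_key, free);
}

// Returns the calling thread's counter, allocating it on first use.
// Returns NULL if the allocation or pthread_setspecific() fails.
int *counter_get(void) {
  pthread_once(&counter_once, counter_key_init);
  int *counter = pthread_getspecific(counter_key);
  if (counter == NULL) {
    counter = calloc(1, sizeof(*counter));
    if (counter == NULL)
      return NULL;
    if (pthread_setspecific(counter_key, counter) != 0) {
      free(counter);
      return NULL;
    }
  }
  return counter;
}

// Thread entry point: increments this thread's counter three times
// and returns the resulting value, or -1 on failure.
void *worker(void *arg) {
  (void)arg;
  for (int i = 0; i < 3; i++) {
    int *counter = counter_get();
    if (counter == NULL)
      return (void *)(intptr_t)-1;
    ++*counter;
  }
  return (void *)(intptr_t)*counter_get();
}
```

A thread that runs worker() ends up with a counter of 3, while the counter of the thread that created it is untouched and still reads 0 on first access.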

GCC 3.3 added support for declaring per-thread variables directly, using the __thread keyword. The advantage of __thread over pthread_key_create() is that it’s a lot less laborious to use. The disadvantage of __thread is that every per-thread variable needs to be allocated for every running thread, regardless of whether it is actually used. This is because assignments to such variables must not fail due to a lack of memory, which is something that may happen with pthread_setspecific().

In 2011, the C and C++ working groups released new versions of their standards that provide support for TLS, similar to GCC’s __thread. In C++11, the keyword is called thread_local. In C11, it’s called _Thread_local, but a macro in <threads.h> allows you to use the C++ keyword in C code as well.

Accessing variables stored in TLS on ARM64

Let’s take a look at how a thread is capable of accessing its own TLS variables. We’ll first look into how this works on ARM64, as it is in my opinion slightly easier to understand than x86-64. Consider the following simple C program that does nothing more than assign a value to a variable stored in TLS:

_Thread_local int i;

int main() {
  i = 1234;
  return 0;
}

When this piece of code is compiled and linked, the linker comes up with a fixed, contiguous layout in which all of the thread-local variables in the executable are stored. If we take a look at the symbol table of our executable using the readelf utility, we can see that the linker decided to place our variable i at offset 0x14 (decimal: 20) relative to the starting address of the buffer containing the thread’s TLS variables.

$ aarch64-unknown-cloudabi-cc -O2 -o tls tls.c
$ aarch64-unknown-cloudabi-readelf -s tls
   Num:    Value          Size Type    Bind   Vis      Ndx Name
...
   290: 0000000000000014     4 TLS     GLOBAL DEFAULT   11 i
...

Below is the machine code of our main() function, to which I’ve added some comments on the right:

$ aarch64-unknown-cloudabi-objdump -d tls
...
main:
   4e08:   48 d0 3b d5   mrs    x8, TPIDR_EL0     # Get TLS base address.
   4e0c:   08 01 00 91   add    x8, x8, #0        # Add offset(i) / 4096.
   4e10:   08 91 00 91   add    x8, x8, #36       # Add offset(i) % 4096.
   4e14:   49 9a 80 52   movz   w9, #0x4d2        # Load value 1234.
   4e18:   e0 03 1f 2a   mov    w0, wzr           # Set return value to zero.
   4e1c:   09 01 00 b9   str    w9, [x8]          # Store 1234 in i.
   4e20:   c0 03 5f d6   ret                      # Return.
...

On ARM64, the base address of the TLS area is stored in a dedicated register called TPIDR_EL0, which is loaded into the 64-bit register x8 by the first instruction. As ARM only allows you to use 12-bit integer immediate values, the load is followed by two add instructions for adding the TLS variable’s offset to x8. The first instruction adds multiples of 2¹² = 4096 bytes to the base address, while the second instruction is used to add the remainder. This means that using this construct, we can declare up to 2²⁴ bytes = 16 MB of TLS variables.

If you’ve been paying attention, you may have noticed that the machine code above adds a different offset to the TLS base address than what was reported by readelf (36 vs. 20). The reason for this is that the linker keeps the first 16 bytes of the TLS area, called the Thread Control Block (TCB), reserved. The operating system can use the TCB to store additional thread-specific information, such as a reference to its own pthread_t. This approach is typically called ‘TLS Variant I’.

Figure 1: TLS Variant I on ARM64
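The arithmetic behind the two add instructions can be sketched in a few lines of C. This is only an illustration of the calculation described above, not code taken from the linker. The names TCB_SIZE, struct add_pair, split_tls_offset() and tls_address() are invented for this example.

```c
#include <stdint.h>

enum { TCB_SIZE = 16 };  // Size of the reserved TCB on ARM64.

struct add_pair {
  uint32_t high;  // Immediate of the first add, counted in units of 4096.
  uint32_t low;   // Immediate of the second add, the remainder.
};

// Splits the offset of a TLS symbol, as reported by readelf, into the
// two 12-bit immediates. The TCB size is added first, as the variables
// are placed right after the TCB.
struct add_pair split_tls_offset(uint32_t symbol_offset) {
  uint32_t total = TCB_SIZE + symbol_offset;
  struct add_pair pair = {
    .high = (total >> 12) & 0xfff,
    .low = total & 0xfff,
  };
  return pair;
}

// Computes the address that ends up in x8, given the value of TPIDR_EL0.
uint64_t tls_address(uint64_t tpidr_el0, struct add_pair pair) {
  return tpidr_el0 + ((uint64_t)pair.high << 12) + pair.low;
}
```

For our variable i at offset 0x14, split_tls_offset() yields a high part of 0 and a low part of 36, matching the immediates in the disassembly.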

Accessing variables stored in TLS on x86-64

On x86-64, TLS works similarly to how it’s done on ARM64, with two noticeable differences. First of all, x86 predates the concept of thread-local storage, meaning it never gained a dedicated register for it. Most operating systems therefore reuse one of the x86 segment registers instead. ELF-based systems and 32-bit Windows use segment register FS, while Mac OS X and 64-bit Windows use GS.

A downside of using segment registers for this purpose is that the base address is opaque to userspace programs. This means that if FS or GS were to be used directly, you could do loads and stores on TLS variables, but not create references to them (i.e., &i). To solve this, most ELF-based systems add an extra level of indirection, where index zero of the memory segment is a pointer to the TCB. Windows has given a meaning to some of the other indices, but most other systems have not. Modern CPUs now have instructions to read the FS/GS base directly, but they need to be enabled by the operating system explicitly, which is typically not the case.

The second difference compared to ARM64 is that in order to remove the 16-byte size limitation on the TCB, TLS variables are placed before the TCB. They can be addressed by subtracting their offset from the base address. This approach, typically called ‘TLS Variant II’, allows an operating system to make the TCB as big as it wants.

Figure 2: TLS Variant II on x86-64

Here’s what the machine code of our main() function looks like on x86-64. The first instruction loads the address of the TCB stored in %fs:0 into %rax. The second instruction subtracts the offset of i from %rax and stores the desired value of 1234 at that location.

$ x86_64-unknown-cloudabi-cc -O2 -fomit-frame-pointer -o tls tls.c
$ x86_64-unknown-cloudabi-objdump -d tls
...
main:
   7150:   64 48 8b ..   movq   %fs:0, %rax       # Get the TLS base address.
   7159:   c7 80 ec ..   movl   $1234, -20(%rax)  # Store 1234 in i.
   7163:   31 c0         xorl   %eax, %eax        # Set return value to zero.
   7165:   c3            retq                     # Return.
...
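The combination of the self-pointer at %fs:0 and the negative offsets can be modelled in plain C. The sketch below is a simulation under assumptions: it uses an ordinary struct instead of a real segment register, and TLS_SIZE, struct tcb, struct thread_block and the helper functions are all names invented for this example.

```c
#include <stddef.h>
#include <string.h>

enum { TLS_SIZE = 32 };  // Hypothetical total size of the TLS variables.

struct tcb {
  void *self;  // Index zero of the segment: a pointer to the TCB itself.
};

struct thread_block {
  unsigned char tls[TLS_SIZE];  // TLS variables, placed before the TCB.
  struct tcb tcb;               // The FS base would point here.
};

// Clears the TLS variables and sets up the self-pointer.
void thread_block_init(struct thread_block *block) {
  memset(block->tls, 0, sizeof(block->tls));
  block->tcb.self = &block->tcb;
}

// Stand-in for "movq %fs:0, %rax": fetch the address of the TCB.
unsigned char *load_fs_0(const struct tcb *fs_base) {
  return fs_base->self;
}

// Stand-in for "movl $value, -offset(%rax)".
void tls_store_int(const struct tcb *fs_base, size_t offset, int value) {
  memcpy(load_fs_0(fs_base) - offset, &value, sizeof(value));
}

// Reads back an int stored at the given distance below the TCB.
int tls_load_int(const struct tcb *fs_base, size_t offset) {
  int value;
  memcpy(&value, load_fs_0(fs_base) - offset, sizeof(value));
  return value;
}
```

Calling tls_store_int() with an offset of 20 mirrors the store to -20(%rax) in the listing. Because load_fs_0() returns an ordinary pointer, taking the address of a TLS variable becomes a simple subtraction.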

Emulator-friendly TLS

Now that we’ve seen how programs can access TLS variables, let’s discuss how CloudABI sets up TLS in such a way that it’s friendly towards emulation, focussing on x86-64.

When a CloudABI executable is run in a userspace emulator, like on Mac OS X, we often need to switch back and forth between code that is part of the CloudABI executable and native code that’s part of the emulator. This happens when the CloudABI executable invokes a system call, for example. It’s quite important that we also switch between TLS areas in those cases, as we wouldn’t want the emulator to accidentally overwrite the TLS variables of the executable, or vice versa. As system calls tend to happen a lot, it is critical that this switch is done as efficiently as possible.

What you see on operating systems like Linux and FreeBSD is that programs want to have full ownership over the FS base address. When a program or additional thread starts up, it allocates and initializes a TLS area, along with a memory segment whose first slot points to that TLS area. It then updates the FS base to point to the memory segment, so that %fs:0 points to the TLS area. What is problematic about this approach is that because the FS base address is not modifiable by userspace programs, the program must invoke a special system call such as arch_prctl() or amd64_set_fsbase() to let the kernel do this on its behalf. If a CloudABI executable also wanted to have full control over the FS base address, the emulator would need to invoke this system call every time execution switches between the CloudABI executable and the emulator, which would be horrible from a performance point of view.

To solve this, CloudABI programs no longer try to adjust the FS base address. Instead, they only rely on having an %fs:0 to which they can freely write. They simply assume that the kernel or emulator has performed the necessary steps to set up the FS base properly. The emulator now only needs to switch the value of %fs:0, instead of the FS base address.

The next problem to solve is where the emulator can store the original value of %fs:0, so that it can restore the host system’s TLS area when entering a system call. This has been solved by requiring that the CloudABI program reserves a small part at the start of its TCB, which the emulator can use as storage. The CloudABI program can safely replace its own TLS area at any time, as long as it copies over this reserved space to the new TLS area.

To make this a bit easier to understand, I’ve made a couple of illustrations that demonstrate the TLS layout both at program startup and after startup has finished:

Figure 3: TLS layout as set up by the emulator at process startup

Figure 4: TLS layout after being resized by the CloudABI program
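The protocol between the emulator and the CloudABI program can also be expressed as a small model in C. This is a sketch under assumptions rather than the emulator’s actual code: struct segment stands in for the memory at %fs:0, and struct guest_tcb, enter_guest(), enter_host() and guest_replace_tcb() are hypothetical names. How the emulator remembers the program’s TCB while running native code is also an assumption; here it is simply returned to the caller.

```c
#include <stddef.h>

// Stand-in for the memory segment that the FS base points to.
struct segment {
  void *slot0;  // The value that %fs:0 would yield.
};

// TCB of the CloudABI program. The first field is the space reserved
// for the emulator.
struct guest_tcb {
  void *saved_host_slot;  // Original value of %fs:0 of the host.
  int guest_data;         // Placeholder for the program's own data.
};

// Emulator: switch to the TLS area of the CloudABI program, saving the
// host's value of %fs:0 in the reserved space.
void enter_guest(struct segment *seg, struct guest_tcb *guest) {
  guest->saved_host_slot = seg->slot0;
  seg->slot0 = guest;
}

// Emulator: on system call entry, restore the host's value of %fs:0.
// Returns the program's TCB, so that it can be switched back to later.
struct guest_tcb *enter_host(struct segment *seg) {
  struct guest_tcb *guest = seg->slot0;
  seg->slot0 = guest->saved_host_slot;
  return guest;
}

// CloudABI program: replace the TLS area, copying over the reserved
// space so that the emulator can still find the host's value.
void guest_replace_tcb(struct segment *seg, struct guest_tcb *new_tcb) {
  struct guest_tcb *old_tcb = seg->slot0;
  new_tcb->saved_host_slot = old_tcb->saved_host_slot;
  seg->slot0 = new_tcb;
}
```

Only a single pointer is written on every switch, and no system call is involved. Even after the program has called guest_replace_tcb(), a subsequent enter_host() still restores the host’s original value.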

The signal handler hack

While I was trying to get the CloudABI emulator to work on Mac OS X, I realized that the approach that I sketched above still had one flaw: it doesn’t work if the FS base address is set to an invalid address and cannot be modified. It turns out that this is the case on Mac OS X, which uses GS for thread-local storage. The XNU kernel doesn’t seem to expose any API for adjusting the FS base address.

To work around this without taking a large performance hit, we’ve implemented a hack that allows us to translate any access to FS to use GS on the fly. It works by setting up a signal handler for segmentation faults. Inside this signal handler, we test whether the instruction that caused the segmentation fault was trying to access FS. This is fairly easy to test, as instructions that use FS always start with a 0x64 byte. If a matching instruction is found, we simply alter the instruction to start with 0x65 instead and return from the signal handler, causing the execution of the instruction to be restarted. This time it will access GS instead of FS.

What is nice about this approach is that it only causes a signal to be generated the first time a specific piece of code is executed. Once the code has been patched up, it will access GS immediately and no longer cause segmentation faults. The downside is that all executable memory must now also be mapped for writing. Our goal is to have this hack removed from the emulator as soon as Mac OS X gains support for using WRFSBASE.
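The core of the patching step is tiny. Below is a minimal sketch of it that operates on an ordinary byte buffer. The function name patch_fs_to_gs() is invented for this example. A real signal handler would additionally have to obtain the address of the faulting instruction from the signal context, which is platform specific and left out here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum {
  PREFIX_FS = 0x64,  // Segment override prefix for FS.
  PREFIX_GS = 0x65,  // Segment override prefix for GS.
};

// Rewrites an instruction that starts with the FS prefix to use GS
// instead. Returns true if the instruction was patched and should be
// restarted, or false if the fault must have had another cause.
bool patch_fs_to_gs(uint8_t *instruction, size_t length) {
  if (length == 0 || instruction[0] != PREFIX_FS)
    return false;
  instruction[0] = PREFIX_GS;
  return true;
}
```

Applied to the first bytes of the movq instruction shown earlier (64 48 8b), this turns the leading 0x64 into 0x65 and leaves the rest untouched. A second call on the same bytes returns false, which reflects that patched code no longer needs any fixing up.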

Closing words

This brings us to the end of this article, and the end of this series, in which I’ve tried to highlight some of the technicalities of getting CloudABI to work on Mac OS X. If you’re interested in knowing more about how thread-local storage works, especially in combination with shared libraries, I can strongly recommend Ulrich Drepper’s extensive paper on the matter.

In the meantime, stay tuned for my next article, in which I’ll likely talk about how multi-threading and locking primitives (mutexes, condition variables and semaphores) are implemented.

Be sure to send an email to info@nuxi.nl or send me a message on Twitter at @EdSchouten to let me know what you thought of this article. I’d love to hear your feedback!