Nuxi The CloudABI Development Blog

CloudABI for Mac OS X, part two: efficiently emulating system calls

April 28, 2016 by Ed Schouten

Welcome back to the second article of my three-part series on how we managed to port CloudABI over to Mac OS X. In the first blog post we looked at how the emulator is capable of loading Position Independent Executables into memory and starting their execution. Today we’re going to look at a different aspect of emulation, namely how the emulator can efficiently capture system calls that are performed by the program.

How do system calls work traditionally?

System calls are actions that the operating system executes on behalf of programs, because the programs cannot perform them on their own. For example, programs can easily perform string operations and mathematical operations (strlen(), sin()) without performing any system calls at all. Functions that perform disk and network I/O or manage the lifetime of processes (write(), sendmsg(), fork(), _Exit()) must typically be implemented using a system call.

What makes system calls different from regular function calls is that due to the memory management that modern operating systems perform, a system call cannot be started by doing a simple function call straight into the kernel. The kernel is not visible within the program’s address space. Instead, the C library provides tiny wrapper functions that use a special hardware instruction to force the CPU to switch to kernel mode. On my FreeBSD x86-64 system, such a wrapper function looks like this:

$ objdump -d /lib/libc.so.7
...
0000000000141700 <open>:
  141700:       b8 05 00 00 00          mov    $0x5,%eax
  141705:       49 89 ca                mov    %rcx,%r10
  141708:       0f 05                   syscall
  14170a:       0f 82 f4 5c 00 00       jb     147404 <cerror>
  141710:       c3                      retq
...

To allow the kernel to distinguish between system calls, the C library places a number that uniquely identifies a system call in %eax. For open(), it uses five, which matches up with the fifth entry in FreeBSD’s system call table. On systems that use the System V ABI for x86-64, the calling convention for system calls differs slightly from regular function calls in that the fourth argument is not stored in %rcx, but in %r10. The process switches to kernel mode using the syscall instruction.

Linux and the BSDs differ in the way the return value of a system call is encoded. On Linux, if a system call succeeds, %rax contains the return value of the system call. If a system call fails, %rax contains a negated error number (e.g., -EINVAL). To distinguish between these cases, the wrapper functions need to test whether the return value lies between -4095 and -1. The BSDs use a different approach, where the error number is not negated, but the carry flag is used to indicate whether the system call has failed. If the carry flag is set, the wrapper function jumps to cerror, which copies the error number to errno and makes the system call return -1.
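To make the two conventions concrete, here is a minimal C sketch of the decoding step that a wrapper has to perform once the syscall instruction returns. It is an illustration only: the names decode_linux_result() and decode_bsd_result() are made up for this example, and the carry flag is modelled as a plain integer argument, whereas real wrappers do this work in assembly.

```c
#include <assert.h>
#include <errno.h>

/*
 * Linux convention: a failing system call returns a negated error
 * number, so any raw value in the range [-4095, -1] denotes an error.
 */
long decode_linux_result(long raw) {
  if (raw >= -4095 && raw <= -1) {
    errno = (int)-raw;
    return -1;
  }
  return raw;
}

/*
 * BSD convention: the carry flag (modelled here as carry_set) signals
 * failure, and the raw value then holds the positive error number.
 */
long decode_bsd_result(long raw, int carry_set) {
  if (carry_set) {
    assert(raw > 0); /* Error numbers are positive. */
    errno = (int)raw;
    return -1;
  }
  return raw;
}
```

Both functions end up presenting the same interface to the caller: -1 with errno set on failure, and the untouched return value on success.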

When CloudABI was developed initially, we used an approach similar to Linux and the BSDs where the C library contains system call wrappers that make use of the syscall instruction. The downside of this approach is that without using complex emulation techniques like dynamic recompilation, there is no easy way to capture the use of this instruction. Any system calls performed by the program would get sent to the kernel of Mac OS X, instead of being directed to the emulator. To solve this, we now make use of something called virtual dynamically linked shared objects (vDSO).

Introducing the vDSO

About a decade ago, Linux added various optimizations to the way processes can obtain the time of day. For functions like time(), it makes little sense to switch to kernel mode for every individual call, as the timestamp returned by this function only changes once per second. Instead, the kernel can store a cached value of the time of day in a page of memory that programs can access directly when they have no strong requirements on the timestamp’s precision.

To prevent strong coupling between the C library and the layout of the shared page, the kernel also exposes a shared library to the process called the vDSO. This library contains a couple of functions such as __vdso_clock_gettime(), __vdso_gettimeofday() and __vdso_time() that can be used to extract the timestamp from the shared page. As long as programs only access the shared page through these functions, the layout of the shared page can be modified freely without breaking compatibility.
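The decoupling described above can be sketched in a few lines of C. This is a deliberately simplified model, not the layout that Linux actually uses: struct example_shared_page and example_vdso_time() are hypothetical names, and the page holds nothing but a single atomically updated seconds counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical layout of the shared page. Only the accessor below knows
 * about it, so fields could be added or reordered without affecting
 * programs that go through the accessor.
 */
struct example_shared_page {
  _Atomic int64_t seconds; /* Cached time of day, written by the kernel. */
};

/*
 * Stand-in for a function like __vdso_time(): it reads the cached
 * timestamp from the page without switching to kernel mode.
 */
int64_t example_vdso_time(struct example_shared_page *page) {
  assert(page != 0);
  return atomic_load_explicit(&page->seconds, memory_order_acquire);
}
```

A caller only ever sees example_vdso_time(), which is what allows the party that owns the page to change its layout freely.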

Programs make use of the vDSO by being linked against a library called linux-vdso.so.1. Of course, this library is nowhere to be found on your file system, as it is provided directly by the kernel. The memory address at which the vDSO is mapped into the address space of the program is stored in the AT_SYSINFO_EHDR entry of the auxiliary vector that is passed to the application on startup, so that the run-time linker can use it to resolve any calls to one of its symbols.
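As a rough sketch of the lookup step, the following C function walks an auxiliary vector until it finds the AT_SYSINFO_EHDR entry. To keep it portable, it defines its own entry type and constants instead of relying on system headers: struct ex_auxv, the EX_ prefixed names and find_vdso_base() are assumptions made for this example, with 33 being the value Linux assigns to AT_SYSINFO_EHDR. On a real Linux system with glibc, getauxval(AT_SYSINFO_EHDR) from <sys/auxv.h> returns the same information.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_AT_NULL 0          /* Terminates the auxiliary vector. */
#define EX_AT_SYSINFO_EHDR 33 /* Value used by Linux. */

/* Simplified auxiliary vector entry: a type tag and a value. */
struct ex_auxv {
  uint64_t a_type;
  uint64_t a_val;
};

/*
 * Returns the address at which the vDSO's ELF header is mapped, or 0
 * when the vector contains no AT_SYSINFO_EHDR entry.
 */
uint64_t find_vdso_base(const struct ex_auxv *auxv) {
  assert(auxv != NULL);
  for (; auxv->a_type != EX_AT_NULL; ++auxv) {
    if (auxv->a_type == EX_AT_SYSINFO_EHDR)
      return auxv->a_val;
  }
  return 0;
}
```

A return value of 0 corresponds to the situation where no vDSO was provided at all, which callers have to handle separately.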

$ ldd /bin/ls
        linux-vdso.so.1 =>  (0x00007ffc2bd97000)
        ...
$ locate linux-vdso.so.1 | wc -l
0

Because the vDSO depends on the presence of a run-time linker to have its functions linked into the program, it can only be used by dynamically linked programs. This explains why calls to time() from within a dynamically linked program are approximately ten times as fast as from within a statically linked program.

$ cat gettime.c
#include <time.h>
int main() {
  for (int i = 0; i < 100000000; ++i)
    time(NULL);
}
$ cc -o gettime gettime.c
$ time ./gettime
0.34user 0.00system 0:00.34elapsed ...
$ cc -o gettime gettime.c -static
$ time ./gettime
1.20user 2.12system 0:03.32elapsed ...

As of 2012, FreeBSD also uses a shared page to optimize access to its clocks. Its address is provided to the application using an AT_TIMEKEEP entry stored in the auxiliary vector. Though its data types are stored in a header called <sys/vdso.h> and its functions are prefixed with __vdso_, it is technically speaking not a vDSO, as the kernel does not expose it in the form of a shared library. The C library accesses the shared page’s contents directly.

CloudABI’s use of the vDSO

The idea behind CloudABI’s use of the vDSO is to generalize the approach that Linux uses. Instead of letting it only provide functions to obtain the time of day, our vDSO can be used to provide implementations of any system call. That way, a program that is emulated never needs to use the syscall instruction. It can always use plain function calls to communicate with its environment.

First, let’s take a look at how this is implemented inside of the emulator. Prior to starting our CloudABI program, we craft an ELF shared library on the stack. This shared library consists of an ELF header, followed by a program header and a number of dynamic sections. The dynamic sections point to a symbol table that contains one symbol for every system call. All system call names are prefixed with cloudabi_sys_. Just like on Linux, the address of this shared library is provided to the program through the AT_SYSINFO_EHDR entry in the auxiliary vector.

On the side of the application we see that our implementation is a lot more minimalistic than on Linux. As CloudABI applications are always statically linked, there is no way to link to the symbols provided by the vDSO natively. Instead, we create a global structure that contains one function pointer for every system call. During startup, the link_vdso() function scans the symbol table of the vDSO and copies the addresses of the vDSO’s functions into the global structure. The C library can then invoke these functions by using wrapper functions that make use of the global structure.

Right now only the userspace emulator makes use of the vDSO. The in-kernel support provided by FreeBSD and Linux still starts programs without providing a vDSO, meaning that the global structure still needs to be initialized with fallback implementations that use syscall. A future project would be to let the kernels also provide a valid vDSO, so that the fallback implementations can be removed from the executables entirely. This would allow us to scrap system call numbers from the ABI. Combined with symbol versioning, this would make it a lot easier to provide compatibility going forward.
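The mechanism from the last two paragraphs can be sketched as follows. This is a simplified model rather than the code in CloudABI's C library: the symbol table is reduced to an array of name and address pairs instead of real ELF structures, the system call signatures are invented for brevity, and every identifier starting with example_ or emulated_ is hypothetical. Only the cloudabi_sys_ prefix and the idea of a global structure filled in at startup come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Invented, simplified signatures for two system calls. */
typedef int (*sys_clock_fn)(long *out);
typedef int (*sys_yield_fn)(void);

/* Global structure holding one function pointer per system call. */
struct syscall_table {
  sys_clock_fn clock_time_get;
  sys_yield_fn thread_yield;
};

/* Stand-in for one symbol table entry of the vDSO. */
struct vdso_symbol {
  const char *name;
  void (*addr)(void);
};

/* Fallbacks; a real one would issue the syscall instruction instead. */
static int fallback_clock_time_get(long *out) { *out = -1; return 1; }
static int fallback_thread_yield(void) { return 1; }

struct syscall_table syscalls = {fallback_clock_time_get,
                                 fallback_thread_yield};

/* Stand-in for an implementation that the emulator would provide. */
int emulated_clock_time_get(long *out) { *out = 42; return 0; }

/* Copies the addresses of known vDSO symbols into the global structure. */
void example_link_vdso(const struct vdso_symbol *symtab, size_t count) {
  static const char prefix[] = "cloudabi_sys_";
  assert(symtab != NULL || count == 0);
  for (size_t i = 0; i < count; ++i) {
    const char *name = symtab[i].name;
    if (strncmp(name, prefix, sizeof(prefix) - 1) != 0)
      continue; /* Not a system call symbol. */
    name += sizeof(prefix) - 1;
    if (strcmp(name, "clock_time_get") == 0)
      syscalls.clock_time_get = (sys_clock_fn)symtab[i].addr;
    else if (strcmp(name, "thread_yield") == 0)
      syscalls.thread_yield = (sys_yield_fn)symtab[i].addr;
  }
}

/* Wrapper that the rest of the C library would call. */
int example_clock_time_get(long *out) {
  return syscalls.clock_time_get(out);
}
```

System calls that the vDSO does not provide keep their fallback, so the same executable would work both with and without a vDSO. Once every environment supplies a vDSO, the fallbacks could simply be dropped from the initializer.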

Closing words

This brings us to the end of our second article. Stay tuned for next week’s article, in which we’re going to take a look at how the emulator and the program work together to provide efficient thread-local storage.

As before, be sure to send an email to info@nuxi.nl or send me a message on Twitter at @EdSchouten if you have any feedback!