Nuxi The CloudABI Development Blog

Argdata: a binary serialisation format

March 10, 2017 by Ed Schouten

You may remember from the demos presented in our previous blog posts that most CloudABI applications start up through a function called program_main(), and that this function has a single argument of type const argdata_t *. This differs from traditional C programs, which start through main(), receiving a list of string command line arguments. In today’s blog post, let’s discuss in a bit more detail what CloudABI’s argdata_t type is and what problem it tries to solve.

Problem statement

One of the goals behind CloudABI is that programs start up in such a way that they are sandboxed as soon as the first instruction gets executed. In effect, CloudABI programs can only interact with file descriptors that are available on startup, and those derived from them.

Though this model is very powerful, I noticed very early on that it doesn’t integrate well with existing workflows for starting programs. For example, most command line applications allow you to pass in filenames as arguments. By the time these tools are capable of processing these arguments, the program is already sandboxed, meaning the actual files can no longer be accessed. The same problem also applies to applications using configuration files, as the configuration files may in turn refer to other paths on disk.

Our approach to solving this has been to replace command line arguments with a YAML-like tree structure, where in addition to the normally used data types (strings, integers, dictionaries, lists, etc.), file descriptors are also a primitive type. This way, the configuration of a program effectively also acts as its security policy.

When executing a new CloudABI process, the kernel needs to copy the tree structure into the new process’s address space, which is why it needs to be serialised into a contiguous block of data. At the same time, this mechanism should also ensure that only the file descriptors referenced by the tree structure remain open.
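To make the idea of "configuration as security policy" a bit more concrete, here is a minimal sketch in C++17. It is a toy model, not the Argdata API: ConfigNode, collect_fds() and fds_to_close() are hypothetical names invented for this illustration. The tree treats file descriptors as just another kind of leaf, and a walk over the tree yields the set of descriptors that may stay open.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical toy model of a configuration tree; not the Argdata API.
struct ConfigNode {
  enum class Kind { Str, Int, Fd, Seq };
  Kind kind;
  std::string str;                   // Used when kind == Str.
  long num = 0;                      // Integer value, or descriptor number when kind == Fd.
  std::vector<ConfigNode> children;  // Used when kind == Seq.
};

// Walks the tree and records every descriptor number it refers to.
void collect_fds(const ConfigNode &node, std::set<int> &out) {
  if (node.kind == ConfigNode::Kind::Fd)
    out.insert(static_cast<int>(node.num));
  for (const ConfigNode &child : node.children)
    collect_fds(child, out);
}

// Returns the descriptors from open_fds that the tree does not mention,
// i.e. the ones that should not be available to the new process.
std::vector<int> fds_to_close(const ConfigNode &root,
                              const std::vector<int> &open_fds) {
  std::set<int> keep;
  collect_fds(root, keep);
  std::vector<int> result;
  for (int fd : open_fds)
    if (keep.count(fd) == 0)
      result.push_back(fd);
  return result;
}
```

In this sketch, a tree that mentions descriptors 3 and 7 implies that every other open descriptor has to go. How the real implementation achieves this is not shown here.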

Existing serialisation formats

When I started working on this, I wanted to see whether I could base my work on an existing serialisation format. Below are some examples of the formats I considered, together with some advantages and disadvantages of using them for this specific purpose.

Textual formats: YAML and JSON

Binary formats: MessagePack and BSON

Special-purpose formats: FreeBSD’s libnv

Argdata

Not being fully satisfied with any of the serialisation formats I could find, I eventually decided to define a custom serialisation format, called Argdata (‘arguments data’). In essence, it’s a binary encoding of the most commonly used data types of YAML. Composite types (map and seq) keep track of the size of each of their children, making it possible to efficiently scan through them without recursing. When reading, memory overhead is proportional only to the number of iterators allocated.
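The benefit of storing each child’s size can be illustrated with a small sketch. Note that this is a toy encoding and not the Argdata wire format (the specification in the repository is the authority on that): here every child is assumed to be stored as a single length byte followed by that many payload bytes, and nth_child() is a hypothetical helper. Because the length is known up front, a nested child can be stepped over without looking inside it.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Toy encoding (NOT the Argdata wire format): a sequence is a run of
// children, each stored as one length byte followed by the payload.
// Returns the payload of the n-th child, or nullopt if there is none.
std::optional<std::string_view> nth_child(std::string_view seq, std::size_t n) {
  while (!seq.empty()) {
    std::size_t len = static_cast<unsigned char>(seq[0]);
    if (len > seq.size() - 1)
      return std::nullopt;  // Truncated input.
    if (n == 0)
      return seq.substr(1, len);
    // Skip this child entirely, however deeply nested its payload is.
    --n;
    seq.remove_prefix(1 + len);
  }
  return std::nullopt;
}
```

The returned std::string_view points into the original buffer, so nothing is copied or allocated while scanning. A nested sequence can be read by calling nth_child() again on the returned payload.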

A reference implementation of Argdata, including a specification of the binary encoding, can be found on GitHub. The reference implementation is written in C, but has very good C++ bindings as well. The C++ bindings make use of various C++17 features, like std::string_view and std::optional, which makes the API both efficient and very friendly to use. I’d like to thank Maurice Bos for contributing these bindings to the project!

For the remainder of this article, let’s take a look at what these C++ bindings look like.

Serialisation

Below is a simple example of a function that creates a serialised copy of a dictionary containing two keys. As argdata_t::create_map() doesn’t take ownership of its children, the function has to ensure all nodes are deallocated before returning. This is done automatically, as all create_…() functions make use of std::unique_ptr.

#include <argdata.hpp>
#include <memory>
#include <vector>

std::vector<unsigned char> create_object() {
  // Construct {"Hello": 5, "World": true}.
  std::unique_ptr<argdata_t> hello = argdata_t::create_str("Hello"),
                             five = argdata_t::create_int(5),
                             world = argdata_t::create_str("World");
  const argdata_t *keys[] = {hello.get(), world.get()};
  const argdata_t *values[] = {five.get(), argdata_t::true_()};
  std::unique_ptr<argdata_t> root = argdata_t::create_map(keys, values);

  // Serialize the object.
  return root->serialize();
}

Deserialisation

The following piece of code shows what the deserialisation process looks like. Every node in the tree has a number of get_…() functions that allow you to access that node’s value, converted to a native type. For maps and sequences, these functions return an object that can be iterated. The return value of the get_…() functions is wrapped into an std::optional, so that type mismatches can be detected easily.

In cases where you don’t really care about type mismatches, nodes also provide as_…() functions that fall back to returning a default value upon failure. In those cases, as_map(), as_seq() and as_str() will return empty maps, sequences and strings, respectively.

#include <argdata.hpp>
#include <memory>
#include <optional>
#include <ostream>
#include <string_view>
#include <vector>

void print_object(const std::vector<unsigned char> &buf, std::ostream *s) {
  std::optional<int> hello;
  std::optional<bool> world;
  std::unique_ptr<argdata_t> root = argdata_t::create_from_buffer(buf);
  for (auto [key, value] : root->as_map()) {
    if (std::optional<std::string_view> keystr = key->get_str(); keystr) {
      if (*keystr == "Hello")
        hello = value->get_int<int>();
      else if (*keystr == "World")
        world = value->get_bool();
    }
  }
  if (hello)
    *s << "Hello: " << *hello << std::endl;
  if (world)
    *s << "World: " << *world << std::endl;
}
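The relationship between the strict get_…() accessors and the forgiving as_…() accessors can be pictured with a self-contained sketch. ToyNode below is a hypothetical stand-in and not the real argdata_t; it only assumes what the text above describes, namely that get_…() reports a type mismatch through an empty std::optional and that as_…() falls back to a default value.

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <variant>

// Hypothetical stand-in for a deserialised node; not the real argdata_t.
class ToyNode {
 public:
  explicit ToyNode(std::variant<bool, long, std::string> v)
      : value_(std::move(v)) {}

  // Strict accessors: an empty optional signals a type mismatch.
  std::optional<long> get_int() const {
    if (const long *p = std::get_if<long>(&value_))
      return *p;
    return std::nullopt;
  }
  std::optional<std::string_view> get_str() const {
    if (const std::string *p = std::get_if<std::string>(&value_))
      return std::string_view(*p);
    return std::nullopt;
  }

  // Forgiving accessors: fall back to a default on mismatch.
  long as_int(long def = 0) const { return get_int().value_or(def); }
  std::string_view as_str(std::string_view def = {}) const {
    return get_str().value_or(def);
  }

 private:
  std::variant<bool, long, std::string> value_;
};
```

With this shape, code that must reject malformed input checks the optional, while code that is happy with defaults calls the as_…() variant and moves on.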

Availability

When the Argdata library was originally implemented, it was part of CloudABI’s C library, so that it could be used by CloudABI programs to pass data to subprocesses. The cloudabi-run utility included a trimmed-down copy of this library as well, as it needs one to pass data to the initial CloudABI process.

As we think Argdata may be useful outside of CloudABI as well, we recently decided to separate it from cloudlibc and turn it into a separate project. The code has been made portable and now lives in its own repository on GitHub. Be sure to give it a try!