Communication in Systems, a Review

Posted on January 5, 2016 by Gabe Parmer

Composite focuses on using fine-grained, isolated components to enable system policy customization, increased fault isolation, and decreased scope for privilege escalation. Composite leverages hardware page-tables to provide memory isolation between components. With dual-mode protection hardware, switching between page tables is a privileged instruction that can only be performed by the kernel, so control-flow transfer between components must be mediated by the kernel. This relationship between the hardware and software motivates the design of efficient communication (IPC) via kernel design. This article gives an overview of a few IPC designs from different systems, to lay the ground-work for a future article that summarizes the Composite design. It stays away from concrete implementation details and focuses on the high-level designs; those details, too, are the topic of future articles.

IPC Designs

There are a few different factors that fracture the design space for system-provided IPC. The major dimensions are:

Control transfer: How does execution move between components? Does the client block and switch to the server (synchronous rendezvous between threads), or do the client and server proceed independently (asynchronous message passing)?

Data transfer: Data is often passed between components, which is a key aspect of message passing. Is that data movement unconstrained, or restricted to a small, bounded payload?

Multi-core: Is the communication between components on the same core (intra-core), or across cores (inter-core)?

I do need to point out that the programming model that is exposed to application developers is often orthogonal to these dimensions. The same APIs and behaviors can be layered on top of many different IPC mechanisms that differ across the factors above. This document does not discuss the programming models.

Synchronous Rendezvous between Threads

I’ll start with where much of the research community ended up. Liedtke went against convention at the time, and made a strong argument for synchronous IPC implemented as rendezvous between threads in L4. Two threads (one per-component in our example) communicate by following a protocol. A client sends a notification to the server, and simultaneously blocks until the server completes processing on its behalf. The server simultaneously replies to its last client, and either blocks waiting for the next request, or receives it if one is pending. Much of the cleverness and motivation behind this approach centered around the optimizations this design enables. These include lazy scheduling, direct switching, and thread-as-message-queue to avoid memory allocation. The design also has breadth: IPC was used as the primary mechanism for memory mapping, interrupt handling, and fault handling.
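
To make the rendezvous protocol concrete, here is a minimal sketch of the client and server sides, written against hypothetical ipc_call and ipc_reply_wait system calls (illustrative names and signatures, not the API of any particular L4 variant):

```c
/* Hypothetical rendezvous IPC primitives, loosely in the L4 style.
 * These names and signatures are illustrative, not a real kernel API. */
struct msg { unsigned long regs[4]; };    /* small, register-sized payload */

unsigned long process(unsigned long op, unsigned long arg);  /* server work (stand-in) */

int ipc_call(int server_tid, struct msg *snd, struct msg *rcv);      /* send + block for reply */
int ipc_reply_wait(int client_tid, struct msg *reply,
                   struct msg *next, int *next_tid);                 /* reply + wait for next */

/* Client: sends a request and blocks until the server replies. */
long client_request(int server_tid, unsigned long op, unsigned long arg)
{
	struct msg snd = { .regs = { op, arg } }, rcv;

	ipc_call(server_tid, &snd, &rcv);    /* one kernel path: switch directly to the server */
	return (long)rcv.regs[0];
}

/* Server loop: replies to the previous client and atomically waits for the next. */
void server_loop(void)
{
	struct msg req, reply;
	int client;

	ipc_reply_wait(0, NULL, &req, &client);   /* first iteration: just wait for a request */
	for (;;) {
		reply.regs[0] = process(req.regs[0], req.regs[1]);
		/* reply to the last client and receive the next request if pending */
		ipc_reply_wait(client, &reply, &req, &client);
	}
}
```

The key property is that ipc_call and ipc_reply_wait each combine a send and a receive in a single kernel entry, which is what enables direct switching between the two threads.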

The original implementation focused on unconstrained data movement. Using some tricks, data could be copied between processes at the cost of a single copy (this should be surprising and impressive, if you think about it). However, it has the side-effect of significantly complicating the kernel. Consider that (1) copying arbitrary buffers can cause page-faults, and (2) page-faults in this system are serviced by threads in other processes. So an IPC can fault at some arbitrary intermediate processing step, and require a nested IPC. When this nested IPC completes, it must pick up the original IPC where it left off. This causes highly specialized processing in the page-fault handler, and requires some representation of continuations. We start to see why unconstrained data movement can complicate the design significantly. These (and other) problems motivated Shapiro to deem synchronous IPC vulnerable to attacks.
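
As a rough illustration of why continuations are needed, here is a sketch of the kernel state required to resume a cross-address-space copy after a pager IPC (all names are hypothetical, not taken from any L4 implementation):

```c
#include <stddef.h>

struct thread;                            /* stand-in kernel types and helpers */
size_t kernel_copy(struct thread *to, char *dst,
                   struct thread *from, char *src, size_t len);
void   ipc_complete(struct thread *snd, struct thread *rcv);

/* Continuation saved when a cross-address-space copy faults: progress is
 * recorded, a nested IPC is made to the user-level pager, and the copy
 * resumes from here afterwards. */
struct ipc_copy_cont {
	struct thread *sender, *receiver;
	char          *src, *dst;         /* user buffers being copied between */
	size_t         len, progress;     /* total size and bytes copied so far */
};

/* Called once the pager IPC returns and the faulting page is mapped. */
void ipc_copy_resume(struct ipc_copy_cont *c)
{
	size_t done = kernel_copy(c->receiver, c->dst + c->progress,
	                          c->sender,   c->src + c->progress,
	                          c->len - c->progress);

	c->progress += done;
	if (c->progress == c->len) ipc_complete(c->sender, c->receiver);
	/* otherwise another fault occurred; a new pager IPC is issued and this
	 * continuation will be resumed again with the updated progress. */
}
```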

Modern implementations of this IPC mechanism (Fiasco and seL4) use constrained data movement by default. The amount of data that can be transferred differs between systems, but it is often around 64 words, and always smaller than a page. In fact, each thread has a structure shared between the kernel and user-level, and the message data is simply copied between these structures. Now, as the only copy is between memory that is guaranteed to be mapped in, faults are averted.
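
A minimal sketch of this fast path, assuming a hypothetical per-thread buffer layout (in the spirit of L4 UTCBs and seL4 IPC buffers, but with illustrative names and sizes):

```c
#include <string.h>

#define MSG_WORDS 64                      /* bounded payload: ~64 words, < 1 page */

/* Hypothetical per-thread message buffer, shared between user-level and the
 * kernel, and pinned/mapped by construction. */
struct ipc_buffer {
	unsigned long tag;                /* message label and word count */
	unsigned long mr[MSG_WORDS];      /* message registers */
};

/* Kernel-side transfer on the IPC fast path: since both buffers are always
 * mapped, this copy can never fault. */
static void transfer_msg(struct ipc_buffer *snd, struct ipc_buffer *rcv)
{
	unsigned long n = snd->tag & 0xff;    /* word count encoded in the tag */

	if (n > MSG_WORDS) n = MSG_WORDS;     /* enforce the bound */
	rcv->tag = snd->tag;
	memcpy(rcv->mr, snd->mr, n * sizeof(unsigned long));
}
```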

Though Fiasco does implement transparent cross-core IPC, these systems are primarily designed around intra-core IPC.

A great paper recounts some of the changes to system design from the original L4 to seL4. This paper is fairly limited as it only covers one system, and it isn’t an introduction, so your mileage may vary, but it is a nice, 20-year war story.

K42 synchronously communicates between dispatchers, which are essentially kernel threads (i.e. threads that the kernel is aware of, and schedules). Execution in the server is performed via an upcall that can then decide to switch to a specific user-level thread to handle the request. This form of IPC is as related to scheduler activations (in its kernel/user coordination) as it is to L4-style IPC (in the kernel).
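
A sketch of what that upcall path might look like on the server side (hypothetical names, not the actual K42 interfaces):

```c
/* User-level thread abstraction and management routines (stand-ins). */
struct ult { int id; void (*run)(void *req); };

struct ult *ult_lookup_or_create(int client_id);
void        ult_switch(struct ult *t, void *req);

/* Entry point invoked by the kernel's upcall into the server's dispatcher
 * when an IPC arrives. */
void dispatcher_ipc_upcall(int client_id, void *req)
{
	/* The dispatcher, not the kernel, decides which user-level thread
	 * services this request, keeping scheduling policy at user level. */
	struct ult *t = ult_lookup_or_create(client_id);
	ult_switch(t, req);
}
```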

Asynchronous Message Passing

Perhaps the most famous version of asynchronous IPC is UNIX pipes. Mach popularized microkernels (then disillusioned many about them) and is based on asynchronous IPC. It still lives in OSX, but the microkernel bits aren’t extensively used for their original purposes, as far as I know. Singularity uses language-based protection, and asynchronous message passing between threads. Xen uses asynchronous IPC between virtual machines. All modern L4 variants also, somewhat surprisingly, include some form of asynchronous IPC. It should be noted that any inter-core communication is almost necessarily asynchronous (for the same reasons that synchronous RPC is rarely used on the internet).

Clearly, the asynchronous IPC thing must be a solved problem by now.

Liedtke’s paper is a practical argument for synchronous IPC. It argues that synchronous IPC was (at the time) the only known design that enables an efficient implementation. I’ll take the liberty of summarizing many disparate sources, and use my own judgment to answer the question of why asynchronous IPC can be inefficient. Then, I’ll make the opposite argument. As it turns out, in systems everything’s a trade-off. Go figure.

Why is Asynchronous IPC Inefficient?

  1. Memory copying. The data that is sent from the client will not be accessed by the server until some indeterminate time in the future. Regardless, we want to continue to allow the client to access and modify its own memory without worrying about whether the server has processed it yet. Thus, the typical implementation (e.g. pipes) copies data from the client into the kernel, and from the kernel into the server (see the pipe sketch after this list). This enables the client to modify and use its own buffers immediately after it sends the data. Memory copying trashes cache contents and is one of the fastest ways to kill your performance on modern systems where memory is significantly slower than the CPU. Mach attempts to avoid the copy by using copy-on-write, which ends up not working very well, especially on multi-cores where page-permission modification is hugely expensive.
  2. Memory allocation. The space in the kernel to store the message and its data must come from somewhere. The options are to either use a fixed-size buffer (pipes), or allocate space (Mach). Allocation is expensive relative to the fast-path of synchronous IPC, and has bad edge-case behavior: it pushes up tail-end latencies, and it can always fail. An additional complication is that a client in an infinite send loop, with a server that never receives, consumes an infinite amount of kernel memory. A convenient DoS. This is solved, as with pipes, by placing a finite upper bound on the asynchronous behavior before the system switches to synchronous. Thus, clients and servers must include logic that considers both synchronous and asynchronous behaviors.
  3. Scheduling overheads. The system has to switch between threads at some point so that the server can process the client’s messages. This operation includes all of the overhead of scheduling and dispatching. These are efficient operations in most contexts; the kind that are so small they can be ignored. Relative to synchronous IPC latencies, however, these overheads are non-trivial (e.g. a thread switch in Linux is generally more expensive than IPC in Composite).
  4. Separation of event notification from communication. Synchronization is often essential at some point. At the bare minimum, the server often wants to block if there are no more messages to process. The event notification APIs are separate from message reception (select vs. read), and thus represent pure overhead on top of the message passing path.
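
To make the copying and notification overheads concrete, here is a minimal POSIX sketch using a pipe; the "client" and "server" are collapsed into one process purely for brevity:

```c
/* Minimal POSIX pipe example: data is copied into the kernel on write() and
 * out of the kernel on read(), and event notification (select) is a system
 * call separate from the read that actually moves the data. */
#include <stdio.h>
#include <unistd.h>
#include <sys/select.h>

int main(void)
{
	int fds[2];
	char msg[] = "request", buf[64];

	if (pipe(fds)) return 1;

	/* "client": copy #1, user buffer -> kernel pipe buffer.  If the bounded
	 * kernel buffer is full, write() blocks: async behavior turns synchronous. */
	write(fds[1], msg, sizeof(msg));

	/* "server": block until data is available, via a separate notification call. */
	fd_set rd;
	FD_ZERO(&rd);
	FD_SET(fds[0], &rd);
	select(fds[0] + 1, &rd, NULL, NULL, NULL);

	/* copy #2, kernel pipe buffer -> user buffer */
	ssize_t n = read(fds[0], buf, sizeof(buf));
	printf("received %zd bytes: %s\n", n, buf);

	return 0;
}
```

Each message crosses the user/kernel boundary twice, the kernel’s pipe buffer is bounded (so a full buffer turns the send synchronous), and the select is pure overhead on top of the read.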

If you’re banging your fist down while furiously calling foul at this point, take a deep breath. Many of these overheads are artifacts of specific implementations. However, they are, historically, major contributors to slow asynchronous IPC communication paths. It should be noted that IPC in many of these previous systems did not really need to be fast. For example, UNIX pipes really don’t need to be that speedy to be useful. “Slow” is not equivalent to “bad”. Pipes are a great API that is easy to use. Slow is often fast enough. In systems that liberally use protection domains for increased isolation (I’d argue that VMs fit into this category), we need something more efficient.

Why is Asynchronous IPC Efficient?

Many systems have done significantly better through light-weight designs. These include:

From these examples, we can informally extract some themes for making efficient, asynchronous IPC:

Communication Design Sweet Spots

Given 35 years of research into these topics, it seems that the community has found two sweet spots in the design space for high-performance communication.

  1. synchronous, intra-core, with restricted data movement
  2. asynchronous, inter-core, with restricted data movement

One obvious observation: restricted data movement seems to have won. This implies that systems require an orthogonal mechanism for data transfer. A topic for another day.

Some might argue that VM IPC in systems such as Xen is asynchronous and intra-core, and is conspicuously absent in the sweet spot characterization above. This is true. Look at the communication latencies on such systems for a counter-argument.

Taking a dogmatic approach by choosing only one of these two designs has had complicated results. The fully synchronous L4 had a problem: in practice, they found that you need a thread in each server for every application thread, if that server (or some server it leverages) can block for long amounts of time. This was done to prevent one application thread from introducing arbitrary blocking latencies into another (possibly unrelated) application thread, but it necessitates multi-threaded servers, and a lot of threads. On the other side, a fully asynchronous approach can cause resource management difficulties, especially when a single event handled by a server can cause many subsequent events in other servers (which can cause many subsequent events…). Who to run at any point, and how that decision impacts end-to-end latency, is not at all clear.

The interesting question is when and where each communication mechanism should be used. These sweet spots certainly give a fair amount of guidance, but they hide important details. System design, unfortunately, cannot be entirely reduced to sweet spots. Thus, I’ll leave this as the topic of another post.

Composite IPC

This is a long history that is only superficially touched on above. It is usually not wise to do research in a space that has such a history, and that seems to have converged on a few designs. However, as with all research, when there is a very good motivation to change the underlying assumptions of the system, a new design is viable and interesting. The next article will summarize the design of IPC in Composite.
