Posted on May 7, 2016 by Gabe Parmer
These questions often cause a lot of confusion on Internet message boards. Simple pronouncements that some system operation is expensive or cheap seem to be convincing if they are yelled loud enough, and with enough authority. But as with many issues around system implementation, understanding what’s really going on at the lowest level often leads to a more nuanced understanding of high-level application behavior.
This post will introduce some of the key operations in systems from which most of the software is built. They define the software structure, the trust relationships, and many of the overheads that the system experiences. Many people go into operating systems with the expectation that if they can remember the large volume of topics and concepts, they’ll do fine. They view learning OSes as similar to learning Biology: remember all of the different parts of a cell and the roles they play in the metabolic cycle, and you build your education in Biology (I’m over-simplifying, of course). In OSes, it is certainly important to understand the higher-level concepts, the system classifications, and the resource management policies. But it is more important to understand why these concepts exist, why they were created, and why the policies take the form that they do. Systems were created by human intellect, so there should be a deductive, rational path that leads up to each design, and each aspect of systems software. Understanding this logical path leads to a concrete understanding of the trade-offs that go into each system. The starting point for understanding the reasoning behind each system is to understand the fundamental “atoms” of the system1. These are the architectural proclamations that each OS must consider in its design.
An incomplete set of system atoms includes:

- memory accesses (loads and stores), and how they interact with the cache hierarchy and the TLB,
- dual-mode (kernel/user) execution, and the transitions between the two modes, and
- virtual address spaces, and the page-tables that define them.
For now, I’ll omit (important) atoms having to do with input and output (including networking, non-volatile storage). I’ll also omit the (very important) atoms relevant to multi-core and multi-processor systems. Those will be the subject of another post. If any of these atomic elements were to change significantly (e.g. dual-mode execution went away), system design would change significantly.
Understanding how simple loads and stores interact with the architecture is fundamental to understanding the performance of every operation in the system. Memory cannot both be large (many GBs), and as fast as the processor. Thus, architectures have caches that attempt to provide the illusion of fast access to large amounts of memory. To get good performance, locality must be maximized.
Caches often have at least three levels (L1 through L3 caches). Ballpark cycle counts for issuing a load or store that is satisfied by a specific level in the hierarchy:

- L1 cache: around 3 cycles,
- L2 and L3 caches: on the order of tens of cycles, and
- main memory (a miss in every level of cache): roughly 150-500 cycles.
To be clear, that means that if you issue a single load or store instruction, it can take up to 500 cycles to finish if the memory is not in cache. That delays all processing in your processor pipeline, essentially making it execute 500x slower than it should! There’s no quicker way to make your 2GHz processor feel like it is executing at 1MHz than to use your caches badly. No problem! The 80s were great for speedy computation, right?
For systems code, this is particularly relevant as “pointer chasing” is quite common. Pointer chasing means that the operations of the system often include looking up some node in a data-structure (e.g. a dentry), finding a pointer to the next node (e.g. an inode), etc. Each node is not necessarily accessed for very long (they are simply used to find the next node), and each one requires cache-line accesses, each with very little locality.
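To make pointer chasing concrete, here is a minimal sketch of such a traversal; the `node` structure and `lookup` function are hypothetical, not from any particular kernel, but the access pattern mirrors a dentry/inode-style walk:

```c
#include <stddef.h>

/* A hypothetical linked structure: each node lives in its own cache-line(s),
 * so following ->next likely misses in the cache. */
struct node {
	int          key;
	void        *data;
	struct node *next;
};

/* Walk the list until we find `key`.  Each iteration touches a node only long
 * enough to read `key` and `next` (one or two cache-lines), with very little
 * locality -- so a long chain can cost hundreds of cycles per node. */
void *lookup(struct node *head, int key)
{
	for (struct node *n = head; n != NULL; n = n->next) {
		if (n->key == key) return n->data;
	}
	return NULL;
}
```

Contrast this with scanning an array of the same structures: several entries share each cache-line, and the prefetcher can run ahead of the loop.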
So how do we think about the performance characteristics of the caches, and what can we do to use caches well?
Cache-lines. Memory is broken into 64 byte aligned, 64 byte chunks called cache-lines. When you access an address, the cache-line containing that address is brought as close to the processor as possible (into L1 cache). There is an incentive to place data that will be accessed together (with temporal locality) into the same 64 byte chunk, when possible. This is part of the reason that arrays are so much faster than linked lists. Multiple entries can fit into the same cache-line, so fetching a single cache-line into L1 also fetches adjacent entries.
Pages. Virtual address translations are done at the page granularity - at chunks of 4KB. Parsing through page-tables (see below) to translate a page’s virtual address to a physical address is expensive – 2 (4) memory accesses on x86 (x86-64). To avoid this cost on every memory access, the Translation Lookaside Buffer (TLB) caches translations. As with cache-lines, data-structures operations that perform memory accesses within the same page require fewer entries in the TLB to cache their translations, thus imposing less TLB pressure. Where TLB pressure is causing prohibitively high TLB miss rates, larger page sizes can be used. Available sizes across a few architectures include 64KB, 2MB, 4MB, and 1GB. Larger pages mean that the TLB can (in total) hold translations for more memory.
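As one example of trading page size against TLB pressure, Linux lets an application request huge pages explicitly. This sketch assumes a Linux system with huge pages reserved (e.g. via /proc/sys/vm/nr_hugepages); a single 2MB huge page covers memory that would otherwise require 512 separate TLB entries at a 4KB page size:

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

#define SZ (2UL * 1024 * 1024)	/* one 2MB huge page */

int main(void)
{
	/* MAP_HUGETLB asks the kernel to back the mapping with huge pages, so
	 * this entire 2MB region needs one TLB entry instead of 512. */
	void *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
	               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");	/* likely: no huge pages reserved */
		return 1;
	}
	((char *)p)[0] = 1;	/* touch it: one translation covers 2MB */
	munmap(p, SZ);
	return 0;
}
```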
TLB caches are virtually tagged and indexed. This means that they must either be tagged with an address-space identifier, or be flushed when address spaces are switched (more below). The implication of flushing the TLB is that subsequent memory accesses will need to be translated through the page-tables.
Locality. I’ve discussed two forms of spatial locality above - inside a cache-line, and inside a page. Most modern architectures include a prefetcher that imposes its own locality optimizations. If memory accesses are made in a pattern that resembles sequential access (e.g. increasing memory accesses with a certain stride), then the hardware logic will speculate and pull in future cache lines before they are actually accessed. The prefetcher is rather aggressive on Intel x86; touching a single address on a cache line tends to pull in the next cache-line, unless it crosses a page boundary. It appears that Intel essentially thinks that cache-lines should be 128 bytes, but that they aren’t willing to risk increasing TLB pressure on that bet.
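The difference between prefetcher-friendly and prefetcher-hostile access patterns is easy to see. A sketch (the matrix size is chosen arbitrarily, just large enough not to fit in cache): summing row-by-row walks memory sequentially, while summing column-by-column strides by an entire row on each access, defeating both cache-line reuse and the prefetcher:

```c
#define N 2048

static int m[N][N];	/* ~16MB: larger than most L3 caches */

/* Sequential: consecutive accesses fall on the same cache-line, and the
 * prefetcher pulls upcoming lines in before the loop reaches them. */
long sum_rows(void)
{
	long s = 0;
	for (int i = 0; i < N; i++)
		for (int j = 0; j < N; j++) s += m[i][j];
	return s;
}

/* Strided: each access lands N * sizeof(int) bytes away from the previous
 * one, touching a new cache-line (and frequently a new page) every time. */
long sum_cols(void)
{
	long s = 0;
	for (int j = 0; j < N; j++)
		for (int i = 0; i < N; i++) s += m[i][j];
	return s;
}
```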
Capacity. Caches have capacities that increase with increasing cache levels. For example, L2 is larger than L1. The more data that gets accessed, the more pressure it puts on the caches (i.e. the higher the probability that a memory access will be satisfied lower down in the cache hierarchy). Any operations that require large amounts of data accesses pressure the previous contents of the cache.
Example: `memcpy`. `memcpy` accesses two arrays of memory, copying one into the other. The number of cache-lines accessed can be large, but since accesses are sequential, the prefetcher can attempt to hide the latency of bringing many of the cache-lines into closer levels of the cache. However, each cache-line brought into the higher levels of the cache evicts another cache-line. When the evicted lines are accessed in the future, they will miss in the cache, causing higher overheads as the processor must look lower in the cache hierarchy. So the indirect costs of `memcpy` can be significant.
All system software is subject to the whims of the memory hierarchy and its fickle reliance on locality. You can ignore it as much as application developers typically do, and the speed and intelligence of modern processors will mostly compensate. However, if a path in the operating system must be fast, or the OS must minimally interfere with the cache footprint of applications, considering locality in your code is essential. Regardless, counting the number of cache-lines and pages touched by key operations is a reasonable approximation of performance. Multiply those numbers by three to assess the best possible performance (every access satisfied by L1). For a reasonable approximation of performance, calculate \(N \times\) the number of cache-line accesses plus \(2N \times\) the number of TLB misses, where \(N\) is the cost of a single memory access (roughly 3 to 150 cycles, depending on which level of the hierarchy satisfies it), and the factor of two accounts for a two-level page-table walk on each TLB miss.
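As a worked example (the numbers are invented for illustration): suppose an operation touches 10 cache-lines spread across 4 pages, and little of that data is cached, so \(N \approx 150\). The estimate is

\[10 \times 150 + 2 \times 150 \times 4 = 1500 + 1200 = 2700 \text{ cycles},\]

while the best case (everything in L1, \(N \approx 3\)) is closer to \(10 \times 3 = 30\) cycles. That is nearly two orders of magnitude of difference, purely due to locality.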
Some hardware operations are privileged, and should not be broadly available to all computations in the system. Booting up a system builds a complex, and organized environment out of a very primitive one. Examples of such operations include changing processor execution state (real to protected mode in x86), switching page-tables to change the accessible memory, talking directly to I/O (i.e. via in/out port instructions in x86), and sleeping or halting the processor.
Modern CPUs provide dual-mode execution that enables a core to be in one of two states at any point in time, either kernel-mode, or user-mode3. When in kernel-mode, code can execute privileged instructions4. The kernel is the code that the system executes in kernel-mode. When in user-mode, code that attempts to execute privileged instructions will trigger a hardware exception (as in, a software trap).
Additionally, dual-mode execution is always paired with some hardware protection facilities that prevent user-mode execution from accessing the kernel’s code and data. If the point of differentiating normal and privileged instructions is to prevent user-mode code from having total control of the hardware, the protection of kernel-mode code and data from user-mode is also essential. All of this means that dual-mode execution provides a base-line of protection on the system between code executing at kernel, and user-level.
This all raises the question, “how do we switch between user-mode and kernel-mode?”
Software interrupts (system calls and exceptions) - There are many different names for what amounts to very similar operations. Code executing in user-mode can trigger a trap with an explicit instruction (`sysenter`, `syscall`, or `int` on x86) which 1. switches the CPU mode to kernel-mode, 2. saves some registers onto a stack in the kernel, and 3. starts execution at a pre-defined instruction in the kernel. Alternatively, a trap can be caused by an exception, which triggers when an instruction is executed that would put the processor into an undefined state. Division by zero, loading or storing kernel memory while in user-mode, or executing a privileged instruction while in user-mode are all operations that will cause an exception. An exception will 1. switch the CPU mode to kernel-mode, 2. save the current register contents (often onto a stack in the kernel), and 3. start execution at a pre-defined instruction set up by the kernel. Note that an exception can trigger while executing in kernel-mode. It is important that the instruction executed upon a trap is controlled by the kernel, so that user-mode execution can’t execute arbitrary code in the kernel. Thus, the kernel often initializes the processor with this instruction at boot time.
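From user-level, the trap is usually hidden behind libc, but it can be issued nearly directly. A minimal, Linux-specific sketch (`SYS_getpid` is simply a convenient, side-effect-free system call to demonstrate with):

```c
#define _GNU_SOURCE
#include <sys/syscall.h>	/* SYS_getpid */
#include <unistd.h>		/* syscall() */
#include <stdio.h>

int main(void)
{
	/* syscall() places the system-call number and arguments into registers
	 * and executes the trap instruction (syscall/sysenter/int on x86),
	 * switching the core to kernel-mode at the kernel's entry point. */
	long pid = syscall(SYS_getpid);
	printf("pid = %ld\n", pid);
	return 0;
}
```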
Interrupts - When I/O devices need to notify the system that the CPU needs to perform some operation (e.g. talk to some device driver), or when the crystal oscillator causes a timer tick, an interrupt is triggered. Note that the CPU could be executing in either user- or kernel-mode when the interrupt triggers. When this happens, 1. the CPU switches to kernel-mode, 2. it saves the current register set (again, often onto a kernel stack), and 3. starts executing at a kernel-defined instruction.
To return from kernel- to user-mode, privileged instructions exist (`sysexit`, `sysret`, and `iret` on x86) that generally 1. restore some subset of the registers, 2. switch to user-mode, and 3. start executing at an instruction controlled by the kernel.
It is not a foregone conclusion that processors provide dual-mode execution. That our systems have a kernel is entirely based on this processor design decision. However, it is important to understand that something in the system must define protection, isolation, sharing, and resource multiplexing decisions. Currently, the kernel facilitates each of these, if not defines these policies outright (e.g. monolithic kernels). If we did away with dual-mode execution, the trick (and historic difficulty) is how to define all of these policies without imposing large restrictions in the hardware design. Put another way, if we remove dual-mode execution, what are the new atoms of the system, and how egregious are the restrictions they place on the system?
Given that dual-mode facilities are one of our atoms, it is important to realize that switching between modes is not free. Minimally, the costs include:

- Switching the mode bit. The mode bit is non-renamable processor state, so writing it restricts the out-of-order execution the core depends on.
- Saving (and later restoring) some subset of the registers, often onto a stack in the kernel.
- Finding the kernel instruction at which to begin execution (i.e. the pre-defined instruction to execute upon a `sysenter` or `int`). Often this is held in a register (e.g. the `SYSENTER_IP_MSR` model-specific register).

The direct costs of system call and return amount to between 60 and 120 cycles on x86.
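You can see these direct costs yourself with a rough microbenchmark. A sketch for x86 with GCC or Clang (`__rdtsc()` is their cycle-counter intrinsic); the result is only ballpark, since it does not serialize the pipeline around the measurement and it includes the kernel's system-call handling code, not just the hardware mode switch:

```c
#define _GNU_SOURCE
#include <x86intrin.h>		/* __rdtsc() */
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

#define ITERS 1000000UL

int main(void)
{
	unsigned long long start = __rdtsc();
	for (unsigned long i = 0; i < ITERS; i++) {
		/* a system call that does almost no work in the kernel */
		syscall(SYS_getpid);
	}
	unsigned long long end = __rdtsc();

	printf("~%llu cycles per system call round trip\n",
	       (end - start) / ITERS);
	return 0;
}
```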
The kernel itself has a cache footprint in both the TLB and the instruction and data caches. Thus, the indirect costs of system call execution are due to the kernel and the application applying more pressure on the caches than each would individually. This can be quite significant, but is application- and kernel-specific. For this reason, where possible, the kernel wants to minimize its TLB and cache footprint.
If dual-mode execution provides an environment for user-mode execution that has a subset of the privileges of the kernel, Virtual Address Spaces (VASs) enable two main capabilities:
Isolate the memory of different user-mode computations (henceforth, “user-level”) from each other. Whereas dual-mode isolates the kernel from user-level, VASs enable the kernel to control and partition the different sets of memory available to each user-level component6. If the kernel is isolated from user-level components, and the kernel has the ability to partition memory to different user-level components, then user-level components can be isolated from each other. They are isolated in the sense that the kernel can implement policies that provide whatever form of isolation it requires. For example, it can control the intersection between the different sets of memory available to components.
Virtualize memory addresses. Virtual addresses are independent in different components. If component \(c_0\) stores to address \(a\), component \(c_1\) stores to address \(b\), where \(a = b\), the actual physical memory (at physical addresses \(v_0(a)\) and \(v_1(b)\)) modified is different for each component (\(v_0(a) \neq v_1(a)\)). In contrast, it is possible that \(v_0(a) = v_1(b)\) even when \(a \neq b\). Each component is given a separate namespace of memory - they each believe that they have access to the complete complement of memory from address \(0 \to N\), and that this memory is separate from other components’. Even though the hardware has access to only a range of physical memory \(0 \to M\) (i.e. how much memory you have in DIMMs, often \(M \ll N\)), VASs separate the addresses each component sees from the address of the physical memory.
The kernel has control over each component’s VAS. “Control”, in this context, means controlling the translation between a virtual address, \(a\), and the physical memory, \(v_0(a)\), for each user-level component. The kernel must 1. control the mapping function \(v_n\), and 2. control which of those mappings is active at any point in time on each core. Two main mechanisms enable this control.
Page-tables which are a data-structure that provides the mapping \(v_n\). Any data-structure that maps between two values would work here (think: a hash-table), but most modern systems use a radix trie. To make the mapping more concrete, page-tables primarily provide the function `phys_addr_t page_table_lookup(virt_addr_t)`. The kernel modifies the data-structure to add new mappings, and to remove existing ones. A separate page-table exists for each component. Each memory access within a component is translated from virtual to physical, as guided by its page-table. Page-tables perform translations at the granularity of pages, thus reinforcing the need for efficient use of spatial locality. A page-table has a specific number of levels7 which are always traversed when translating between virtual and physical addresses. On 32 bit systems, page-tables often have 2 levels, and on 64 bit systems, they often have 4. See the previous section on the costs of memory accesses to assess the overhead of translation. Note that the TLB caches page-table translations to avoid the need for this costly operation on each memory access.
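To make the radix trie concrete, here is a minimal sketch of `page_table_lookup` for a hypothetical two-level, x86-32-style table (4KB pages, a 10+10+12 bit split of the virtual address); the types, the `phys_to_virt` helper, and the simplified flag layout are assumptions, though the present and user/supervisor bits do follow x86 conventions:

```c
#include <stdint.h>

typedef uint32_t virt_addr_t;
typedef uint32_t phys_addr_t;
typedef uint32_t pte_t;

#define PRESENT_BIT (1u << 0)	/* a mapping exists */
#define USER_BIT    (1u << 2)	/* accessible from user-mode; if clear, a
				 * user-mode access faults (see below) */

/* Assumed helper: get a kernel-accessible pointer to physical memory. */
extern void *phys_to_virt(phys_addr_t pa);

phys_addr_t page_table_lookup(pte_t *pgd, virt_addr_t va)
{
	/* First level: the top 10 bits index the page directory... */
	pte_t pde = pgd[va >> 22];
	if (!(pde & PRESENT_BIT)) return 0;	/* unmapped: page fault */

	/* ...whose entry points to a second-level table of 1024 entries. */
	pte_t *pt  = phys_to_virt(pde & ~0xfffu);
	pte_t  pte = pt[(va >> 12) & 0x3ff];
	if (!(pte & PRESENT_BIT)) return 0;	/* unmapped: page fault */

	/* Physical frame address plus the offset within the 4KB page. */
	return (pte & ~0xfffu) | (va & 0xfffu);
}
```

Note the two loads (`pgd[...]` and `pt[...]`): exactly the two memory accesses charged above for a 32 bit translation. The hardware page-walker performs the same walk on each TLB miss.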
Page-table activation/loading is achieved by placing the address of a page-table into a “page-table control” register (e.g. control register three on x86 - `movl %eax, %cr3`). Doing so is a privileged operation, thus ensuring that only the kernel has the ability to control VASs. When switching between components, the kernel writes the destination component’s page-tables into the page-table control register. If TLBs aren’t tagged, then writing to this register flushes the TLB. That means that you’ll suffer a large number of TLB misses after switching, which will require many page-table translations. There is quite a large benefit to tagged TLBs if you’re switching between components frequently.
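In kernel code, the activation itself is a one-instruction affair. A sketch of what the privileged load might look like with GCC-style inline assembly on 32 bit x86 (`page_table_load` is a hypothetical name; executed from user-mode, the same instruction would instead raise an exception):

```c
/* Load the physical address of a page-table into %cr3, making that virtual
 * address space the active one on this core.  On untagged TLBs, this write
 * also flushes the (non-global) TLB entries. */
static inline void page_table_load(unsigned long pt_phys_addr)
{
	__asm__ volatile("movl %0, %%cr3" : : "r"(pt_phys_addr) : "memory");
}
```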
Interactions between dual-mode execution and VASs. How does the processor know which memory regions are part of the kernel, so that user-level accesses to that memory cause exceptions? This is essential to maintain the dual-mode protection of the kernel. As it turns out, many page-table structures have a single bit per mapping entry that denotes if the page is kernel-mode. Thus, if a processor is executing in user-mode, and attempts to translate an address that has the kernel-mode bit set, an exception is generated (a page fault on x86), and the kernel must include logic to deal with the illegal access.
If your software uses spatial locality on the scale of pages well, it will use the TLB cache better. This will minimize the frequency of page-table translations, which, given the math for computing memory access-based performance, can significantly improve performance.
Switching between VASs will flush the TLB if it isn’t tagged, which causes significant indirect overhead. If the VAS that is switched away from uses \(N\) TLB entries, the switch will indirectly cause quite a few TLB misses when we switch back to that VAS. This costs up to \(\alpha N H\) cycles, where \(\alpha\) is the cost of a memory access (between approximately 3 and 150), and \(H\) is the height of the page-table. The direct cost of switching between VASs involves the processor overhead of writing to the page-table control register. This is a serializing instruction, so it removes the OoO parallelism the processor is trying so hard to achieve. On x86, the observed direct overhead of loading a page-table is around 50-100 cycles (lower-bounded by around 14 cycles).
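Plugging in some representative numbers (chosen for illustration, not measured): with \(N = 64\) TLB entries in use, page-table height \(H = 4\), and a pessimistic \(\alpha = 150\) cycles per access of the walk, the indirect cost bound is

\[\alpha N H = 150 \times 64 \times 4 = 38400 \text{ cycles},\]

which dwarfs the 50-100 cycle direct cost of the `%cr3` write itself. In practice many of the walk’s accesses will hit in the caches (\(\alpha\) closer to 3-30), but the point stands: the indirect costs dominate.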
I want to give a couple of examples of how to apply knowledge of atomic elements to understand the overheads and relationships in compound8 bodies of software.
Threads are a software abstraction. They are not the only abstraction for concurrent execution. For example, tasks provide a finer-grained, transient abstraction for parallelism, and events enable manual interleaving of concurrent computations. Interrupts and multi-core systems provide the basis for concurrency and parallelism in systems (though cooperative scheduling alone is sufficient to require concurrent programming).
Minimally, switching from thread \(\tau_f\) to thread \(\tau_t\) requires:

1. saving \(\tau_f\)’s registers - the general purpose registers, the instruction pointer, and the stack pointer - into memory (e.g. a per-thread structure),
2. restoring \(\tau_t\)’s previously saved registers, and
3. resuming \(\tau_t\)’s execution at its restored instruction pointer, on its own stack.
Switching between threads requires some cleverness. We have to save and restore not just the general purpose registers, but also the instruction pointer and the stack pointer. Take a few seconds to think about why this might be complicated.
If the number of registers we need to save is on the order of 32 words, these might take up around 2-4 cache-lines per thread. As these are often saved sequentially in memory, the prefetcher will likely hide the cost of the cache-lines after the first. The threads each likely have a separate structure (e.g. a thread control block (TCB)), and separate stacks. Each of these might use a separate TLB entry, though many systems (e.g. Linux) co-locate the TCB on the same page as the stack to avoid this overhead.
So we have the following cache-line accesses:

- 2-4 cache-lines to save \(\tau_f\)’s registers, and another 2-4 to restore \(\tau_t\)’s,
- the cache-lines holding each thread’s TCB, and
- the cache-lines of each thread’s stack touched around the switch.
There are also indirect costs to thread switching. These are best demonstrated by assuming that we have two thread switches: \(\tau_f \overset{dispatch}\longrightarrow \tau_t \overset{dispatch}\longrightarrow \tau_f\). Before the first switch, \(\tau_f\)’s memory contents are in the processor caches. When \(\tau_t\) executes, it might bring all of its memory contents into the caches, evicting all of the previous context that \(\tau_f\) had. When we switch back to \(\tau_f\), the indirect cost is paying the memory cost (150-500 cycles) to bring each of its cache-lines back into the caches. Put another way, thread switches exhibit poor temporal locality.
If we are using 1:N threading, then the overheads likely end here. User-level can switch between its own threads with only these overheads. If we are instead using 1:1 threading, then the kernel switches between threads. Thus, we must switch to the kernel to perform the dispatch operation. This adds the overhead of a system-call (or interrupt if we switch in response to an I/O event). Additionally, if we are switching between two threads in separate virtual address spaces, then we add the cost of switching between VASs as well.
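A user-level (1:N) switch really can be done without involving the kernel in the dispatch itself. A minimal sketch using POSIX `ucontext` (which saves and restores exactly the registers, stack pointer, and instruction pointer discussed above; real user-level threading runtimes use hand-written assembly for speed):

```c
#include <ucontext.h>
#include <stdio.h>

static ucontext_t main_ctx, thd_ctx;
static char thd_stack[64 * 1024];	/* the second thread's stack */

static void thread_fn(void)
{
	printf("in user-level thread\n");
	/* dispatch back: save our register state, restore main's */
	swapcontext(&thd_ctx, &main_ctx);
}

int main(void)
{
	getcontext(&thd_ctx);
	thd_ctx.uc_stack.ss_sp   = thd_stack;
	thd_ctx.uc_stack.ss_size = sizeof(thd_stack);
	thd_ctx.uc_link          = &main_ctx;
	makecontext(&thd_ctx, thread_fn, 0);

	/* dispatch purely at user-level: no kernel thread switch, no VAS
	 * switch -- just saving and restoring register state */
	swapcontext(&main_ctx, &thd_ctx);
	printf("back in main\n");
	return 0;
}
```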
These costs for 1:1 threading are dictated by the following sequence of actions:

1. \(\tau_f\) enters the kernel via a system call (or an interrupt arrives), switching the core to kernel-mode,
2. the kernel saves \(\tau_f\)’s registers and chooses \(\tau_t\) to run,
3. if \(\tau_t\) is in a different VAS, the kernel loads \(\tau_t\)’s page-table (flushing the TLB if it isn’t tagged),
4. the kernel restores \(\tau_t\)’s registers, and
5. the kernel returns to user-mode, resuming \(\tau_t\)’s execution.
Synchronous IPC can be implemented as synchronous rendezvous between threads. In this case, two threads cooperate to emulate the semantics of a function call. \(\tau_f\) “calls” a function provided by \(\tau_t\) and waits for it to be calculated before returning. We’ll assume the threads are in separate address spaces, for now. Let’s observe the necessary overheads for this:

1. \(\tau_f\) makes a system call to send the request and wait for the reply,
2. the kernel switches to \(\tau_t\)’s VAS and dispatches \(\tau_t\),
3. \(\tau_t\) executes the requested “function”,
4. \(\tau_t\) makes a system call to send the reply and wait for the next request, and
5. the kernel switches back to \(\tau_f\)’s VAS and dispatches \(\tau_f\), which resumes as though its “call” had returned.
The kernel design dictates how much software overhead there is between each of these operations. Optimizing the IPC path involves minimizing this overhead.
Asynchronous IPC enables two threads to pass messages and events between each other. I’ll assume for simplicity that the API exposed for this mimics POSIX `read` and `write`. Notably, a `read` with no available messages blocks the calling thread. Most students seem to believe that asynchronous IPC is fundamentally faster. This should show that the truth is nuanced. For two threads to exchange data:

1. \(\tau_f\) `write`s the request into the channel (a system call),
2. \(\tau_f\) calls `read` for the response, and blocks since none is available yet,
3. the kernel switches VASs and dispatches \(\tau_t\), whose pending `read` returns with the request,
4. \(\tau_t\) computes the response and `write`s it (a system call), making \(\tau_f\) runnable again,
5. \(\tau_t\) calls `read` for its next request, and blocks, and
6. the kernel switches back to \(\tau_f\)’s VAS and dispatches it; its `read` returns with the response.

Notably, there are twice the number of system calls. However, if we assume that the threads can send multiple messages before awaiting a reply from the other thread (or that there is no reply), then the costs of switching between VASs and threads is amortized over multiple messages. In the case where VAS and context switching is expensive, this can pay off. However, it relies on the software in each thread being able to generate multiple messages before synchronizing with the other thread. This might be OK when we’re talking about sending networking packets to an HTTP server, but less so when asking another server for memory in response to `malloc`.
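The read/write-style message passing above maps directly onto something you can run. A toy sketch using a pair of POSIX pipes between a parent and child process (processes rather than threads, so the kernel really does switch VASs on each block and wake-up; error handling elided):

```c
#include <unistd.h>
#include <stdio.h>

int main(void)
{
	int req[2], rsp[2];
	pipe(req); pipe(rsp);		/* request and response channels */

	if (fork() == 0) {		/* "server": tau_t */
		char buf[64];
		ssize_t n = read(req[0], buf, sizeof(buf));	/* blocks for a request */
		write(rsp[1], buf, n);				/* reply */
		_exit(0);
	}

	/* "client": tau_f */
	write(req[1], "hello", 5);			/* write the request */
	char buf[64];
	ssize_t n = read(rsp[0], buf, sizeof(buf));	/* block until the reply */
	printf("got %.*s\n", (int)n, buf);
	return 0;
}
```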
I stated at the start of this that once you understand the atoms, you can extrapolate which operations are required for different higher-level operations (e.g. IPC), and have a good estimate of the lower-bound, and upper-bound on performance. However, if you change the atomic elements, everything changes. I want to give one vision of what this can look like through a few research projects.
Singularity is a former Microsoft project that eschews hardware protection based on dual-mode and VAS, instead replacing it with language techniques based on type safety. The new atoms of the system are the language mechanisms for type-safety, garbage collection, and messaging primitives. The atoms are in the software, and are, in some sense, more difficult to understand as they are quite complicated. But the fundamental equations for system calls and IPC change completely.
CHERI is a new architecture based on hardware capabilities to constrain accesses to different regions in memory. It can be used to partition memory between different “components”. Notably, the protection of memory can be done at byte granularity, rather than page granularity. This is quite nice because we can provide protection at the granularity of actual programming constructs like arrays and `struct`s. In contrast to VASs, which provide the virtualization of memory so that each component sees a different mapping of its addresses to physical memory, CHERI only provides protection, not virtualization. CHERI capabilities can be combined with VASs, but if they aren’t, then many of the VAS-specific atoms are no longer relevant. Instead, the new atoms are the mechanisms for managing, delegating, and revoking CHERI capabilities.
So it is important to always keep in mind what atoms you’re working with, and to construct your view of compound system structures based on these atoms, but it is also important to be cognizant of what your atoms could be. Changing your frame of reference is often a useful thing to do when conducting research.
I’m pretending that there are no Quarks, Leptons, or Bosons here. If you were suitably imaginative, you might imagine these are instead the hardware designs that build up the atoms on which the OS is built.↩︎
What is a cycle? Your CPU ripping along at 2GHz is executing 2 billion cycles per second. The CPU’s pipeline generally ticks along once each cycle. Some operations take more than a cycle (for example, division and loads), which is why cores are internally doing many things at once. To keep the CPU busy, instructions that don’t have data dependencies can execute at the same time.↩︎
This is more complicated on x86 given x86 support for four rings, but we simplify for the sake of generality.↩︎
Similar to sensitive instructions, with the main difference appearing with virtualization.↩︎
Technically, all of these instructions write to non-renamable registers/state. In this case, the mode-bit is written to, and the Out-of-Order (OoO) pipeline of the core can’t do its magic without register renaming.↩︎
We use the term “component” here as a generic computation consisting of some code and data that executes at user-level. Components might be processes, virtual machines, micro-kernel servers, depending on the type of system.↩︎
Using different page-sizes means that radix-trie translations for specific addresses might require traversing different numbers of levels.↩︎
Yes, we’re abusing the chemistry analogy.↩︎