Kernel 2: Process isolation (2017)

These 2017 notes are being provided as interim notes until the 2018 notes are ready.

Kernel vs. process

The kernel is the piece of operating system software that runs with full machine privilege. It starts up when the machine boots and keeps running until the machine shuts down.

The software we tend to think of when we think of an operating system—the window system, a compiler, a text editor, a web browser—does not run with full machine privilege. It runs in unprivileged mode, in units called processes (or user processes).

A process is a program in execution. Processes come and go; they can be started, and they can exit or be killed. A single program—that is, an executable binary, like ls or emacs—might be running simultaneously on the same computer in many different processes.

Another way to think about it is that a process is a virtual computer. Each process can behave like it has its own processor and memory. The operating system, and especially the kernel, manipulates the hardware so that different processes don’t interfere inappropriately.

Starvation

The kernel aims to provide fast & fair sharing of machine resources. But true fairness is very difficult to enforce, and it’s not always the right goal. Some processes are more important than others, and some processes have greater or lesser resource requirements.

The minimal version of fairness is avoiding starvation, and almost all OSes provide this goal.

Informally, a process starves, or experiences starvation, when it gets no access to a resource. Formally, a process starves if it demands the resource infinitely many times, but acquires the resource only finitely many times. (We need the formal definition to cover cases like if a process asks for a resource just once—for instance, the process tries to send a single network packet, and doesn’t succeed because the machine is too busy, but never tries again.)

In our Eve vs. Alice infinite-loop example, when Eve wins, Alice is prevented from ever accessing the CPU, so Alice starves. When we introduced timer interrupts, Eve still got to run more often—every time Alice ran, Alice voluntarily surrendered the CPU to Eve, but Eve never voluntarily surrendered the CPU—but Alice did not starve.

Process isolation

Another, more formal, way to put this goal is that the kernel aims to provide process isolation. This means that

Processes can interact with one another only as allowed by OS policy.

The introduction of policy here is what makes OSes different. Almost all OSes include starvation avoidance in their policy. Modern OSes also aim to isolate processes from bugs in other processes. For example, in modern OSes, a memory bug in one process will not affect another process, because processes have logically distinct memory spaces—each process behaves like it has its own primary memory. But processes are not completely isolated; communication is allowed, but it must take place over channels allowed by policy, such as files or explicitly shared memory.

(Older OSes implemented weaker forms of process isolation; for instance, in all versions of Mac OS before Mac OS X, there was no memory isolation, so a memory bug in one process could bring down the whole machine!)

Stronger forms of process isolation are difficult to implement, because both processes and the kernel are just software—executable instructions telling the processor what to do. The software instructions executed by the kernel have full control over machine resources (otherwise the OS would be crippled). But a process with the same power could easily violate process isolation! This is why we need need hardware help. The hardware must offer a notion of privilege, where certain instructions—we call them dangerous instructions—cannot be executed, except by privileged code: the kernel.

x86-64 privilege

x86-64 computers implement privilege using a notion called the current privilege level, or CPL. This value can be 0 or 3. (It can also be 1 or 2 but forget those, they are very rarely if ever used.) 0 means privileged, 3 means unprivileged. This, upsettingly, means that the numerically smaller privilege level means higher privilege. Oh well.

CPL is stored in a special-purpose register called the %cs register. Historically this register was frequently used, but now it’s basically only used to hold the CPL. Dangerous instructions check the CPL to determine how to behave; for instance, I/O instructions, like cli (disable interrupts), fault if the current CPL is too unprivileged to perform the requested action.

Instructions that change the %cs register are dangerous! Code is allowed to reduce its privilege—to go from a higher privilege level to a lower level, i.e., from CPL 0 to CPL 3; but any attempt to raise the privilege level, i.e., from CPL 3 to CPL 0, will cause a fault.

Protected control transfer

But unprivileged code still needs to transfer control to privileged code—for system calls, for instance. The hardware supports a mechanism for safely raising privilege in this way, called protected control transfer.

On x86-64, most protected control transfer uses the processor’s exception mechanism. An exception is an event that interrupts the normal instruction flow, transferring control to the kernel. Exceptions come in three varieties: a trap is initiated by software, on purpose (e.g., the int instruction); an interrupt is initiated by hardware devices, by surprise (e.g., a timer interrupt); and a fault is initiated by the CPU, in response to a software error (e.g., a segmentation fault or privilege violation).

The x86-64 CPU responds to exceptions using information pre-specified by the kernel. Specifically, as it boots up, the kernel initializes a set of interrupt tables (and other structures) that define, for each kind of exception, exactly what the CPU should do. An interrupt table entry defines these important aspects of protected control transfer:

The new stack pointer;
The new instruction pointer;
And the new privilege level.

To handle an exception, the CPU looks up the relevant interrupt table entry; stores several special-purpose registers onto the defined new stack (including the old %cs, %rsp, and %rip); and sets the new %rip and %cs to the defined values. In your WeensyOS, the interrupt table setup occurs in k-hardware.c:segments_init (using macros like set_gate), and the exception handlers are located in k-exception.S.

Protected control transfer is safe because the kernel pre-specifies allowed entry points. If entry points were not pre-specified, and the processor let processes jump to privileged code at any address, then processes could probably accomplish nefarious goals and violate process isolation by choosing just the right unexpected entry point. This would be a privilege escalation attack, because unprivileged code was able to effectively raise its privilege level by abusing an operating system feature.

The instructions used to modify are interrupt tables (e.g., lgdt, lidt) are, of course, dangerous—a process that can install its own privileged exception handlers can easily escalate its privilege.

Virtual memory

But it’s not only instructions that require protection. Memory also requires protection—for instance, it contains instructions. A process that can modify kernel code can clearly accomplish any goal that requires privilege.

Modern OSes isolate process memory from kernel memory (“kernel isolation”), and also isolate different processes’ memory from each other. Each process has its own memory space.

The hardware protection mechanism used to provide this goal is virtual memory. This is one of the coolest hardware mechanisms we’ll see, because it’s so powerful and flexible. It supports way more than protection; it can be used to make code much faster, and even to support cool features like memory-mapped I/O (the mmap system call).

Virtual memory distinguishes the addresses used to refer to physical memory chips, which are called physical addresses, from the addresses used by software and instructions, which are called virtual addresses. A functional procedure, implemented by the processor, maps virtual addresses to physical addresses. Different processes can use different procedures, giving them entirely different views of memory.

Different processors implement this procedure in different ways, but we will focus on the mechanism used by x86-64, which is a multi-level page table.

Ternary VM

Imagine an architecture with a total of 3³ = 27 different addresses. We’ll write addresses for this architecture in ternary form—for example, the address 012 represents 0×3²+1×3+2 =

Each of those “digits” could be called a trit (a ternary bit).

A multi-level page table divides addresses into parts. The least significant part is called the offset; the more significant parts are called indexes. In our architecture, the most-significant trit is the level 1 index; the middle trit is the level 2 index; and the least-significant trit is the offset.

Address   a b c
          | | |
          | | \---> offset
          | \-----> level 2 index
          \-------> level 1 index

Or, in code:

PAGEOFFSET(addr) = addr % 3
L1PAGEINDEX(addr) = (addr / 3) % 3
L2PAGEINDEX(addr) = addr / 9

The processor translates virtual to physical addresses using the indexes and offset. The lookup procedure starts from a special-purpose register called %cr3 (the same name as in x86-64). This register holds the physical address of the level 1 page table page (also called the “level 1 page table”). (All general-purpose registers hold virtual addresses, as do %rip, %rsp, etc.; really only %cr3 and one or two other special-purpose registers hold physical addresses.) Each page table page holds an array of addresses. The level 1 index is used to index into this array, obtaining the physical address of the level 2 page table page. The level 2 index is used to index into that array, obtaining the physical address of the destination physical page, an array of bytes. Finally, the offset is used to index into that page, obtaining the physical address.

The following pseudo-code makes this concrete.

address_t ternary_virtual_to_physical(uintptr_t va, address_t cr3, int access_type) {
    cur_pt = cr3;
    for (int level = 0; level < 2; ++level) {
        index = (va / (3 ** (2 - level))) % 3;   // compute index
        entry = cur_pt[index];   // NB uses physical addressing
        if (access_type is not allowed by entry)
            FAULT;
        cur_pt = entry aligned to a page boundary;
    }
    return cur_pt + (va % 3);    // apply offset
}

Consider the following contents of ternary memory. We divide ternary memory up into “pages” consisting of three addresses each; all addresses in the same page share the same indexes, but can have different offsets. We write a page by listing the three offsets in order. The page addresses are physical, not virtual.

page 00*   page 01*   page 02*
  000        100        200
  010        110        210
  020        120        220

page 10*   page 11*   page 12*
  000        001        002
  000        001        002
  000        001        002

page 20*   page 21*   page 22*
  100        120        200
  110        120        210
  120        120        220

What happens if %cr3 holds physical address 000?

This is what we call an identity mapping: every virtual address is numerically identical to the corresponding physical address; the memory with virtual address N is located at physical address N. To see how, consider the virtual address 122. Here’s how the lookup works:

The level 1 index, 1, is applied to the level 1 page table page, which is at physical address 000. This gives physical address 010 for the level 2 page table page.
The level 2 index, 2, is applied to the level 2 page table page. This gives physical address 120 for the destination physical page.
The offset, 2, is applied to that page’s address, giving 122!

A similar thing happens with address 002.

The level 1 index, 0, is applied to the level 1 page table page, which is at physical address 000. This gives physical address 000 for the level 2 page table page. Note that this is the same address as the level 1 page table page, but now we are treating that physical memory as a level 2 page table page!
The level 2 index, 0, is applied to the level 2 page table page, giving physical address 000 for the destination physical page.
The offset, 2, is applied to that page’s address, giving 002!

But the identity mapping only holds when %cr3 is 000. If %cr3 is 100, for example, the mapping is not an identity. When %cr3 is 100, then only physical pages 000, 010, and 020 are accessible.

But real multi-level page tables have additional features that support distinguishing privileged and unprivileged accesses, and that allow us to cut off some lookups into off-limits memory. Consider this set of memory:

page 00*   page 01*   page 02*
  010        000         -
  110U       010        blub
   F         020         -

page 10*   page 11*   page 12*
  010        200U       200U
  120U       210U       220U
   F          F          F

page 20*   page 21*   page 22*
   -          -          -
 chirp       meow        -
   -          -         bark

We write “U” after an address to indicate that the corresponding entry is accessible to unprivileged processes; if an entry has no “U”, then unprivileged accesses to that entry will cause a fault. An entry that says “F” always causes a fault, whether or not the accessing code has privilege.

Here are some facts about this memory, which you should verify by drawing arrows and working through the examples.

If %cr3 = 000, then unprivileged code can only access physical pages 200 and 210. Specifically, virtual addresses 100-102 correspond to physical addresses 200-202, and virtual addresses 110-112 correspond to physical addresses 210-212.
If %cr3 = 000, then privileged code can additionally access physical pages 000, 010, and 020, using identity mappings for addresses 000-022.
If %cr3 = 000, then unprivileged code can access the bird and the cat (using what virtual addresses?). Privileged code can additionally access the fish.
If %cr3 = 000, then no code can access virtual address 120-222; these addresses fault.

If %cr3 = 100, then unprivileged code can only access physical pages 200 and 220. Specifically, virtual addresses 100-102 correspond to physical addresses 200-202, and virtual addresses 110-112 correspond to physical addresses 220-222.
If %cr3 = 100, then privileged code can additionally access physical addresses 000-022 using identity mappings. Note that %cr3 000 and 100 share the same level-2 page table page for kernel mappings.
If %cr3 = 100, then unprivileged code can access the bird and the dog (using what virtual addresses?). Privileged code can additionally access the fish.

So we see some shared memory (e.g., unprivileged code using either page table can access the bird at the same address), and some independent memory (e.g., virtual address 110 maps to different physical memory on the two page tables).