Kernel 5

Confused deputy problems in kernels

Confused deputy problems are a specific form of privilege escalation: a privileged entity, such as a kernel, is fooled into doing inappropriate work on behalf of a less-privileged entity.

Imagine a kernel with two processes, Alice and Eve. Process memory is correctly isolated by the kernel: Eve can only access her own virtual memory; any attempt to read or write other memory, such as Alice’s memory, will cause a fault. But what if a system call implementation in the kernel contained a memcpy whose arguments are not fully checked?? Then, most likely, Eve could call that system call with bogus arguments, causing the kernel to read or write other memory on behalf of Eve! Since the kernel has full system privilege, this will not cause a fault, and Eve’s evil intentions will be carried out. The kernel is the confused deputy here. It is acting on behalf of Eve (making it Eve’s deputy), but it did not check whether its operations were valid for Eve to perform (making it confused).

Explicit permission checks are key to avoiding confused deputy problems. When a privileged entity performs work on behalf of a less-privileged entity, the privileged entity should check every argument the less-privileged entity supplied.

The OS in class had some pretty serious confused deputy problems, centering on its RAM disk. A RAM disk is like an abstraction of a buffer cache: a part of RAM reserved for file-like storage. Our RAM disk features two system calls:

ssize_t sys_read_ramdisk(void* buf, off_t off, size_t sz);
ssize_t sys_write_ramdisk(const void* buf, off_t off, size_t sz);

The read_ramdisk system call copies sz bytes of data out of the RAM disk into the user’s buffer buf, starting at offset off. The write_ramdisk system call copies sz bytes of data into the RAM disk from the user’s buffer buf, starting at offset off. In each case, the return value is the number of bytes copied, or -1 on error.

A properly-secured system call implementation would check each of these arguments. Specifically, it would ensure that:

off+sz does not overflow.
All bytes in the range [off, off+sz) lie within the RAM disk.
buf+sz does not overflow.
For read_ramdisk, all bytes in the range [buf, buf+sz) are writable by the process.
For write_ramdisk, all bytes in the range [buf, buf+sz) are readable by the process.

These checks are necessary and sufficient.

But the actual kernel from class did none of this:

case INT_SYS_READ_RAMDISK:
case INT_SYS_WRITE_RAMDISK: {
    char* buf = (char*) current->p_registers.reg_rdi;
    uintptr_t off = current->p_registers.reg_rsi;
    uintptr_t sz = current->p_registers.reg_rdx;
    if (current->p_registers.reg_intno
        == INT_SYS_READ_RAMDISK) {
        memcpy(buf, ramdisk + off, sz);
        current->p_registers.reg_rax = sz;
    } else {
        memcpy(ramdisk + off, buf, sz);
        current->p_registers.reg_rax = sz;
    }
    break;
}

Oops!

Using offset and size

First, by choosing an appropriate off and sz, Eve could use sys_write_ramdisk to write garbage into Alice’s process structure, causing Alice to crash.

We saw a version of Eve’s code that actually did detective work on kernel memory, using sys_read_ramdisk, to find Alice’s process structure. Here’s the code:

// Eve thinks the process structure looks like this...
typedef struct hypothetical_proc {
    pid_t hp_pid;
    x86_64_registers hp_reg;
    char hp_other_stuff[0]; // not sure how big
} hypothetical_proc;

app_printf(0, "Searching for Alice in ramdisk...\n");
for (ssize_t off = -0x2000; off < 0x2000; off += 8) {
    hypothetical_proc hp;
    sys_read_ramdisk(&hp, off, sizeof(hp));
    if (hp.hp_pid == 2
        && hp.hp_reg.reg_rip >= 0x100000
        && hp.hp_reg.reg_rip <= 0x108000
        && hp.hp_reg.reg_rsp >= 0x100000
        && hp.hp_reg.reg_rsp <= 0x180000) {
        app_printf(0, "@%zx: pid=%d rip=%zx rsp=%zx\n",
                   off,
                   hp.hp_pid,
                   hp.hp_reg.reg_rip,
                   hp.hp_reg.reg_rsp);
        app_printf(0, "Found her!\n", off);
        app_printf(0, "EVE REKT\n");
        hp.hp_reg.reg_rip = 0x10000000;
        sys_write_ramdisk(&hp, off, sizeof(hp));
        sys_yield(); // let her die
    }
}

Eve searches for an area of memory that looks like Alice’s process structure: it has Alice’s process ID, and it has %rip and %rsp registers in the right regions.

The kernel can add checks for off and sz to catch this error.

Using the buffer

Even after we fixed the off and sz checks, Eve could use an inappropriate buf to cause sys_read_ramdisk to write garbage into Alice’s process structure, causing Alice to crash!

In class, this code (which, unlike the previous example, relies on known address arguments) killed Alice:

hypothetical_proc hp;
hp.hp_reg.reg_rip = 0x10000000;
sys_write_ramdisk(&hp.hp_reg.reg_rip, 0, 8);
sys_read_ramdisk((unsigned char*) 0x49000
                 + 0xd8 * 2
                 + ((uintptr_t) &hp.hp_reg.reg_rip - (uintptr_t) &hp),
                 0, 8);

The unchecked buf argument lets Eve convince the kernel to overwrite Alice’s process structure, by giving the kernel address of that structure!

The kernel can add checks for buf to catch this error. It’s not really enough to check the numeric value of buf. In addition to overflow checking, a wise kernel will check the virtual memory permissions for the whole [buf, buf+sz) range, which on WeensyOS would use code like this:

` int want_perm = PTE_P | PTE_U; // NB `PTE_P` is redundant `
if (current->p_reg.reg_intno == INT_SYS_READ_RAMDISK) {
    want_perm |= PTE_W;
}
for (uintptr_t va = (uintptr_t) buf; va < (uintptr_t) va + sz;
     va = ROUNDUP(va + 1, PAGESIZE)) {
    vamapping m = virtual_memory_map(current->p_pagetable, va);
    if ((m.perm & want_perm) != want_perm) {
        // bad permissions!
        current->p_reg.reg_rax = -1; // Fault
        goto syscall_done;
    }
}

Note the somewhat-complicated increment of va here. A simpler page-based increment, such as va += PAGESIZE, might not check every necessary page. A byte-based increment, such as ++va, would check everything required, but would make redundant checks.

Exception review

Exceptions come in three varieties: interrupts, faults, and traps. Interrupts are caused by hardware devices; a timer interrupt is an example. Faults are caused by software errors, such as page faults (invalid memory accesses) and protection faults (attempts to execute dangerous instructions at unprivileged CPL). Traps are caused by software intentionally, such as explicit int instructions or syscall instructions.

Each exception initiates a protected control transfer to the kernel, but there are important differences. Every protected control transfer pushes the unprivileged process’s instruction pointer onto the kernel stack, allowing the kernel to eventually resume the unprivileged process. For traps and interrupts, the saved instruction pointer points to the next instruction to execute. For example, in WeensyOS, the instruction pointer pushed for a system call is the address of the instruction after the int instruction. The processor hardware ensures that the previous instruction has executed to completion—all its effects are represented in registers and memory—and then it pushes the next instruction’s address and invokes the kernel.

Faults are handled differently. The saved instruction pointer points to the faulting instruction. For example, in WeensyOS, the instruction pointer pushed for a page fault is the address of the instruction that caused the fault. The processor hardware undoes any of the faulting instruction’s partial effects, restoring registers and memory to their values before the faulting instruction began; pushes the faulting instruction’s address; and invokes the kernel.

The trap case is easy to reason about: the next instruction’s address is pushed just as in a normal call instruction. (The return address pushed on the stack is also the address of the next instruction after the call.) The fault case allows the kernel to restart the unprivileged process exactly at the faulting instruction, allowing it to retry. And this lets kernels build amazing virtual memory functionality.

Fault handlers implementing OS policy

In this section, we investigate a couple of OS policies and show how VM manipulations and kernel fault handlers can implement those policies.

Busyness tracking

OS Goal: Track how much VM a process is actively using.

Implementation:

The kernel marks the process as using 0 pages, and then marks all process VM as temporarily inaccessible. (This “temporarily inaccessible” information is stored either in the page table, using bits reserved by the hardware for OS use, or in a secondary structure that mirrors the page table.)
When process code accesses a page, it will fault.
The fault handler examines the faulting address. If the address is temporarily inaccessible, the kernel restores the process’s access (marking the page as accessible), increments the process’s page counter, and restarts the process. This reruns the faulting instruction, which will this time succeed. (If the fault isn’t one the kernel understands, the process should be killed.)

Notes: This will cause at most one fault per accessed page, because after a fault the memory becomes accessible.

Marking all memory as inaccessible is expensive, but a kernel can get a pretty good estimate of busyness by sampling: marking 1/1000 of pages as inaccessible and counting how many of those pages cause faults.

A variant of this plan can be used to track how often memory is accessed, letting the operating system measure which memory has been least recently used.

x86-64 processor MMUs actually already track this information by setting a specific bit in the page table for each accessed page (the PTE_A bit). However, for very large processes, walking page tables to read PTE_A bits can be expensive.

To track memory usage over time, the kernel will periodically re-run this algorithm.

This is a great idea!

Demand-paged executables

OS Goal: When running a large executable, only load from disk those portions of the executable that are actively used.

Implementation:

Rather than load an executable into process memory, the kernel marks the virtual memory corresponding to process code as inaccessible.
When process code accesses a page of code, it will fault.
The fault handler examines the faulting address. If the address corresponds to unloaded process code, the kernel will load the code at that point. Once the code is loaded and copied (or mapped) into process memory, the process is restarted.

This is called demand paging because pages are only loaded on demand.

Memory-mapped I/O

OS Goal: Support mmap.

Implementation: Same as demand-paged executables, but now for any file that can exist in the buffer cache!

Swapping

OS Goal: Support processes that, in aggregate, require more memory than the machine has.

Implementation: Assume a process requires additional physical memory, but no physical memory is available.

The operating system kernel selects a victim page and writes the contents of that page to disk, for example in a special area called the “swap area.” The victim page is marked inaccessible in any page table that references it.
The evicted page (the victim page) is now available for use!
If a process accesses the victim page later, the OS will repeat the eviction process, but then load the page data from the swap area back into memory.

Swapping on modern machines is viciously slow—disks are radically slower than memory—but it’s still super useful to support workloads that appear to be larger than primary memory. Serious problems won’t occur unless the working set of actively-used memory is larger than primary memory.

Copy-on-write fork

OS Goal: Make fork fast by allowing processes to share access to writable memory.

Generally, isolated processes should not be allowed to share writable memory—they are supposed to have their own memory spaces. But it’s OK to share memory that has the same contents. If two pages have the same contents, sharing and isolation are indistinguishable! The key idea behind the copy-on-write optimization is to share pages after a fork until they have different contents.

Implementation:

When a process forks, don’t copy any of its memory. Instead, mark all memory as temporarily unwritable.
When a process faults, the kernel checks if the fault is a write to a temporarily-unwritable page. If so, then the kernel copies the page at that point, installs a new writable mapping to the copy, and restarts the process. The process will then retry the write and continue as if nothing had happened.

Note that there’s no need to copy a page that has only one reference.