Kernel 5
Confused deputy problems in kernels
Confused deputy problems are a specific form of privilege escalation: a privileged entity, such as a kernel, is fooled into doing inappropriate work on behalf of a less-privileged entity.
Imagine a kernel with two processes, Alice and Eve. Process memory is
correctly isolated by the kernel: Eve can only access her own virtual
memory; any attempt to read or write other memory, such as Alice’s
memory, will cause a fault. But what if a system call implementation in
the kernel contained a memcpy
whose arguments are not fully
checked?? Then, most likely, Eve could call that system call with bogus
arguments, causing the kernel to read or write other memory on behalf
of Eve! Since the kernel has full system privilege, this will not cause
a fault, and Eve’s evil intentions will be carried out. The kernel is
the confused deputy here. It is acting on behalf of Eve (making it Eve’s
deputy), but it did not check whether its operations were valid for Eve
to perform (making it confused).
Explicit permission checks are key to avoiding confused deputy problems. When a privileged entity performs work on behalf of a less-privileged entity, the privileged entity should check every argument the less-privileged entity supplied.
The OS in class had some pretty serious confused deputy problems, centering on its RAM disk. A RAM disk is like an abstraction of a buffer cache: a part of RAM reserved for file-like storage. Our RAM disk features two system calls:
ssize_t sys_read_ramdisk(void* buf, off_t off, size_t sz);
ssize_t sys_write_ramdisk(const void* buf, off_t off, size_t sz);
The read_ramdisk
system call copies sz
bytes of data out of the
RAM disk into the user’s buffer buf
, starting at offset off
. The
write_ramdisk
system call copies sz
bytes of data into the RAM
disk from the user’s buffer buf
, starting at offset off
. In each
case, the return value is the number of bytes copied, or -1 on error.
A properly-secured system call implementation would check each of these arguments. Specifically, it would ensure that:
off+sz
does not overflow.- All bytes in the range [
off
,off+sz
) lie within the RAM disk. buf+sz
does not overflow.- For
read_ramdisk
, all bytes in the range [buf
,buf+sz
) are writable by the process. - For
write_ramdisk
, all bytes in the range [buf
,buf+sz
) are readable by the process.
These checks are necessary and sufficient.
But the actual kernel from class did none of this:
case INT_SYS_READ_RAMDISK:
case INT_SYS_WRITE_RAMDISK: {
char* buf = (char*) current->p_registers.reg_rdi;
uintptr_t off = current->p_registers.reg_rsi;
uintptr_t sz = current->p_registers.reg_rdx;
if (current->p_registers.reg_intno
== INT_SYS_READ_RAMDISK) {
memcpy(buf, ramdisk + off, sz);
current->p_registers.reg_rax = sz;
} else {
memcpy(ramdisk + off, buf, sz);
current->p_registers.reg_rax = sz;
}
break;
}
Oops!
Using offset and size
First, by choosing an appropriate off
and sz
, Eve could use
sys_write_ramdisk
to write garbage into Alice’s process structure,
causing Alice to crash.
We saw a version of Eve’s code that actually did detective work on
kernel memory, using sys_read_ramdisk
, to find Alice’s process
structure. Here’s the code:
// Eve thinks the process structure looks like this...
typedef struct hypothetical_proc {
pid_t hp_pid;
x86_64_registers hp_reg;
char hp_other_stuff[0]; // not sure how big
} hypothetical_proc;
app_printf(0, "Searching for Alice in ramdisk...\n");
for (ssize_t off = -0x2000; off < 0x2000; off += 8) {
hypothetical_proc hp;
sys_read_ramdisk(&hp, off, sizeof(hp));
if (hp.hp_pid == 2
&& hp.hp_reg.reg_rip >= 0x100000
&& hp.hp_reg.reg_rip <= 0x108000
&& hp.hp_reg.reg_rsp >= 0x100000
&& hp.hp_reg.reg_rsp <= 0x180000) {
app_printf(0, "@%zx: pid=%d rip=%zx rsp=%zx\n",
off,
hp.hp_pid,
hp.hp_reg.reg_rip,
hp.hp_reg.reg_rsp);
app_printf(0, "Found her!\n", off);
app_printf(0, "EVE REKT\n");
hp.hp_reg.reg_rip = 0x10000000;
sys_write_ramdisk(&hp, off, sizeof(hp));
sys_yield(); // let her die
}
}
Eve searches for an area of memory that looks like Alice’s process structure: it has Alice’s process ID, and it has %rip and %rsp registers in the right regions.
The kernel can add checks for off
and sz
to catch this error.
Using the buffer
Even after we fixed the off
and sz
checks, Eve could use an
inappropriate buf
to cause sys_read_ramdisk
to write garbage
into Alice’s process structure, causing Alice to crash!
In class, this code (which, unlike the previous example, relies on known address arguments) killed Alice:
hypothetical_proc hp;
hp.hp_reg.reg_rip = 0x10000000;
sys_write_ramdisk(&hp.hp_reg.reg_rip, 0, 8);
sys_read_ramdisk((unsigned char*) 0x49000
+ 0xd8 * 2
+ ((uintptr_t) &hp.hp_reg.reg_rip - (uintptr_t) &hp),
0, 8);
The unchecked buf
argument lets Eve convince the kernel to overwrite
Alice’s process structure, by giving the kernel address of that
structure!
The kernel can add checks for buf
to catch this error. It’s not
really enough to check the numeric value of buf
. In addition to
overflow checking, a wise kernel will check the virtual memory
permissions for the whole [buf
, buf+sz
) range, which on
WeensyOS would use code like this:
` int want_perm = PTE_P | PTE_U; // NB `PTE_P` is redundant `
if (current->p_reg.reg_intno == INT_SYS_READ_RAMDISK) {
want_perm |= PTE_W;
}
for (uintptr_t va = (uintptr_t) buf; va < (uintptr_t) va + sz;
va = ROUNDUP(va + 1, PAGESIZE)) {
vamapping m = virtual_memory_map(current->p_pagetable, va);
if ((m.perm & want_perm) != want_perm) {
// bad permissions!
current->p_reg.reg_rax = -1; // Fault
goto syscall_done;
}
}
Note the somewhat-complicated increment of va
here. A simpler
page-based increment, such as va += PAGESIZE
, might not check every
necessary page. A byte-based increment, such as ++va
, would check
everything required, but would make redundant checks.
Exception review
Exceptions come in three varieties: interrupts, faults, and traps.
Interrupts are caused by hardware devices; a timer interrupt is an
example. Faults are caused by software errors, such as page faults
(invalid memory accesses) and protection faults (attempts to execute
dangerous instructions at unprivileged CPL). Traps are caused by
software intentionally, such as explicit int
instructions or
syscall
instructions.
Each exception initiates a protected control transfer to the kernel, but
there are important differences. Every protected control transfer pushes
the unprivileged process’s instruction pointer onto the kernel stack,
allowing the kernel to eventually resume the unprivileged process. For
traps and interrupts, the saved instruction pointer points to the next
instruction to execute. For example, in WeensyOS, the instruction
pointer pushed for a system call is the address of the instruction
after the int
instruction. The processor hardware ensures that the
previous instruction has executed to completion—all its effects are
represented in registers and memory—and then it pushes the next
instruction’s address and invokes the kernel.
Faults are handled differently. The saved instruction pointer points to the faulting instruction. For example, in WeensyOS, the instruction pointer pushed for a page fault is the address of the instruction that caused the fault. The processor hardware undoes any of the faulting instruction’s partial effects, restoring registers and memory to their values before the faulting instruction began; pushes the faulting instruction’s address; and invokes the kernel.
The trap case is easy to reason about: the next instruction’s address is
pushed just as in a normal call
instruction. (The return address
pushed on the stack is also the address of the next instruction after
the call.) The fault case allows the kernel to restart the unprivileged
process exactly at the faulting instruction, allowing it to retry. And
this lets kernels build amazing virtual memory functionality.
Fault handlers implementing OS policy
In this section, we investigate a couple of OS policies and show how VM manipulations and kernel fault handlers can implement those policies.
Busyness tracking
OS Goal: Track how much VM a process is actively using.
Implementation:
- The kernel marks the process as using 0 pages, and then marks all process VM as temporarily inaccessible. (This “temporarily inaccessible” information is stored either in the page table, using bits reserved by the hardware for OS use, or in a secondary structure that mirrors the page table.)
- When process code accesses a page, it will fault.
- The fault handler examines the faulting address. If the address is temporarily inaccessible, the kernel restores the process’s access (marking the page as accessible), increments the process’s page counter, and restarts the process. This reruns the faulting instruction, which will this time succeed. (If the fault isn’t one the kernel understands, the process should be killed.)
Notes: This will cause at most one fault per accessed page, because after a fault the memory becomes accessible.
Marking all memory as inaccessible is expensive, but a kernel can get a pretty good estimate of busyness by sampling: marking 1/1000 of pages as inaccessible and counting how many of those pages cause faults.
A variant of this plan can be used to track how often memory is accessed, letting the operating system measure which memory has been least recently used.
x86-64 processor MMUs actually already track this information by setting
a specific bit in the page table for each accessed page (the PTE_A
bit). However, for very large processes, walking page tables to read
PTE_A
bits can be expensive.
To track memory usage over time, the kernel will periodically re-run this algorithm.
Demand-paged executables
OS Goal: When running a large executable, only load from disk those portions of the executable that are actively used.
Implementation:
- Rather than load an executable into process memory, the kernel marks the virtual memory corresponding to process code as inaccessible.
- When process code accesses a page of code, it will fault.
- The fault handler examines the faulting address. If the address corresponds to unloaded process code, the kernel will load the code at that point. Once the code is loaded and copied (or mapped) into process memory, the process is restarted.
This is called demand paging because pages are only loaded on demand.
Memory-mapped I/O
OS Goal: Support mmap
.
Implementation: Same as demand-paged executables, but now for any file that can exist in the buffer cache!
Swapping
OS Goal: Support processes that, in aggregate, require more memory than the machine has.
Implementation: Assume a process requires additional physical memory, but no physical memory is available.
- The operating system kernel selects a victim page and writes the contents of that page to disk, for example in a special area called the “swap area.” The victim page is marked inaccessible in any page table that references it.
- The evicted page (the victim page) is now available for use!
- If a process accesses the victim page later, the OS will repeat the eviction process, but then load the page data from the swap area back into memory.
Swapping on modern machines is viciously slow—disks are radically slower than memory—but it’s still super useful to support workloads that appear to be larger than primary memory. Serious problems won’t occur unless the working set of actively-used memory is larger than primary memory.
Copy-on-write fork
OS Goal: Make fork
fast by allowing processes to share access to
writable memory.
Generally, isolated processes should not be allowed to share writable memory—they are supposed to have their own memory spaces. But it’s OK to share memory that has the same contents. If two pages have the same contents, sharing and isolation are indistinguishable! The key idea behind the copy-on-write optimization is to share pages after a fork until they have different contents.
Implementation:
- When a process forks, don’t copy any of its memory. Instead, mark all memory as temporarily unwritable.
- When a process faults, the kernel checks if the fault is a write to a temporarily-unwritable page. If so, then the kernel copies the page at that point, installs a new writable mapping to the copy, and restarts the process. The process will then retry the write and continue as if nothing had happened.
Note that there’s no need to copy a page that has only one reference.