Kernel

Kernel and processes
Processor time as resource
Timer interrupts
Current privilege level
Protected control transfer
Memory isolation
Virtual memory
Page tables
Confused deputy attack

Kernel and processes

The kernel is the piece of software that runs with full machine privilege. This means the kernel has full privilege over all a computer’s resources.

(The kernel is the base part of an operating system, which refers to a software ecosystem, including both the privileged software that manages computer resources (the kernel) and the libraries and support programs that interface with that privileged software and make it more convenient. Windows, Linux, and Mac OS X are examples of operating systems.)

Processes are software programs in execution that run without full machine privilege. (Processes are often called unprivileged processes or user-level processes to emphasize their unprivileged status. “User level” is the opposite of “kernel”.)

A process is a program in execution. The relationship between a program and a process is like that between a recipe and a cake. A recipe is an inanimate list of instructions that, interpreted by a cook, can make something delicious. A program is an inanimate list of instructions—a file on disk—that, loaded into memory and interpreted by a processor, can make something magical. A process is a live instance of a program, running at a particular time, on a particular piece of hardware, dealing with a particular set of inputs.

The kernel’s purpose is to serve the needs of processes as a whole. It balances three goals:

Fairly and safely share machine resources among processes.
Provide safe and convenient access to machine resources by inventing abstractions for those resources (such as files, which abstract disks).
Ensure robustness and performance.

In modern operating systems, much kernel code aims to provide protection: ensuring that no process can violate the operating system’s sharing policies. This is because processes can have bugs. A process can crash, or enter an infinite loop, or attempt to take over the machine, maliciously or accidentally. So kernels should prevent mistakes in individual processes from bringing down the system as a whole.

Kernels can achieve these goals only with help from hardware. A running process executes on a processor; that processor executes the process’s instructions, one after another, as fast as possible. The kernel is not emulating the processor—that would be very slow. The processor does not validate each instruction with the kernel. Instead, the processor runs on behalf of the process, and executes most process instructions directly. Processors support special mechanisms, accessible only to privileged code (the kernel), that ensure that processes cannot run amok, and the kernel can still arbitrate resources.

Processor time as resource

One of the most fundamental machine resources is processor time (or CPU time): the fraction of time the processor spends executing one process’s instructions rather than another’s. The kernel aims to share processor time according to its policy.

Here’s a fundamental attack on fair sharing of processor time. It’s the worst attack in the world:

int main() {
    while (true) {
    }
}

An infinite loop. Compiled to x86-64 instructions, this might be

00000000000005fa <main>:
 5fa:   55                      push   %rbp
 5fb:   48 89 e5                mov    %rsp,%rbp
 5fe:   eb fe                   jmp    5fe <main+0x4>

The critical instruction is jmp 5fe, represented in bytes as eb fe, which spins the processor in a tight loop forever.

Aside. Why is this loop represented as 0xeb 0xfe? An instruction consists of an opcode (e.g., “push”, “mov”, “pop”) and some operands (e.g., “%rbp”, “5fe”). Here, the 0xeb part is the opcode. This opcode means “unconditional branch (jmp) by a relative one-byte offset”: when the instruction is executed, the %rip register will be modified by adding to it the signed offset stored as an operand. Here, that operand is 0xfe, which, considered as a signed 8-bit number, is -2. Remember that when an instruction executes, the initial value of %rip is always the address of the next instruction (because the processor must read the entire current instruction before executing it). Thus, adding -2 to %rip will reset %rip back to the start of the jmp.
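The offset arithmetic in this aside can be checked with a few lines of C++. This is only an illustrative sketch: `short_jmp_target` is a hypothetical helper (not part of any toolchain), and it assumes the two-byte `eb xx` encoding described above.

```cpp
#include <cstdint>

// Hypothetical helper: where does a two-byte short jmp (0xeb, operand) land?
constexpr uint64_t short_jmp_target(uint64_t insn_addr, uint8_t operand) {
    // While the jmp executes, %rip already holds the address of the next
    // instruction: one opcode byte plus one operand byte past the jmp.
    uint64_t next_rip = insn_addr + 2;
    // Reinterpret the operand as a signed 8-bit offset (0xfe becomes -2).
    int64_t offset = static_cast<int8_t>(operand);
    // Unsigned addition wraps, so adding a negative offset moves backward.
    return next_rip + static_cast<uint64_t>(offset);
}

// The jmp at 0x5fe with operand 0xfe jumps back to itself.
static_assert(short_jmp_target(0x5fe, 0xfe) == 0x5fe, "jmp targets itself");
```

An operand of 0x00 would instead fall through to the next instruction at 0x600, which is why the loop needs the negative offset.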

Processors execute the instructions they’re given in a simple-minded, straightforward way. If a processor starts executing an infinite loop, how will any other instruction ever run?

You might think to solve this problem by never running untrusted code. For instance, instead of letting the processor directly run an untrusted program, the kernel could interpret the instructions; if it detected at run time that a process had run too long, it could switch to another process. This is a good idea, and some amazing virtual machine technologies use this insight. But it’s tremendously slower than letting a processor run instructions directly—maybe thousands of times slower. It is possible to run untrusted code, safely and efficiently, with a little more hardware support.

Timer interrupts

What is needed is a way to limit the time that any single process can run on the CPU. After that time elapses, the processor should interrupt its execution and switch to the kernel, giving the kernel a chance to run something else.

Machines accomplish this with a separate piece of hardware called the timer. This timer can be configured by the kernel to go off periodically in real time, such as once every millisecond. When the timer goes off, it sends an interrupt to the processor, which gives the processor the chance to run something else.

Timer interrupts are an almost inevitable consequence of the problem of infinite loops. Many other aspects of timer interrupt implementation also follow logically from the problem timer interrupts aim to solve.

Any process that runs for too long must be interrupted by a timer. Therefore, processes must not be allowed to configure the timer: if they could, they could disable the timer or set it to go off once a year.
The kernel should be allowed to configure the timer. Every operating system wants to prevent processes from monopolizing CPU time, but different operating systems enforce very different policies in detailed terms. (Some processes might have priority over others, for example.)
Since the kernel, which is software, can configure the timer, but processes, which are also software, cannot, the processor must support different privilege modes, so that attempts to configure the timer can be distinguished.
A timer interrupt can occur at any time during process execution. This doesn’t indicate a bug in the process—maybe the process is just executing a long-running task—so an interrupted process should be able to pick up exactly where it left off, with all of its registers restored to their original values.

This marks an important difference with function calls. In a function call, the caller voluntarily transfers control to another piece of software. Since the control transfer is voluntary, the caller can prepare for it and implement a calling convention, saving any important registers to the stack and restoring them later. But in an interrupt, the process involuntarily transfers control to the kernel. The process cannot fully prepare.

The processor and kernel’s interrupt handling mechanisms are carefully engineered to save all processor state, allowing the interrupted processes to resume later as if nothing had occurred. This is an instance of protected control transfer, also known as exceptional control transfer.

Current privilege level

We see that process isolation requires that computers have some resources that are protected based on software privilege. For example, the kernel must be able to configure the timer and handle interrupts, and processes must not. However, kernel and processes are both software—sets of machine instructions that the processor executes. How can the processor allow the kernel to execute some instructions, while preventing processes from executing the same instructions?

The key is a hardware feature called the current privilege level. The current privilege level is contained in a special-purpose CPU register. Some instructions, which we call dangerous instructions, can only be executed if the current privilege level indicates that the currently-executing software has full machine privilege.

On x86-64, the current privilege level is a number between 0 and 3 stored in the lower 2 bits of the special-purpose %cs register. When the kernel is running, (%cs & 3) == 0 and the CPU allows the execution of all instructions, including dangerous instructions. When a process is running, (%cs & 3) == 3 and the CPU does not execute dangerous instructions; instead, if it is asked to execute a dangerous instruction, it raises a fault (a kind of exception), saves processor state, and starts running the kernel. This informs the kernel that a process tried to do something illegal. The kernel will generally kill (stop) the offending process.

`%cs & 3`	Meaning
0	Kernel (full machine privilege)
1–2	Unprivileged but unused in most operating systems
3	User (unprivileged)
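As a minimal sketch, the privilege check in the table can be modeled as a mask on the `%cs` value. The function names are hypothetical, and the selector values in the closing note are illustrative rather than taken from any particular kernel.

```cpp
#include <cstdint>

// The low 2 bits of %cs hold the current privilege level (CPL).
constexpr unsigned current_privilege_level(uint16_t cs) {
    return cs & 3;
}

// Model of the hardware rule: dangerous instructions run only at CPL 0.
// At any other level the processor would fault into the kernel instead.
constexpr bool may_execute_dangerous_instruction(uint16_t cs) {
    return current_privilege_level(cs) == 0;
}
```

For example, an assumed kernel selector of 0x08 yields level 0, while an assumed user selector of 0x1b yields level 3 and would fault on a dangerous instruction.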

QUESTION. Is changing %cs (for instance, movw $0, %cs) a dangerous instruction?

ANSWER. Yes! Any change to %cs that increases privilege (that is, that makes %cs & 3 lower) is dangerous. If a process attempts to claim privilege by modifying %cs, the CPU will fault and the kernel will run. The only way to increase privilege is to execute a protected control transfer (a system call or other exception).

Protected control transfer

Recall how the stdio library functions invoke system calls like read() and write(). In the assembly, we saw a syscall instruction being invoked, which returns only after the system call has finished. This syscall instruction is a key interface through which user processes interact with the kernel. It implements a form of protected control transfer: it transfers control of the processor to the kernel in a safe and limited way.

Protected control transfer is safe because a process can enter the kernel only at well-specified entry points. The process can't just jump to arbitrary code residing within the kernel.

Every process's address space contains a portion reserved for the kernel. (Usually this is the higher half of the address space, but in WeensyOS it's the lower part of the address space, the part below PROC_START_ADDR.) One can write a program that attempts to jump to some of the kernel's instructions:

int main() {
    unsigned long kernel_insn = 0xffffffff80000100;
    void (*f)() = (void (*)())kernel_insn;
    f();
}

This program will crash with a segmentation fault, because a user process is not allowed to access anything reserved for the kernel directly. A user process can only invoke the kernel at specific entry points by using the syscall instruction.

QUESTION. Why must we only allow control transfer to kernel at specific points?

ANSWER. The kernel code executes in an environment that's privileged and unprotected, which means it has total and complete control over the machine. Preserving the integrity of the kernel's control flow is therefore extremely important, since we don't want any process on a computer to be able to execute arbitrary kernel code in privileged mode. A breach of this limitation can result in losing control over the machine to a malicious or misbehaving program.

The restrictions guaranteed by protected control transfer are implemented by both the OS software and the hardware. This hardened interface between user land and kernel land is the cornerstone of security in modern computer systems.

Virtual memory

The most fundamental resources on a computer are processor time, registers, and primary memory. Primary memory is RAM—random-access memory: the place where data is stored by address. Process isolation requires that each process have its own views of registers and primary memory. But RAM is stored in electrical components—transistors—and each computer has a RAM of fixed size, where the processor can access any byte of that RAM by address. How can a process be restricted from accessing other processes’ memories, or the kernel’s memory? If a process can modify the kernel’s instructions, then process isolation is fundamentally broken.

Modern computers implement memory isolation with a hardware feature called virtual memory. Just as with processor time, direct, fast access to memory from processes is so important that hardware support is used to implement protection.

We can model virtual memory as a mapping function \mathscr{P} : \textit{VA} \mapsto \textit{PA}. A virtual memory mapping function is like a pair of crazy glasses, or blinders, that constrain and modify the CPU’s view of memory. All memory addresses used by instructions pass through the virtual memory mapping function.

Let’s say an instruction accesses memory by address A. Here’s what happens.

The address A is actually a virtual address taken from the domain \textit{VA}. This address does not correspond to a byte in some RAM chip! Instead, to find the relevant piece of RAM, the CPU must use the current mapping function \mathscr{P}.
The processor therefore computes the value \mathscr{P}(A). This is a physical address taken from the domain \textit{PA}, or a special value denoting an illegal memory access that we call a memory fault or page fault. Physical addresses, unlike virtual addresses, do correspond to bytes in RAM chips.
If the instruction reads the byte at A, the processor will actually read physical memory byte \mathscr{P}(A). If the instruction writes the byte at A, the processor will actually write physical memory byte \mathscr{P}(A). (If \mathscr{P}(A) indicates a fault, the CPU will save the current register state and run the kernel via an exceptional control transfer.)
The machine can have many mapping functions \mathscr{P}_0\dots\mathscr{P}_N, though only one mapping function is active at a time. These different functions can route the same virtual addresses to different physical addresses, thereby isolating different parts of memory at different times. The current mapping function is defined using a special-purpose machine register; on x86-64, this is the %cr3 register.

QUESTION. Should modifying the %cr3 register be a dangerous instruction?

ANSWER. You bet. Modifying %cr3 can change the CPU’s current view of memory.

Page tables

Most modern architectures implement virtual memory mapping functions through a mechanism called page tables. A page table is an in-memory data structure, implemented on x86-64 as a tree with branching factor 512 and maximum depth 4. This data structure is set up and initialized by software—specifically the kernel—but used by the processor: the virtual memory system effectively uses the page table on every memory access to compute \mathscr{P}(A) for the supplied virtual address A. A page table thereby implements the filter, or glasses, through which memory is viewed.

Page tables on x86-64 and most other processors combine address translation with flags that can distinguish different kinds of access. Specifically, a page table defines, for each virtual address:

Whether the corresponding physical address exists.
- P (Present) bit: this part of the filter is OK to access
- If not set, any access causes a fault
Whether unprivileged access is allowed.
- U (Unprivileged) bit: this part of the filter is OK for unprivileged access
- If not set, any access by unprivileged software ((%cs & 3) == 3) causes a fault
Whether the memory is read-only.
- W (Writable) bit: this part of the filter is OK for writes
- If not set, any write access causes a fault
The physical address (assuming the above checks succeed).

This allows arbitrary rearrangement (or aliasing) of memory.
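The flag checks listed above can be summarized as a small predicate. This is a sketch under assumptions: the constant names and bit positions (bit 0 = P, bit 1 = W, bit 2 = U) follow the x86-64 entry layout, and `access_allowed` is a hypothetical name, not a function from any kernel.

```cpp
#include <cstdint>

// Assumed flag bits, matching the x86-64 page table entry layout.
constexpr uint64_t PTE_P = 1;  // present
constexpr uint64_t PTE_W = 2;  // writable
constexpr uint64_t PTE_U = 4;  // unprivileged access allowed

// Returns true if the access is permitted; false models a fault.
constexpr bool access_allowed(uint64_t flags, bool is_write, bool is_user) {
    if (!(flags & PTE_P)) {
        return false;  // nothing is mapped here: any access faults
    }
    if (is_user && !(flags & PTE_U)) {
        return false;  // unprivileged code touching kernel-only memory
    }
    if (is_write && !(flags & PTE_W)) {
        return false;  // write to read-only memory
    }
    return true;
}
```

A mapping with flags `PTE_P | PTE_W` (no U) is the typical shape of kernel-only memory in this model: the kernel can read and write it, but every process access faults.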

Virtual memory for a 64-byte memory

To make page tables concrete, we first describe a tiny, single-level page table for a machine with a one-byte address size, with just 64 meaningful addresses.

First, we divide memory into aligned blocks called pages. On this machine, we'll say the page size is 8: every page is 8 bytes of memory, starting at an address that's a multiple of 8. This splits memory addresses into two parts, the page index and the page offset. The following is a memory address in this architecture:

bit 5	bit 4	bit 3	bit 2	bit 1	bit 0
index			offset

Each of the architecture’s 8 memory pages comprises 8 bytes of memory, starting at an address that’s a multiple of 8. Each page is identified by an index. Within a page, there are 8 bytes—8 different addresses—each with a different offset.

A page table must provide enough information to map any virtual address to a corresponding physical address. It does so by defining page table entries (PTEs) that map a specific virtual page (which comprises a set of virtual addresses with the same virtual page index) to a physical page, along with any permission bits to be used for that virtual page. Here's an example for this architecture:

bit 5	bit 4	bit 3	bit 2	bit 1	bit 0
physical page index (physical address >> 3)			flag bits
physical page index (physical address >> 3)			U	W	P

In our 6-bit architecture, the lookup proceeds as follows (with virtual address va and access type at):

Start from physical address %cr3 (location of the page table)
Access physical memory at %cr3[va >> 3]: this is the relevant page table entry. (va >> 3 is the page index)
Check flag bits; maybe fault.
If the access is OK, return (%cr3[va >> 3] & 0b111000) | (va & 7) as the physical address.
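The lookup steps above can be sketched as follows. The page table is modeled as an 8-entry array standing in for the memory that `%cr3` points to; `TinyPageTable`, `translate`, `example_table`, and the `FAULT` sentinel are all hypothetical names chosen for illustration.

```cpp
#include <cstdint>

constexpr int FAULT = -1;  // stand-in for "the processor raises a fault"
constexpr uint8_t PTE_P = 1, PTE_W = 2, PTE_U = 4;  // flag bits 0-2

// One page table: 8 one-byte entries, one per virtual page.
struct TinyPageTable {
    uint8_t entries[8];
};

constexpr int translate(const TinyPageTable& pt, uint8_t va,
                        bool is_write, bool is_user) {
    uint8_t pte = pt.entries[(va >> 3) & 7];  // va >> 3 is the page index
    if (!(pte & PTE_P)) return FAULT;              // not present
    if (is_write && !(pte & PTE_W)) return FAULT;  // read-only page
    if (is_user && !(pte & PTE_U)) return FAULT;   // kernel-only page
    // Physical page address from the entry, offset from the virtual address.
    return (pte & 0b111000) | (va & 7);
}

// Example: virtual page 0 -> physical page 5 (user-writable);
// virtual page 1 -> physical page 2 (kernel-only, read-only); rest unmapped.
constexpr TinyPageTable example_table = {{
    (5 << 3) | PTE_P | PTE_W | PTE_U,
    (2 << 3) | PTE_P,
}};
```

With this table, a user write to virtual address 3 reaches physical address 43 (page 5, offset 3), while any user access to virtual address 10 faults because the U bit is clear.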

Toward x86-64: Single level page table

x86-64, of course, has way more than 64 addresses. How can we design a page table format that works for 48-bit addresses?

Wait a minute, I thought x86-64 addresses were 64 bits! That’s right, they are, but in current processors, the 16 upper bits—bits 48–63—are reserved for future expansion. The value of each bit 48–63 must equal the value of bit 47, so only 2⁴⁸ of the 2⁶⁴ possible addresses are valid—specifically, 0x0000'0000'0000'0000–0x0000'7FFF'FFFF'FFFF and 0xFFFF'8000'0000'0000–0xFFFF'FFFF'FFFF'FFFF. (But some recent processors support 57-bit addresses!)
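The rule that bits 48–63 must copy bit 47 can be written as a short check. This is a sketch assuming 48-bit addressing (not the 57-bit variant); `is_canonical` is a name chosen here for illustration.

```cpp
#include <cstdint>

// A 48-bit-style address is valid when bits 47..63 are all 0 or all 1.
constexpr bool is_canonical(uint64_t va) {
    uint64_t upper = va >> 47;  // the 17 bits from bit 47 up to bit 63
    return upper == 0 || upper == 0x1FFFF;
}
```

The boundary addresses of the two valid ranges pass this check, while 0x0000'8000'0000'0000 (bit 47 set, upper bits clear) does not.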

Let's first consider a single level page table. Here, a virtual address contains one index into the page table and an offset that is always 12 bits. When we use that index to find the corresponding entry in the page table page, we get a physical page address. The offset from the virtual address tells us the offset into the physical page.

Question: Why is the offset 12 bits?

Answer: We define the size of a page to be 4096 = 2^12 bytes. We want to be able to index into any byte of a destination physical page. So, we need 12 bits to represent every possible offset.

Question: Why is a single level not good enough?

Answer: Because we would need 2^39 bytes of data, which is too much memory! 2^39 = 2^36 * 2^3. The 2^36 is the number of entries: a meaningful address is 48 bits and we reserve 12 bits for the offset, so 36 bits remain for the index. The 2^3 comes from each entry being an address of 2^3 = 8 bytes.
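The size estimate can be verified at compile time. The constant names below are illustrative; the numbers assume 48 meaningful address bits, a 12-bit offset, and 8-byte entries.

```cpp
#include <cstdint>

constexpr unsigned ADDRESS_BITS = 48;  // meaningful virtual address bits
constexpr unsigned OFFSET_BITS = 12;   // 4096-byte pages
constexpr unsigned INDEX_BITS = ADDRESS_BITS - OFFSET_BITS;  // 36
constexpr uint64_t ENTRY_SIZE = 8;     // bytes per page table entry

// One entry per possible index value.
constexpr uint64_t single_level_table_bytes =
    (uint64_t{1} << INDEX_BITS) * ENTRY_SIZE;

static_assert(single_level_table_bytes == (uint64_t{1} << 39), "2^39 bytes");
static_assert((single_level_table_bytes >> 30) == 512, "512 GiB per table");
```

In other words, every process would need a 512 GiB page table before mapping a single byte.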

x86-64 page tables

As we just saw, a single level page table takes up a lot of space! We can save some space by using multiple levels. We can think of a multi-level page table structure as a tree. Multiple levels take less space because when we look up the physical page for a given virtual address, we may visit a branch of the tree that tells us there's actually no valid physical page for us to access. In that case, we just stop searching, and the kernel never has to allocate the lower-level tables for that branch. So, multiple levels mean we can have a sparse tree.

Consequences for virtual addresses

The x86-64 architecture uses 4 levels. This is reflected in the structure of a virtual address:

63-48	47-39	38-30	29-21	20-12	11-0
	L4	L3	L2	L1	OFFSET

An x86-64 virtual address has 64 bits, but only the low 48 bits are meaningful. We have 9 bits to index into each page table level, and 12 bits for the offset. This leaves the upper 16 bits unused.

Question: Why does each index get 9 bits?

Answer: Because the size of one page is 2^12 bytes, and each page table page entry is an address which has 2^3 bytes. 2^12/2^3 = 2^9 entries per page. We want to be able to index into any given entry, which means we need 9 bits.

%cr3

We store the physical address of the top level (L4) page table in a special register: %cr3.

Question: Why does %cr3 store a physical address, when every other register stores a virtual address?

Answer: Page tables are used to convert virtual addresses to physical addresses, so if we stored our top level page table address as a virtual address then we wouldn't know how to convert it to a physical address!

The lookup process

A successful lookup (finding a physical address from a virtual address) goes as follows:

Use %cr3 and the L4 index from the virtual address to get the L3 page table address
Use the L3 page table address and the L3 index to get a L2 page table address
Use the L2 page table address and the L2 index to get a L1 page table address
Use the L1 page table address and the L1 index to get the destination physical page
Use the destination physical page and the offset to get actual physical address within that destination physical page
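The five steps above can be sketched as a loop over a simulated physical memory. Everything here is an illustration under assumptions: `PhysMem`, `read64`, `walk`, and `make_example` are hypothetical names, only the P flag is checked, and large pages and the other permission bits are ignored.

```cpp
#include <cstdint>

constexpr uint64_t PTE_P = 1;                           // present flag
constexpr uint64_t PTE_ADDR_MASK = 0x0000FFFFFFFFF000;  // bits 47-12
constexpr int64_t FAULT = -1;                           // models a page fault

// Simulated physical memory: five 4 KiB pages viewed as 8-byte words,
// covering physical addresses 0x0000 through 0x4FFF.
struct PhysMem {
    uint64_t page[5][512] = {};
};

constexpr uint64_t read64(const PhysMem& mem, uint64_t pa) {
    return mem.page[pa >> 12][(pa & 0xFFF) >> 3];
}

// Translate `va` starting from the L4 table at physical address `cr3`.
constexpr int64_t walk(const PhysMem& mem, uint64_t cr3, uint64_t va) {
    uint64_t table = cr3;
    for (int level = 4; level >= 1; --level) {
        uint64_t index = (va >> (12 + 9 * (level - 1))) & 0x1FF;
        uint64_t entry = read64(mem, table + 8 * index);
        if (!(entry & PTE_P)) {
            return FAULT;  // this branch of the tree is empty: stop here
        }
        table = entry & PTE_ADDR_MASK;  // next table, or the final page
    }
    return static_cast<int64_t>(table | (va & 0xFFF));
}

// Example tree: L4 at 0x0000, L3 at 0x1000, L2 at 0x2000, L1 at 0x3000,
// and one mapped data page at 0x4000.
constexpr PhysMem make_example() {
    PhysMem m{};
    m.page[0][1] = 0x1000 | PTE_P;  // L4[1] -> L3 table
    m.page[1][2] = 0x2000 | PTE_P;  // L3[2] -> L2 table
    m.page[2][3] = 0x3000 | PTE_P;  // L2[3] -> L1 table
    m.page[3][4] = 0x4000 | PTE_P;  // L1[4] -> destination physical page
    return m;
}
```

In this example, `walk(make_example(), 0, 0x8080604567)` follows indexes 1, 2, 3, 4 and yields physical address 0x4567, while virtual address 0 faults at the first level because L4[0] is not present.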

Flags

Each entry in a page table is an address with the following structure:

63	62-48	47-12	11-3	2	1	0
NX		Physical address		U	W	P

Bits 0-2 contain the P (present), W (writable), and U (user-accessible) flags. Entries can carry other flags too, such as the NX (no-execute) bit in bit 63, which prevents the processor from executing instructions stored in that page. Marking the stack NX is an important defense against buffer overflow attacks that inject code onto the stack.

Again, page table entries don't have an offset because we used the physical addresses they store to find the start of another page table page or a destination physical page. We use the bits in a virtual address to access specific locations in those pages.

WeensyOS

The goals of pset 4 are to add process isolation, and implement the fork and exit system calls. In the handout code, there is no process isolation because each process has the same page table!

We run WeensyOS in QEMU, which emulates hardware for an x86-64 architecture.

Kernel and user addresses

The kernel (designated as K) lives in low addresses, starting at 0x40000, while processes live in the upper addresses of virtual memory, starting at 0x100000 (PROC_START_ADDR).

We also have hardware pages (designated as R) in the middle of the physical address space. This layout is a legacy of early PCs, whose processors could address 1 MB of memory; a quote often attributed to Bill Gates (possibly apocryphally) holds that "no one should ever need more than 640K of memory". The hardware lives in the upper portion of that first megabyte (between 640 KB and 1 MB).

The hardware includes one page marked as C for the CGA console. The console is an instance of memory-mapped I/O. The console is not memory, but behaves like it; we can write output to the console by writing values to memory.
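To illustrate what "writing values to memory" means for a text console, here is a sketch that uses an ordinary array as a stand-in for the console page. It assumes the standard CGA text-mode layout (80 columns by 25 rows, one 16-bit cell per character, character code in the low byte and color attribute in the high byte); `FakeConsole` and the helper names are hypothetical.

```cpp
#include <cstdint>

constexpr int CONSOLE_COLUMNS = 80;
constexpr int CONSOLE_ROWS = 25;

// One console cell: color attribute in the high byte, character in the low.
constexpr uint16_t console_cell(char ch, uint8_t color) {
    return static_cast<uint16_t>((color << 8) | static_cast<unsigned char>(ch));
}

// Cells are stored row by row.
constexpr int console_index(int row, int col) {
    return row * CONSOLE_COLUMNS + col;
}

// Stand-in for the console page. On real hardware these cells sit at a
// fixed physical address and a store to them changes what is displayed.
struct FakeConsole {
    uint16_t cells[CONSOLE_COLUMNS * CONSOLE_ROWS] = {};
};

// "Printing" a character is nothing more than a memory write.
constexpr FakeConsole write_char(FakeConsole c, int row, int col,
                                 char ch, uint8_t color) {
    c.cells[console_index(row, col)] = console_cell(ch, color);
    return c;
}
```

Writing 'A' with attribute 0x07 at row 1, column 2 stores the value 0x0741 in cell 82.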

Control transfer

We want to be able to switch into the kernel from user space and vice versa. All of these entry/exit points are defined in k-exception.S. You won't need to modify this file, but you should understand what it does.

When we have an exception (e.g. a timer interrupt or a segfault), or the user makes a syscall, we need to switch into the kernel. First, we save processor state by saving each register. The process's %rsp is pre-saved on the kernel stack. Then, we jump into a specific place in the kernel, which is determined by the entry code in k-exception.S.

When we want to give control back to the user process, we simply restore the registers we already saved.

System calls and protected control transfer

The following is a list of actions that will occur, either by the process or by the kernel, once a process invokes a system call:

The process sets up arguments of the system call in registers, according to the system call's calling convention.
The process invokes the syscall instruction, which initiates a protected control transfer to the kernel.
The kernel starts executing from a pre-defined entry point, and executes a handler of the system call.
The kernel finishes processing the system call, and it picks another process to resume execution.

In user space, a process has a system call "wrapper": a stub function that sets up the necessary state for executing a system call and then invokes the syscall instruction to transfer control to the kernel. Let's take a look at one system call, sys_getsysname. This is its user-space wrapper, defined in process.hh:

inline int sys_getsysname(char* buf) {
    register uintptr_t rax asm("rax") = SYSCALL_GETSYSNAME;
    asm volatile ("syscall"
                  : "+a" (rax), "+D" (buf)
                  :
                  : "cc", "rcx", "rdx", "rsi",
                    "r8", "r9", "r10", "r11", "memory");
    return rax;
}

The list cc, rcx, ... at the very end of the inline assembly (the clobber list) tells the compiler that the system call may overwrite these registers, the condition codes (cc), and memory. Because a system call is different from a normal function call, the compiler needs to be explicitly informed about its calling convention.

The kernel's handler of this system call is located in kernel.cc, within the syscall() function. The relevant part is shown below:

case SYSCALL_GETSYSNAME: {
    const char* osname = "DemoOS 61.61";
    char* buf = (char*) current->regs.reg_rdi;
    strcpy(buf, osname);
    return 0;
}

It appears that the kernel gets the system call's argument, the buffer in which to put the system name, from current->regs.reg_rdi. Let's break down how this works:

Each process has a process descriptor structure, maintained by the kernel.
When a syscall instruction is invoked, the kernel takes over, and it can access the process's process descriptor via pointer current.
The process descriptor maintains a lot of information about the process, including the register state of the process right before the protected control transfer occurred. The kernel does some work to copy all that information to the process descriptor (in this case, all register values are copied to a struct called regs within the process descriptor); more on that later.
Since register %rdi points to the buffer before syscall is executed (normal x86-64 calling convention), the kernel can access its value using current->regs.reg_rdi.

Let's explain in detail how the kernel saved all the information to the process descriptor after a syscall instruction gets invoked. The kernel doesn't start executing the syscall() function directly once the protected control transfer occurs. The actual entry point for syscall instructions is defined in k-exception.S, line 135:

syscall_entry:
    movq %rsp, KERNEL_STACK_TOP - 16 // save entry %rsp to kernel stack
    movq $KERNEL_STACK_TOP, %rsp     // change to kernel stack

    // structure used by `iret`:
    pushq $(SEGSEL_APP_DATA + 3)   // %ss
    subq $8, %rsp                  // skip saved %rsp
    pushq %r11                     // %rflags
    pushq $(SEGSEL_APP_CODE + 3)   // %cs
    pushq %rcx                     // %rip

    // other registers:
    subq $8, %rsp                  // error code unused
    pushq $-1                      // reg_intno
    pushq %gs
    pushq %fs
    pushq %r15 // callee saved
    pushq %r14 // callee saved
    pushq %r13 // callee saved
    pushq %r12 // callee saved
    subq $8, %rsp                  // %r11 clobbered by `syscall`
    pushq %r10
    pushq %r9
    pushq %r8
    pushq %rdi
    pushq %rsi
    pushq %rbp // callee saved
    pushq %rbx // callee saved
    pushq %rdx
    subq $8, %rsp                  // %rcx clobbered by `syscall`
    pushq %rax

    // load kernel page table
    movq $kernel_pagetable, %rax
    movq %rax, %cr3

    // call syscall()
    movq %rsp, %rdi
    call _Z7syscallP8regstate

    // load process page table
    movq current, %rcx
    movq (%rcx), %rcx
    movq %rcx, %cr3

    // skip over other registers
    addq $(8 * 19), %rsp

    // return to process
    iretq

We can see that the kernel eventually calls the syscall() function before returning to the process, but a lot of setup work is done before the function call. Here is a high-level summary of that setup:

Save and set %rsp so that the kernel starts using its own stack, instead of the process' stack.
Set up a structure, on the kernel stack, used by the iretq instruction to return to the process after the system call finishes.
Push all other user registers to the kernel stack.
Up until this point the kernel has been running on the process's page table. Switch the hardware (by setting %cr3) to use the kernel page table.
Call the C++ function syscall(), using the register structure we just saved on the stack as argument.
Restore to the process's page table, and return to the process.

In the syscall() function we copy the register structure passed in to the regs field in the process descriptor.

When a system call "returns", you can think of it as all register values in current->regs being copied (or restored) to the processor's actual registers before the process resumes execution.

System call handling is a bit like programming with exceptions. If you have experience with exceptions in another programming language, it may help you understand system calls. Every time a new system call occurs and the handler gets executed, the handler is not aware of any prior invocations of the same handler, unless mediated by some other state explicitly managed by the handler. The entire kernel stack gets "blown away" once a system call "returns". You can also think of it as an event-driven programming model.

Types of protected control transfers

Interrupts: caused by non-CPU hardware (e.g. timer)
Traps: caused by software intentionally (syscall)
Faults: caused by software error

It's worth pointing out that all these control transfers are administered by the x86 CPU.

When a process "returns" from a protected control transfer, the kernel restores its %rip register to point to:

Interrupts: the next instruction that hasn't been executed yet.
Traps: the next instruction following the trap.
Faults: the problematic instruction causing the fault.

TL;DR: If a process enters the kernel because of an interrupt or trap, then once it resumes execution it will pick up from where it left off. If a process enters the kernel because of a fault, then it will retry the faulty instruction if it resumes.
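The resume rules can be condensed into one function. This is a sketch with hypothetical names (`Transfer`, `resume_rip`); `insn_addr` is the address of the instruction that trapped or faulted, or, for an interrupt, of the instruction that was about to run.

```cpp
#include <cstdint>

enum class Transfer { Interrupt, Trap, Fault };

// Where should %rip point when the process resumes?
constexpr uint64_t resume_rip(Transfer kind, uint64_t insn_addr,
                              unsigned insn_len) {
    switch (kind) {
    case Transfer::Trap:
        return insn_addr + insn_len;  // continue after the trapping instruction
    case Transfer::Fault:
        return insn_addr;             // retry the faulting instruction
    case Transfer::Interrupt:
        return insn_addr;             // run the instruction that had not started
    }
    return insn_addr;  // unreachable for valid enumerators
}
```

For instance, the `syscall` instruction is two bytes long, so a process that executes it at address 0x1000 resumes at 0x1002, whereas a faulting load at 0x1000 would resume at 0x1000 and be retried.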

Question: How can the kernel, in the system call handler, simply overwrite register values from the process? Won't that mess up the process's state and cause it to crash?

Answer: System calls are traps (also called synchronous events), which means they are explicitly invoked by the process, and the process expects a control transfer to occur and respects the calling convention of such control transfers. This is in contrast to interrupts and faults, where such control transfers occur without the process's knowledge. In the case of system calls, the calling convention designates certain registers to be overwritten by the kernel to convey the results of the system call (%rax, for example, holds the return value), so the kernel is free to overwrite these registers. In fact, without using these registers, it becomes rather difficult to convey results of a system call unless the kernel exposes parts of its own memory to the process. It is worth noting, though, that the kernel can't overwrite process registers arbitrarily: registers outside those designated by the calling convention must be preserved.

Let's take a look at an example of an interrupt next, the timer interrupt (kernel.cc:241):

case INT_TIMER:
    ++ticks;
    schedule();
    break;      /* will not be reached */

The code above is located in an exception handler, which is similar to the system call handler in terms of how it saves the process's state to the process descriptor. We see it simply increments a ticks variable, and calls the schedule() function, which picks another process to execute on the processor. Note that in this interrupt handling code no modifications were made to the process's register state. The process does not expect a timer interrupt to occur so we had better make them transparent to the process. By not modifying any registers we achieve this goal.

The sys_yield system call, which is similar to the timer interrupt, has the following relevant code in the system call handler:

case SYSCALL_YIELD:
    current->regs.reg_rax = 0;
    schedule();             // does not return

Here we can modify the process's saved %rax because, again, this is a system call: the process expects a control transfer to occur and expects the kernel to overwrite the value in %rax.
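The contrast between the two handlers can be sketched in ordinary user-space C++. This is only an illustrative model, not DemoOS code: the regstate struct and both handler functions are hypothetical stand-ins for the saved register state and the kernel's handlers.

```cpp
#include <cstdint>

// Hypothetical, simplified copy of a process's saved registers.
struct regstate {
    uint64_t reg_rax;
    uint64_t reg_rdi;
    uint64_t reg_rip;
};

// Models the sys_yield handler: the calling convention lets the
// kernel place the return value (0) in the saved %rax.
void handle_syscall_yield(regstate& saved) {
    saved.reg_rax = 0;
}

// Models the timer interrupt handler: it only updates kernel
// bookkeeping. Taking the registers by const reference makes it
// impossible to modify them by accident.
void handle_timer(const regstate&, unsigned long& ticks) {
    ++ticks;
}
```

Passing the saved registers as const in the interrupt path is one way to let the compiler enforce the "transparent to the process" rule.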

So, we have a super well-isolated operating system, called DemoOS, and there is absolutely nothing a program can do to take over the machine.

Absolutely nothing.

Or, is there?

Alice and Eve

We now look at two programs written by two rivals, Alice and Eve. They will be running on DemoOS.

This is Alice, in p-alice.cc:

#include "process.hh"
#include "lib.hh"

void process_main() {
    char buf[128];
    sys_getsysname(buf);
    app_printf(1, "Welcome to %s\n", buf);

    unsigned i = 0;
    while (1) {
        ++i;
        if (i % 512 == 0) {
            app_printf(1, "Hi, I'm Alice! #%d\n", i / 512);
        }
        sys_yield();
    }
}

And this is Eve, in p-eve.cc:

#include "process.hh"
#include "lib.hh"

void process_main() {
    unsigned i = 0;

    while (1) {
        ++i;
        if (i % 512 == 0) {
            app_printf(0, "Hi, I'm Eve! #%d\n", i / 512);
        }
        sys_yield();
    }
}

We can see that both Alice and Eve contain infinite loops, but the programs are being nice! By explicitly invoking the sys_yield() system call, they are "yielding" precious CPU time to each other (i.e. letting another process run) so that both of them can make progress.

If our DemoOS is as wonderful as we claimed, there is absolutely nothing Eve can do to prevent Alice from running, and vice versa. The two always get a roughly equal share of CPU resources.

Attack 1: No yielding any more

If Eve stops calling sys_yield(), then Eve no longer actively yields the CPU to any other program, and it takes over the entire machine.

This attack succeeds because the timer interrupt is not properly configured. Once the kernel initializes the timer interrupt, Eve no longer gets to take over the machine: Alice gets a chance to run again, although Eve still visibly uses more of the machine's resources by not yielding. Strictly speaking this doesn't provide fairness, but it is a reasonable policy an OS may choose to implement.
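The effect of the timer interrupt on CPU shares can be illustrated with a small simulation. This is a toy model under stated assumptions, not the DemoOS scheduler: toy_process, simulate, and the notion of a "unit" of CPU time are all invented for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of a process: it either calls sys_yield()
// after every unit of work or never yields at all.
struct toy_process {
    bool yields;
    unsigned long units_run;
};

// Runs `procs` round-robin for `total_units` units of CPU time.
// If `timer_quantum` is nonzero, a simulated timer interrupt
// forces a switch after that many consecutive units on one
// process; if it is zero, only voluntary yields cause a switch.
void simulate(std::vector<toy_process>& procs,
              unsigned long total_units,
              unsigned long timer_quantum) {
    if (procs.empty()) {
        return;
    }
    std::size_t current = 0;
    unsigned long since_switch = 0;
    for (unsigned long t = 0; t < total_units; ++t) {
        ++procs[current].units_run;
        ++since_switch;
        bool timer_fired =
            timer_quantum != 0 && since_switch >= timer_quantum;
        if (procs[current].yields || timer_fired) {
            current = (current + 1) % procs.size();
            since_switch = 0;
        }
    }
}
```

With a yielding Alice and a non-yielding Eve, 1100 units, and no timer, Alice runs for 1 unit and Eve for 1099. With a quantum of 10, Alice gets 100 units and Eve 1000: Alice makes progress again, but the split is still uneven, matching the behavior described above.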

Attack 2: Disabling interrupts

Eve disables interrupts by using the cli instruction. Once Eve successfully executes this instruction, Alice again gets no CPU time.

We fix this by not allowing processes to control interrupts. Now if Eve tries to execute cli, it crashes.

When Eve attempts to execute the no-longer-allowed cli instruction, the hardware generates a fault and transfers control to the kernel. The kernel handles this fault in its exception handler under case INT_GPF, the general protection fault. This is a catch-all fault type that the hardware raises when the cause of the error doesn't fall under any of the more specific fault types.

Attack 3: I crash, you crash (divide by zero)

Eve then changes its program to contain a divide-by-zero error. When that instruction executes, DemoOS crashes and nobody gets to run.

As it turns out, a divide-by-zero error triggers another hardware exception that we did not handle. We don't really want to list all the exceptions one by one and write code to handle each of them, because for many of them we do exactly the same thing: kill the faulting process. We may be tempted to just add this code in the default case: for all unexpected exceptions, just kill the process.

We need to be careful here though, as we should really only do this if an exception stems from problems in user code. If the kernel throws an exception, it usually indicates a serious bug in the kernel and it's a bad idea to carry on with life as if it never happened. This can be done by adding a simple check under the default case:

default:
    if ((regs->reg_cs & 3) != 0) {
        current->state = P_BROKEN;
    } else {
        panic("Unexpected exception %d!\n", regs->reg_intno);
    }
    break;

The least significant 2 bits of the saved %cs register hold the privilege level the processor was running at when the fault occurred.
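The privilege check can be exercised on its own in user space. The sketch below is illustrative: the helper names are hypothetical, and the selector values in the usage note are examples rather than DemoOS's actual segment selectors. It also shows a C++ pitfall that applies to this kind of test: `!=` binds more tightly than `&`, so the mask needs its own parentheses.

```cpp
#include <cstdint>

// The low 2 bits of a saved %cs value hold the privilege level
// (0 is the most privileged level, 3 the least).
constexpr unsigned saved_privilege_level(uint64_t cs) {
    return static_cast<unsigned>(cs & 3);
}

// True if the exception interrupted code running at a nonzero
// (unprivileged) level.
constexpr bool came_from_user(uint64_t cs) {
    return (cs & 3) != 0;
}

// What `cs & 3 != 0` means without the inner parentheses:
// it parses as `cs & (3 != 0)`, which is `cs & 1`, so only
// bit 0 of the selector is tested.
constexpr bool unparenthesized_check(uint64_t cs) {
    return (cs & 1) != 0;
}
```

For example, with a hypothetical selector of 0x1b the level is 3 and both checks report user code, and with 0x08 both report kernel code. A selector such as 0x0a (level 2) is where the two checks disagree.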

We can also have more fun with Eve. Imagine that, instead of crashing Eve's program when a divide-by-zero occurs, we want to confuse her. We can do this by handling the divide-by-zero exception this way:

case INT_DIVIDE:
    current->regs.reg_rax = 61;
    current->regs.reg_rip += 2;
    break;

As per the x86 specification, this should be enough to convince Eve that anything divided by zero is always 61. (The specification of the idiv instruction says that the quotient of the division is stored in %rax.) We also increment Eve's %rip because divide by zero is a fault, so the %rip saved by the kernel points to the faulting instruction, which is the division instruction. Without changing %rip, Eve would re-execute the idiv instruction once it resumes. We move past the divide instruction by adding 2 to %rip (the idiv instruction here is 2 bytes long). This shows the control and power the kernel has over a process.

Attack 4: Jump-to-kernel

Eve now examines the kernel assembly and finds out that the syscall entry point is located at address 0x40ac6. She then makes her program write two magic bytes to that location: 0xeb 0xfe. The two bytes form an evil instruction that jumps to itself: another infinite loop attack! Now whenever a system call is made, the kernel enters an infinite loop, and the machine hangs.

Infinite loops in the kernel are particularly disastrous because the kernel usually runs with interrupts disabled, so no timer interrupt can break the loop.
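To see why the bytes 0xeb 0xfe jump to themselves, recall that 0xeb is the x86 short-jump opcode and the byte after it is a signed 8-bit displacement, measured from the address of the next instruction. The helper below is a hypothetical sketch of that arithmetic, not part of DemoOS.

```cpp
#include <cstdint>

// Target of an x86 short jump (opcode 0xeb followed by one
// signed displacement byte). The instruction is 2 bytes long and
// the displacement is relative to the following instruction.
constexpr uint64_t short_jump_target(uint64_t insn_addr,
                                     uint8_t disp_byte) {
    return insn_addr + 2
        + static_cast<uint64_t>(
              static_cast<int64_t>(static_cast<int8_t>(disp_byte)));
}
```

Since 0xfe is -2 as a signed byte, short_jump_target(0x40ac6, 0xfe) is 0x40ac6 + 2 - 2, which is 0x40ac6 again: the instruction's own address.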

This attack can succeed because DemoOS doesn't properly implement kernel memory isolation -- a user process can access (read and write) any kernel memory that's mapped in the process's address space. We isolate the kernel by setting the proper permissions for kernel memory:

for (vmiter it(kernel_pagetable, 0);
     it.va() < MEMSIZE_PHYSICAL;
     it += PAGESIZE) {
    // Don't set the U bit, except for the console page
    if (it.va() != (uintptr_t) console) {
        it.map(it.pa(), PTE_P | PTE_W);
    }
}

After adding protection for kernel memory, Eve crashes after attempting to overwrite the syscall entry point, and Alice gets to run till the end.
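The effect of leaving out the U bit can be modeled with a simplified version of the check the hardware applies to each memory access. The flag values match the low bits of an x86-64 page table entry, but access_allowed is a hypothetical helper, and the model assumes that writes always require the W bit and ignores other protection features.

```cpp
#include <cstdint>

// Low flag bits of an x86-64 page table entry.
constexpr uint64_t PTE_P = 1;   // present
constexpr uint64_t PTE_W = 2;   // writable
constexpr uint64_t PTE_U = 4;   // user-accessible

// Simplified model of the permission check for one access.
// `user_mode` is true when the access comes from a process.
constexpr bool access_allowed(uint64_t pte_flags,
                              bool user_mode,
                              bool is_write) {
    return (pte_flags & PTE_P) != 0
        && (!user_mode || (pte_flags & PTE_U) != 0)
        && (!is_write || (pte_flags & PTE_W) != 0);
}
```

In this model, a page mapped with PTE_P | PTE_W stays writable for the kernel but is rejected for any access from Eve, which is the situation after the fix above.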

Epilogue

Process isolation is still far from achieved in DemoOS. There are several kinds of attacks Eve can still perform against Alice. Just to name a couple:

Eve can clobber Alice's memory
Fork bomb...

Thanks William for playing Eve from MIT.

Kernel 5: Confused deputy attack, scheduling, and process management

Confused deputy attack

A confused deputy attack occurs when an attacker with low privilege convinces a privileged deputy to complete the attack on its behalf.

In the context of operating systems, a process is unprivileged, while the kernel has full privilege and acts as a privileged deputy by handling system calls. A confused deputy attack may occur if the process, by invoking system calls, can somehow convince the kernel to execute a privileged attack.

Certain system calls are vulnerable to such attacks, but others are not. Very simple system calls like sys_getpid usually aren't susceptible to these attacks because they don't change the state of the kernel or other processes at all. sys_getpid simply copies the current process's pid value to its %rax register, and that's it. It's so simple that there is little room for bad things to happen.

Recall that at the end of last lecture, Eve tried to overwrite the syscall entry point in kernel memory with an infinite loop. We fixed this by denying access to kernel memory from the user level. Now Eve cannot directly write to kernel memory any more, but is it possible for Eve to convince the kernel to write malicious data or code to kernel memory?

Let's take a look at sys_getsysname. Could it be used by Eve to perpetrate a confused deputy attack? What if Eve simply does the following:

char* syscall_entry_addr = (char*)0x40ac6;
sys_getsysname(syscall_entry_addr);

Eve actually manages to crash the entire OS by adding just these 2 lines of code! The kernel did not perform any sanity checks before writing the string containing the OS name to the user-supplied buffer. In this case the buffer happens to point to the syscall entry point in kernel memory, and the kernel happily overwrote it. A successful confused deputy attack!

It's worth pointing out that Eve never directly wrote to any kernel memory. The kernel overwrote part of its own memory because the buffer passed by Eve points to kernel memory.

You may wonder why a buffer in Eve's address space would point to a critical component in the kernel's address space. It is true that when the kernel performs the string copy, it uses Eve's page table to perform address translation (for the destination of the copy). However, note that in DemoOS (and WeensyOS), the part of the page table mapping below PROC_START_ADDR (where the syscall entry point is located) is shared among all processes as well as the kernel. So the problematic address Eve passes to the kernel translates into the same physical address, in kernel memory, regardless of which page table (including the kernel's page table) is used for address translation. Additionally, because processes and the kernel also share the same physical pages for mappings below PROC_START_ADDR, overwriting instructions there has a global effect: every process's (and the kernel's) syscall entry point code gets corrupted.

Eve is still not satisfied, because this attack blows away all processes in the system, including Eve itself. Eve would like a more targeted attack against only Alice. Eve could try, via a confused deputy attack, to corrupt Alice's register file in its process descriptor.

With information about kernel memory layout and Alice's PID, it is possible for Eve to figure out where Alice's process descriptor is located in memory. Eve then once again uses the sys_getsysname() system call to corrupt Alice's process descriptor, and after the system call is executed, Alice crashes and only Eve gets to run on the system.

Confused deputy attacks can be far more devious and do far more damage than demonstrated here. For example, if a malicious process can somehow convince the kernel to turn on certain bits in its page table, it could suddenly gain access to kernel memory and even inspect or change other processes' memory. Instead of simply causing the victim to crash, it can passively monitor and steal information from the victim, or do more.

To prevent such attacks, the kernel should always perform checks on user-supplied inputs before acting on them. In this example of sys_getsysname(), we should make sure that the buffer address supplied by the user is mapped as user-accessible in the process's page table. A safer version of the sys_getsysname() handler should look like the following:

case SYSCALL_GETSYSNAME: {
    const char* osname = "DemoOS 61.61";
    char* buf = (char*) current->regs.reg_rdi;

    // Check that each destination byte is mapped as
    // user-accessible before writing it; stop at the first
    // byte that is not
    size_t len = strlen(osname) + 1;
    size_t i = 0;
    for (vmiter it(current, (uintptr_t) buf);
         i < len;
         it += 1, ++i) {
        if (!it.user()) {
            return -1;
        } else {
            // Performs the actual copy
            // Note that the destination of the copy is
            // address-translated using the process's
            // page table.
            *((char*) it.pa()) = osname[i];
        }
    }

    return 0;
}
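One possible variation is to validate the whole destination range before copying anything, so that a rejected call leaves memory untouched, and to require the pages to be writable as well as user-accessible. The sketch below models this in user space. Everything in it is an assumption made for illustration: toy_space stands in for a page table plus physical memory, and user_range_writable and copy_string_to_user are hypothetical helpers, not DemoOS or WeensyOS functions.

```cpp
#include <cstdint>
#include <map>
#include <string>

constexpr uint64_t PAGESIZE = 4096;
constexpr uint64_t PTE_P = 1, PTE_W = 2, PTE_U = 4;

// Hypothetical toy address space: per-page permission flags plus
// a sparse byte store standing in for memory.
struct toy_space {
    std::map<uint64_t, uint64_t> page_flags;   // page number -> flags
    std::map<uint64_t, char> bytes;            // address -> byte
};

// True if every page overlapping [addr, addr + len) is present,
// writable, and user-accessible.
bool user_range_writable(const toy_space& s, uint64_t addr,
                         uint64_t len) {
    if (len == 0) {
        return true;
    }
    if (addr + len < addr) {
        return false;                          // address wraparound
    }
    constexpr uint64_t need = PTE_P | PTE_W | PTE_U;
    uint64_t last_page = (addr + len - 1) / PAGESIZE;
    for (uint64_t pn = addr / PAGESIZE; pn <= last_page; ++pn) {
        auto it = s.page_flags.find(pn);
        if (it == s.page_flags.end()
            || (it->second & need) != need) {
            return false;
        }
    }
    return true;
}

// Copies `src` plus its terminating NUL to `dest` only if the
// whole destination range passes the check; otherwise it writes
// nothing and returns false.
bool copy_string_to_user(toy_space& s, uint64_t dest,
                         const std::string& src) {
    uint64_t len = src.size() + 1;
    if (!user_range_writable(s, dest, len)) {
        return false;
    }
    for (uint64_t i = 0; i < len; ++i) {
        s.bytes[dest + i] = i < src.size() ? src[i] : '\0';
    }
    return true;
}
```

In this model, a copy aimed at address 0x40ac6 on a page without the U bit is refused before a single byte is written, while a copy into a user page succeeds. A destination that starts in a user page but runs into an unmapped page is also refused as a whole.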

We briefly note here that sys_page_alloc() has a similar vulnerability. Eve could request to map a new physical page to the virtual page containing the syscall entry point instructions in its address space. It only affects Eve's page table and has no global effect, but it grants Eve control over its syscall entry point, and Eve can put arbitrary code there. The next time Eve executes a system call, the inserted code within the newly mapped physical page will run in privileged mode, and it can easily compromise the entire machine. Note that in this case no direct corruption of kernel memory occurred, but a confused deputy attack can still take place.
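A defense against this variant is again to validate the user-supplied argument before acting on it. The sketch below shows one plausible check for the address passed to sys_page_alloc(). The constant values and the helper name page_alloc_addr_ok are assumptions chosen for illustration; the real limits depend on the kernel's memory layout.

```cpp
#include <cstdint>

constexpr uint64_t PAGESIZE = 4096;
// Assumed values, for illustration only.
constexpr uint64_t PROC_START_ADDR = 0x100000;
constexpr uint64_t MEMSIZE_VIRTUAL = 0x300000;

// Accept only page-aligned addresses inside the range reserved
// for process memory, so a process can never remap the pages
// that hold kernel code such as the syscall entry point.
constexpr bool page_alloc_addr_ok(uint64_t addr) {
    return addr % PAGESIZE == 0
        && addr >= PROC_START_ADDR
        && addr < MEMSIZE_VIRTUAL;
}
```

Under these assumed limits, a request for the page at 0x40000 (which contains the syscall entry point at 0x40ac6) is rejected, while a request for 0x100000 is accepted.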
