Contents
- Kernel and processes
- Processor time as resource
- Timer interrupts
- Current privilege level
- Protected control transfer
- Memory isolation
- Virtual memory
- Page tables
- Confused deputy attack
Kernel and processes
The kernel is the piece of software that runs with full machine privilege. This means the kernel has full privilege over all of a computer's resources.
(The kernel is the base part of an operating system. An operating system refers to a whole software ecosystem: both the privileged software that manages computer resources (the kernel) and the libraries and support programs that interface with that privileged software and make it more convenient to use. Windows, Linux, and Mac OS X are examples of operating systems.)
Processes are software programs in execution that run without full machine privilege. (Processes are often called unprivileged processes or user-level processes to emphasize their unprivileged status. “User level” is the opposite of “kernel”.)
A process is a program in execution. The relationship between a program and a process is like that between a recipe and a cake. A recipe is an inanimate list of instructions that, interpreted by a cook, can make something delicious. A program is an inanimate list of instructions—a file on disk—that, loaded into memory and interpreted by a processor, can make something magical. A process is a live instance of a program, running at a particular time, on a particular piece of hardware, dealing with a particular set of inputs.
The kernel’s purpose is to serve the needs of processes as a whole. It balances three goals:
- Fairly and safely share machine resources among processes.
- Provide safe and convenient access to machine resources by inventing abstractions for those resources (such as files, which abstract disks).
- Ensure robustness and performance.
In modern operating systems, much kernel code aims to provide protection: ensuring that no process can violate the operating system’s sharing policies. This is because processes can have bugs. A process can crash, or enter an infinite loop, or attempt to take over the machine, maliciously or accidentally. So kernels should prevent mistakes in individual processes from bringing down the system as a whole.
Kernels can achieve these goals only with help from hardware. A running process executes on a processor; that processor executes the process’s instructions, one after another, as fast as possible. The kernel is not emulating the processor—that would be very slow. The processor does not validate each instruction with the kernel. Instead, the processor runs on behalf of the process, and executes most process instructions directly. Processors support special mechanisms, accessible only to privileged code (the kernel), that ensure that processes cannot run amok, and the kernel can still arbitrate resources.
Processor time as resource
One of the most fundamental machine resources is processor time (or CPU time): the fraction of time the processor spends executing one process’s instructions rather than another’s. The kernel aims to share processor time according to its policy.
Here’s a fundamental attack on fair sharing of processor time. It’s the worst attack in the world:
int main() {
    while (true) {
    }
}
An infinite loop. Compiled to x86-64 instructions, this might be
00000000000005fa <main>:
5fa: 55 push %rbp
5fb: 48 89 e5 mov %rsp,%rbp
5fe: eb fe jmp 5fe <main+0x4>
The critical instruction is `jmp 5fe`, represented in bytes as `eb fe`, which spins the processor in a tight loop forever.
Aside. Why is this loop represented as `0xeb 0xfe`? An instruction consists of an opcode (e.g., "push", "mov", "pop") and some operands (e.g., "%rbp", "5fe"). Here, the `0xeb` part is the opcode. This opcode means "unconditional branch (`jmp`) by a relative one-byte offset": when the instruction is executed, the `%rip` register will be modified by adding to it the signed offset stored as an operand. Here, that operand is `0xfe`, which, considered as a signed 8-bit number, is -2. Remember that when an instruction executes, the initial value of `%rip` is always the address of the next instruction (because the processor must read the entire current instruction before executing it). Thus, adding -2 to `%rip` will reset `%rip` back to the start of the `jmp`.
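To make the offset arithmetic concrete, here is the same computation written as a short C++ program (the addresses come from the disassembly above; this is plain arithmetic, not processor code):

#include <cstdint>
#include <cstdio>

int main() {
    uint64_t insn_addr = 0x5fe;             // address of the 2-byte jmp instruction
    uint64_t rip_after = insn_addr + 2;     // %rip once the instruction has been read
    int8_t offset = (int8_t) 0xfe;          // operand interpreted as signed: -2
    printf("jump target = 0x%llx\n",
           (unsigned long long) (rip_after + offset));   // prints 0x5fe: the jmp itself
    return 0;
}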
Processors execute the instructions they’re given in a simple-minded, straightforward way. If a processor starts executing an infinite loop, how will any other instruction ever run?
You might think to solve this problem by never running untrusted code. For instance, instead of letting the processor directly run an untrusted program, the kernel could interpret the instructions; if it detected at run time that a process had run too long, it could switch to another process. This is a good idea, and some amazing virtual machine technologies use this insight. But it’s tremendously slower than letting a processor run instructions directly—maybe thousands of times slower. It is possible to run untrusted code, safely and efficiently, with a little more hardware support.
Timer interrupts
What is needed is a way to limit the time that any single process can run on the CPU. After that time elapses, the processor should interrupt its execution and switch to the kernel, giving the kernel a chance to run something else.
Machines accomplish this with a separate piece of hardware called the timer. This timer can be configured by the kernel to go off periodically in real time, such as once every millisecond. When the timer goes off, it sends an interrupt to the processor, which gives the processor the chance to run something else.
Timer interrupts are an almost inevitable consequence of the problem of infinite loops. Many other aspects of timer interrupt implementation also follow logically from the problem timer interrupts aim to solve.
- Any process that runs for too long must be interrupted by a timer. Therefore, processes must not be allowed to configure the timer: if they could, they could disable the timer or set it to go off once a year.
- The kernel should be allowed to configure the timer. Every operating system wants to prevent processes from monopolizing CPU time, but different operating systems enforce very different detailed policies. (Some processes might have priority over others, for example.)
- Since the kernel, which is software, can configure the timer, but processes, which are also software, cannot, the processor must support different privilege modes, so that attempts to configure the timer can be distinguished.
- A timer interrupt can occur at any time during process execution. This doesn't indicate a bug in the process—maybe the process is just executing a long-running task—so an interrupted process should be able to pick up exactly where it left off, with all of its registers restored to their original values.
This marks an important difference with function calls. In a function call, the caller voluntarily transfers control to another piece of software. Since the control transfer is voluntary, the caller can prepare for it and implement a calling convention, saving any important registers to the stack and restoring them later. But in an interrupt, the process involuntarily transfers control to the kernel. The process cannot fully prepare.
The processor and kernel’s interrupt handling mechanisms are carefully engineered to save all processor state, allowing the interrupted processes to resume later as if nothing had occurred. This is an instance of protected control transfer, also known as exceptional control transfer.
Current privilege level
We see that process isolation requires that computers have some resources that are protected based on software privilege. For example, the kernel must be able to configure the timer and handle interrupts, and processes must not. However, kernel and processes are both software—sets of machine instructions that the processor executes. How can the processor allow the kernel to execute some instructions, while preventing processes from executing the same instructions?
The key is a hardware feature called the current privilege level. The current privilege level is contained in a special-purpose CPU register. Some instructions, which we call dangerous instructions, can only be executed if the current privilege level indicates that the currently-executing software has full machine privilege.
On x86-64, the current privilege level is a number between 0 and 3 stored in the lower 2 bits of the special-purpose `%cs` register. When the kernel is running, `(%cs & 3) == 0` and the CPU allows the execution of all instructions, including dangerous instructions. When a process is running, `(%cs & 3) == 3` and the CPU does not execute dangerous instructions; instead, if it is asked to execute a dangerous instruction, it raises a fault (a kind of exception), saves processor state, and starts running the kernel. This informs the kernel that a process tried to do something illegal. The kernel will generally kill (stop) the offending process.
`%cs & 3` | Meaning |
---|---|
0 | Kernel (full machine privilege) |
1–2 | Unprivileged, but unused in most operating systems |
3 | User (unprivileged) |
QUESTION. Is changing `%cs` (for instance, `movw $0, %cs`) a dangerous instruction?
Protected control transfer
Recall how the `stdio` library functions invoke system calls like `read()` and `write()`. We saw in assembly that a `syscall` instruction was invoked, and that it only returned after the system call finished. This `syscall` instruction is a key interface via which user processes interact with the kernel. It implements a form of protected control transfer -- it transfers control of the processor to the kernel in a safe and limited way.
Protected control transfer is safe because a process can only enter the kernel at well-specified entry points. The process can't just jump to arbitrary code residing within the kernel.
Every process's address space contains a portion reserved for the kernel. (Usually this is the higher half of the address space, but in WeensyOS it's the lower part of the address space—the part below `PROC_START_ADDR`.) One can write a program that attempts to jump to some of these instructions in the kernel:
int main() {
    unsigned long kernel_insn = 0xffffffff80000100;
    void (*f)() = (void (*)()) kernel_insn;
    f();
}
This program will crash with a segmentation fault, because a user process is not allowed to access anything reserved for the kernel directly. A user process can only invoke the kernel at specific entry points by using the `syscall` instruction.
QUESTION. Why must we only allow control transfer to the kernel at specific points?
The limitations and restrictions guaranteed by protected control transfer are implemented by both the OS software and the hardware. This hardened interface between user space and the kernel is the cornerstone of security in modern computer systems.
Virtual memory
The most fundamental resources on a computer are processor time, registers, and primary memory. Primary memory is RAM—random-access memory: the place where data is stored by address. Process isolation requires that each process have its own views of registers and primary memory. But RAM is stored in electrical components—transistors—and each computer has a RAM of fixed size, where the processor can access any byte of that RAM by address. How can a process be restricted from accessing other processes’ memories, or the kernel’s memory? If a process can modify the kernel’s instructions, then process isolation is fundamentally broken.
Modern computers implement memory isolation with a hardware feature called virtual memory. Just as with processor time, direct, fast access to memory from processes is so important that hardware support is used to implement protection.
We can model virtual memory as a mapping function \mathscr{P} : \textit{VA} \mapsto \textit{PA}. A virtual memory mapping function is like a pair of crazy glasses, or blinders, that constrain and modify the CPU’s view of memory. All memory addresses used by instructions pass through the virtual memory mapping function.
Let’s say an instruction accesses memory by address A. Here’s what happens.
- The address A is actually a virtual address taken from the domain \textit{VA}. This address does not correspond to a byte in some RAM chip! Instead, to find the relevant piece of RAM, the CPU must use the current mapping function \mathscr{P}.
- The processor therefore computes the value \mathscr{P}(A). This is a physical address taken from the domain \textit{PA}, or a special value denoting an illegal memory access that we call a memory fault or page fault. Physical addresses, unlike virtual addresses, do correspond to bytes in RAM chips.
- If the instruction reads the byte at A, the processor will actually read physical memory byte \mathscr{P}(A). If the instruction writes the byte at A, the processor will actually write physical memory byte \mathscr{P}(A). (If \mathscr{P}(A) indicates a fault, the CPU will save the current register state and run the kernel via an exceptional control transfer.)
- The machine can have many mapping functions \mathscr{P}_0\dots\mathscr{P}_N, though only one mapping function is active at a time. These different functions can route the same virtual addresses to different physical addresses, thereby isolating different parts of memory at different times. The current mapping function is defined using a special-purpose machine register—on x86-64, the `%cr3` register.
QUESTION. Should modifying the `%cr3` register be a dangerous instruction?
Page tables
Most modern architectures implement virtual memory mapping functions through a mechanism called page tables. A page table is an in-memory data structure, implemented on x86-64 as a tree with branching factor 512 and maximum depth 4. This data structure is set up and initialized by software—specifically the kernel—but used by the processor: the virtual memory system effectively uses the page table on every memory access to compute \mathscr{P}(A) for the supplied virtual address A. A page table thereby implements the filter, or glasses, through which memory is viewed.
Page tables on x86-64 and most other processors combine address translation with flags that can distinguish different kinds of access. Specifically, a page table defines, for each virtual address:
- Whether the corresponding physical address exists.
- P (Present) bit: this part of the filter is OK to access
- If not set, any access causes a fault
- Whether unprivileged access is allowed.
- U (Unprivileged) bit: this part of the filter is OK for unprivileged access
- If not set, any access by unprivileged software (`(%cs & 3) != 0`) causes a fault
- Whether the memory is read-only.
- W (Writable) bit: this part of the filter is OK for writes
- If not set, any write access causes a fault
- The physical address (assuming the above checks succeed).
This allows arbitrary rearrangement (or aliasing) of memory.
Virtual memory for a 64-byte memory
To make page tables concrete, we first describe a tiny, single-level page table for a machine with a one-byte address size, with just 64 meaningful addresses.
First, we divide memory into aligned blocks called pages. On this machine, we'll say the page size is 8: every page is 8 bytes of memory, starting at an address that's a multiple of 8. This splits memory addresses into two parts, the page index and the page offset. The following is a memory address in this architecture:
bit 5 | bit 4 | bit 3 | bit 2 | bit 1 | bit 0 |
---|---|---|---|---|---|
index | index | index | offset | offset | offset |
Each of the architecture’s 8 memory pages comprises 8 bytes of memory, starting at an address that’s a multiple of 8. Each page is identified by an index. Within a page, there are 8 bytes—8 different addresses—each with a different offset.
A page table must provide enough information to map any virtual address to a corresponding physical address. It does so by defining page table entries (PTEs) that map a specific virtual page—which comprises a set of virtual addresses with the same virtual page index—to a physical page, as well as any permission bits to be used for that virtual page. Here's an example for this architecture:
bit 5 | bit 4 | bit 3 | bit 2 | bit 1 | bit 0 |
---|---|---|---|---|---|
physical page index | physical page index | physical page index | U | W | P |

Bits 5–3 hold the physical page index (the physical address >> 3); bits 2–0 are the flag bits.
In our 6-bit architecture, the lookup proceeds as follows (with virtual address `va` and access type `at`):

- Start from physical address `%cr3` (the location of the page table).
- Access physical memory at `%cr3[va >> 3]`: this is the relevant page table entry. (`va >> 3` is the page index.)
- Check the flag bits; maybe fault.
- If the access is OK, return `(%cr3[va >> 3] & 0b111000) | (va & 7)` as the physical address.
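Here is the same lookup as a small C++ sketch. The `pagetable` array stands in for the page table that `%cr3` points to, and the flag constants are assumptions chosen to match the PTE layout above (P in bit 0, W in bit 1, U in bit 2):

#include <cstdint>

// Flag bits, matching the PTE layout above (assumed values).
enum { PTE_P = 1, PTE_W = 2, PTE_U = 4 };

// One 8-entry page table; entry i maps virtual page i.
uint8_t pagetable[8];

// Translate virtual address `va` (0-63) to a physical address, or return -1
// on a fault. `write` and `user` describe the attempted access.
int translate(uint8_t va, bool write, bool user) {
    uint8_t pte = pagetable[va >> 3];           // page index selects the PTE
    if (!(pte & PTE_P)) return -1;              // not present
    if (write && !(pte & PTE_W)) return -1;     // not writable
    if (user && !(pte & PTE_U)) return -1;      // not accessible to user code
    return (pte & 0b111000) | (va & 7);         // physical page bits + page offset
}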
Toward x86-64: Single level page table
x86-64, of course, has way more than 64 addresses. How can we design a page table format that works for 48-bit addresses?
Wait a minute, I thought x86-64 addresses were 64 bits! That's right, they are, but in current processors, the 16 upper bits—bits 48–63—are reserved for future expansion. The value of each of bits 48–63 must equal the value of bit 47, so only 2^48 of the 2^64 possible addresses are valid—specifically, 0x0000'0000'0000'0000–0x0000'7FFF'FFFF'FFFF and 0xFFFF'8000'0000'0000–0xFFFF'FFFF'FFFF'FFFF. (But some recent processors support 57-bit addresses!)
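As a quick sanity check of that rule, here is a tiny C++ helper that tests whether an address is "canonical", i.e., whether bits 48–63 all equal bit 47 (plain arithmetic, not kernel code):

#include <cstdint>
#include <cstdio>

// An address is canonical when sign-extending bit 47 reproduces it exactly.
bool is_canonical(uint64_t va) {
    int64_t sign_extended = ((int64_t) (va << 16)) >> 16;   // replicate bit 47 upward
    return (uint64_t) sign_extended == va;
}

int main() {
    printf("%d\n", is_canonical(0x0000'7FFF'FFFF'FFFF));   // 1: top of the lower half
    printf("%d\n", is_canonical(0xFFFF'8000'0000'0000));   // 1: bottom of the upper half
    printf("%d\n", is_canonical(0x0000'8000'0000'0000));   // 0: in the non-canonical hole
    return 0;
}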
Let's first consider a single level page table. Here, a virtual address contains one index into the page table and an offset that is always 12 bits. When we use that index to find the corresponding entry in the page table page, we get a physical page address. The offset from the virtual address tells us the offset into the physical page.
Question: Why is the offset 12 bits?
Answer: We define the size of a page to be 4096 = 2^12 bytes. We want to be able to index into any byte of a destination physical page. So, we need 12 bits to represent every possible offset.
Question: Why is a single level not good enough?
Answer: Because the page table itself would need 2^39 bytes of memory, which is too much! 2^39 = 2^36 * 2^3. The 2^36 comes from the number of possible index values (an address has 48 meaningful bits and we reserve 12 bits for the offset, so we have 36 bits remaining for the index). The 2^3 comes from each page table entry being an address, which is 2^3 = 8 bytes.
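The same arithmetic, as a short runnable check (the constants below are just the numbers from the answer above):

#include <cstdint>
#include <cstdio>

int main() {
    uint64_t index_bits = 48 - 12;                  // 48-bit addresses, 12-bit offset
    uint64_t entries = 1ULL << index_bits;          // 2^36 page table entries
    uint64_t bytes = entries * 8;                   // 8 bytes per entry -> 2^39 bytes
    printf("single-level page table: %llu GiB\n",
           (unsigned long long) (bytes >> 30));     // prints 512
    return 0;
}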
x86-64 page tables
As we just saw, a single level page table takes up a lot of space! We can save some space by using multiple levels. We can think of a multi-level page table structure as a tree. Multiple levels leads to less space because when we look up the physical page for a given virtual address, we may visit a branch of the tree that tells us there's actually no valid physical page for us to access. In that case, we just stop searching. So, multiple levels means we can have a sparse tree.
Consequences for virtual addresses
The x86-64 architecture uses 4 levels. This is reflected in the structure of a virtual address:
63–48 | 47–39 | 38–30 | 29–21 | 20–12 | 11–0 |
---|---|---|---|---|---|
(unused) | L4 index | L3 index | L2 index | L1 index | offset |
An x86-64 virtual address has 64 bits, but only the low 48 bits are meaningful. We have 9 bits to index into each page table level, and 12 bits for the offset. This means 16 bits are left over and unused.
Question: Why does each index get 9 bits?
Answer: Because the size of one page is 2^12 bytes, and each page table entry is an address, which is 2^3 = 8 bytes long. 2^12/2^3 = 2^9 entries per page. We want to be able to index into any given entry, which means we need 9 bits.
%cr3
We store the physical address of the top-level (L4) page table in a special register: `%cr3`.
Question: Why does `%cr3` store a physical address, when every other register stores a virtual address?
Answer: Page tables are used to convert virtual addresses to physical addresses, so if we stored our top level page table address as a virtual address then we wouldn't know how to convert it to a physical address!
The lookup process
A successful lookup (finding a physical address from a virtual address) goes as follows:
- Use `%cr3` and the L4 index from the virtual address to get the L3 page table address.
- Use the L3 page table address and the L3 index to get the L2 page table address.
- Use the L2 page table address and the L2 index to get the L1 page table address.
- Use the L1 page table address and the L1 index to get the destination physical page.
- Use the destination physical page and the offset to get the actual physical address within that destination physical page.
Flags
Each page table entry is a 64-bit value containing a physical address plus flags, with the following structure:

63 | 62–48 | 47–12 | 11–3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|
NX | (other flags/reserved) | physical address | (other flags) | U | W | P |
Bits 0–2 contain the P (present), W (writable), and U (user-accessible) flags. There are also other flags, like the NX (no-execute) bit, which prevents instructions in that page from being executed. Setting NX on stack pages is important for preventing buffer overflow attacks from running injected code!
Again, page table entries don't need an offset field, because we use the physical addresses they store only to find the start of another page table page or of a destination physical page. We use the offset bits of the virtual address to access specific locations within those pages.
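The following C++ sketch walks a 4-level page table with exactly this index arithmetic. Real hardware performs the walk itself; here a toy `physmem` array and a `read_pte` helper stand in for physical memory, and only the P flag is checked, so treat the details as illustrative assumptions rather than WeensyOS code:

#include <cstdint>
#include <cstring>

uint8_t physmem[1 << 22];                  // toy "physical memory" (4 MiB)
const uint64_t PTE_P = 1;                  // present flag (bit 0)

// Read the 8-byte page table entry at physical address pa.
uint64_t read_pte(uint64_t pa) {
    uint64_t pte;
    memcpy(&pte, &physmem[pa], sizeof(pte));
    return pte;
}

// Walk the 4-level tree rooted at the physical address stored in cr3.
// Returns the physical address for va, or -1 if some level is not present.
uint64_t lookup(uint64_t cr3, uint64_t va) {
    uint64_t table = cr3;                               // physical address of the L4 table
    for (int level = 4; level >= 1; --level) {
        int shift = 12 + 9 * (level - 1);               // 39, 30, 21, 12
        uint64_t index = (va >> shift) & 0x1FF;         // 9-bit index for this level
        uint64_t pte = read_pte(table + 8 * index);
        if (!(pte & PTE_P)) {
            return (uint64_t) -1;                       // fault: not present
        }
        table = pte & 0x000FFFFFFFFFF000UL;             // physical address bits 47-12
    }
    return table | (va & 0xFFF);                        // destination page + 12-bit offset
}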
WeensyOS
The goals of pset 4 are to add process isolation and implement the `fork` and `exit` system calls. In the handout code, there is no process isolation because each process uses the same page table!
We run WeensyOS in QEMU, which emulates hardware for an x86-64 architecture.
Kernel and user addresses
The kernel (designated as `K`) lives in addresses that start at 0x40000, while processes live in the upper addresses of virtual memory, starting at `PROC_START_ADDR` (0x100000).
We also have hardware-reserved pages (designated as `R`) in the middle of the physical address space. This is a historical artifact: Bill Gates once said "no one should ever need more than 640K of memory", and early PCs provided 1 MB of addressable memory. The reserved hardware region lives in the upper portion of that memory (between 640 KB and 1 MB).
The hardware region includes one page marked as `C` for the CGA console. The console is an instance of memory-mapped I/O: the console is not memory, but it behaves like memory; we can write output to the console by writing values to its addresses.
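As a sketch of what memory-mapped console output looks like: the standard CGA text buffer lives at physical address 0xB8000, with each screen cell holding a character byte and an attribute byte. The helper below is illustrative kernel-style code, not the WeensyOS console implementation:

#include <cstdint>

// The CGA text-mode buffer: 80x25 cells of (character, attribute) pairs.
volatile uint16_t* const console_cells = (uint16_t*) 0xB8000;

// Write character c at cell position pos (0-1999). The store targets a
// memory address, but the effect is that the character appears on screen.
void console_putc(int pos, char c) {
    console_cells[pos] = (uint16_t) (uint8_t) c | 0x0700;   // 0x07 = light gray on black
}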
Control transfer
We want to be able to switch into the kernel from user space and vice versa. All of these entry/exit points are defined in `k-exception.s`. You won't need to modify this file, but you should understand what it does.

When we have an exception (e.g., a timer interrupt or a segfault), or the user makes a syscall, we need to switch into the kernel. First, we save processor state by saving each register. The process's `%rsp` is pre-saved on the kernel stack. Then, we jump into a specific place in the kernel, which is determined by the entry code in `k-exception.s`.
When we want to give control back to the user process, we simply restore the registers we already saved.
System calls and protected control transfer
The following is a list of actions that will occur, either by the process or by the kernel, once a process invokes a system call:
- The process sets up the arguments of the system call in registers, according to the system call's calling convention.
- The process invokes the `syscall` instruction, which initiates a protected control transfer to the kernel.
- The kernel starts executing from a pre-defined entry point, and executes a handler for the system call.
- The kernel finishes processing the system call, and it picks a process to resume execution.
In user space, a process has a system call "wrapper", which is a stub function that sets up the necessary state for executing a system call and actually invokes the `syscall` instruction to transfer control to the kernel. Let's take a look at the following system call, `sys_getsysname`; this is its wrapper in user space, defined in `process.hh`:
inline int sys_getsysname(char* buf) {
    register uintptr_t rax asm("rax") = SYSCALL_GETSYSNAME;
    asm volatile ("syscall"
                  : "+a" (rax), "+D" (buf)
                  :
                  : "cc", "rcx", "rdx", "rsi",
                    "r8", "r9", "r10", "r11", "memory");
    return rax;
}
The list of clobbers `cc`, `rcx`, ... at the very end of the inline assembly tells the compiler that all these registers (plus the condition codes and memory) may be destroyed by the system call. Because a system call is different from a normal function call, the compiler needs to be explicitly informed about its calling convention.
The kernel's handler for this system call is located in `kernel.cc`, within the `syscall()` function. The relevant part is shown below:
case SYSCALL_GETSYSNAME: {
    const char* osname = "DemoOS 61.61";
    char* buf = (char*) current->regs.reg_rdi;
    strcpy(buf, osname);
    return 0;
}
It appears that the kernel is getting an argument of the system call, namely the buffer where the system name should be written, from `current->regs.reg_rdi`. Let's break down how this works:
- Each process has a process descriptor structure, maintained by the kernel.
- When a `syscall` instruction is invoked, the kernel takes over, and it can access the invoking process's descriptor via the pointer `current`.
- The process descriptor maintains a lot of information about the process, including the register state of the process right before the protected control transfer occurred. Some work is done by the kernel to copy all that information into the process descriptor (in this case, all register values are copied into a struct called `regs` within the process descriptor); more on that later.
- Since register `%rdi` points to the buffer before `syscall` is executed (the normal x86-64 calling convention for the first argument), the kernel can access its value using `current->regs.reg_rdi`.
Let's explain in detail how the kernel saves all this information to the process descriptor after a `syscall` instruction gets invoked. The kernel doesn't start executing the `syscall()` function directly once the protected control transfer occurs. The actual entry point for `syscall` instructions is defined in `k-exception.S`, line 135:
syscall_entry:
movq %rsp, KERNEL_STACK_TOP - 16 // save entry %rsp to kernel stack
movq $KERNEL_STACK_TOP, %rsp // change to kernel stack
// structure used by `iret`:
pushq $(SEGSEL_APP_DATA + 3) // %ss
subq $8, %rsp // skip saved %rsp
pushq %r11 // %rflags
pushq $(SEGSEL_APP_CODE + 3) // %cs
pushq %rcx // %rip
// other registers:
subq $8, %rsp // error code unused
pushq $-1 // reg_intno
pushq %gs
pushq %fs
pushq %r15 // callee saved
pushq %r14 // callee saved
pushq %r13 // callee saved
pushq %r12 // callee saved
subq $8, %rsp // %r11 clobbered by `syscall`
pushq %r10
pushq %r9
pushq %r8
pushq %rdi
pushq %rsi
pushq %rbp // callee saved
pushq %rbx // callee saved
pushq %rdx
subq $8, %rsp // %rcx clobbered by `syscall`
pushq %rax
// load kernel page table
movq $kernel_pagetable, %rax
movq %rax, %cr3
// call syscall()
movq %rsp, %rdi
call _Z7syscallP8regstate
// load process page table
movq current, %rcx
movq (%rcx), %rcx
movq %rcx, %cr3
// skip over other registers
addq $(8 * 19), %rsp
// return to process
iretq
We can see that the kernel eventually calls the `syscall()` function before returning to the process, but a lot of setup work is done before the function call. Here is a high-level summary of what this setup accomplishes:
- Save and set `%rsp` so that the kernel starts using its own stack instead of the process's stack.
- Set up a structure, on the kernel stack, used by the `iretq` instruction to return to the process after the system call finishes.
- Push all other user registers onto the kernel stack.
- Up until this point the kernel has been running with the process's page table; switch the hardware (by setting `%cr3`) to use the kernel page table.
- Call the C++ function `syscall()`, using the register structure we just saved on the stack as its argument.
- Restore the process's page table, and return to the process.
In the `syscall()` function we copy the register structure passed in to the `regs` field in the process descriptor.

When a system call "returns", you can think of it as all register values in `current->regs` getting copied (restored) to the actual processor's registers before the process resumes execution.
System call handling is a bit like programming with exceptions. If you have experience with exceptions in another programming language, it may help you understand system calls. Every time a new system call occurs and the handler gets executed, the handler is not aware of any prior invocations of the same handler, unless mediated by some other state explicitly managed by the handler. The entire kernel stack gets "blown away" once a system call "returns". You can also think of it as an event-driven programming model.
Types of protected control transfers
- Interrupts: caused by non-CPU hardware (e.g. timer)
- Traps: caused by software intentionally (syscall)
- Faults: caused by software error
It's worth pointing out that all these control transfers are administered by the x86 CPU.
When a process "returns" from a protected control transfer, kernel restores
its %rip
register to point to:
- Interrupts: the next instruction that hasn't been executed yet.
- Traps: the next instruction following the trap.
- Faults: the problematic instruction causing the fault.
TL;DR: If a process enters the kernel because of an interrupt or trap, then once it resumes execution it will pick up from where it left off. If a process enters the kernel because of a fault, then it will retry the faulty instruction if it resumes.
Q: How can the kernel, in the system call handler, simply overwrite register values from the process? Won't that mess up the process's state and cause it to crash?
Answer: System calls are traps (also called synchronous events), which means they are explicitly invoked by the process; the process expects a control transfer to occur and respects the calling convention of such control transfers. This is in contrast to interrupts and faults, where such control transfers occur without the process's knowledge. In the case of system calls, the calling convention designates certain registers to be overwritten by the kernel to convey results of the system call (`%rax`, for example, holds the return value of the system call), so the kernel is free to overwrite these registers. In fact, without using these registers, it becomes rather difficult to convey the results of a system call unless the kernel exposes parts of its own memory to the process. It is worth noting, though, that the kernel can't just overwrite process registers arbitrarily.
Let's take a look at an example of an interrupt next: the timer interrupt (`kernel.cc:241`):
case INT_TIMER:
    ++ticks;
    schedule();
    break; /* will not be reached */
The code above is located in an exception handler, which is similar to the
system call handler in terms of how it saves the process's state to the
process descriptor. We see it simply increments a ticks
variable, and calls
the schedule()
function, which picks another process to execute on the
processor. Note that in this interrupt handling code no modifications were
made to the process's register state. The process does not expect a timer
interrupt to occur so we had better make them transparent to the process. By
not modifying any registers we achieve this goal.
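For reference, `schedule()` implements a simple round-robin policy. A sketch of what such a scheduler looks like appears below; the names `ptable`, `NPROC`, `P_RUNNABLE`, and `run()` follow WeensyOS conventions, but treat the details as assumptions about the handout code rather than its exact text:

// Round-robin scheduling sketch: starting after the current process,
// scan the process table until a runnable process is found, then run it.
void schedule() {
    pid_t pid = current->pid;
    while (true) {
        pid = (pid + 1) % NPROC;               // try the next process slot
        if (ptable[pid].state == P_RUNNABLE) {
            run(&ptable[pid]);                 // context-switches; does not return
        }
    }
}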
The `sys_yield` system call, which is similar to the timer interrupt, has the following relevant code in the system call handler:
case SYSCALL_YIELD:
    current->regs.reg_rax = 0;
    schedule(); // does not return
Here we can modify the process's `%rax` state because, again, this is a system call: the process expects the control transfer to occur and expects the kernel to overwrite the value in `%rax`.
So, we have a super well-isolated operating system, called DemoOS, and there is absolutely nothing a program can do to take over the machine.
Absolutely nothing.
Or, is there?
Alice and Eve
We now look at two programs written by two rivals, Alice and Eve. They will be running on DemoOS.
This is Alice, in `p-alice.cc`:
#include "process.hh"
#include "lib.hh"
void process_main() {
char buf[128];
sys_getsysname(buf);
app_printf(1, "Welcome to %s\n", buf);
unsigned i = 0;
while (1) {
++i;
if (i % 512 == 0) {
app_printf(1, "Hi, I'm Alice! #%d\n", i / 512);
}
sys_yield();
}
}
And this is Eve, in `p-eve.cc`:
#include "process.hh"
#include "lib.hh"
void process_main() {
unsigned i = 0;
while (1) {
++i;
if (i % 512 == 0) {
app_printf(0, "Hi, I'm Eve! #%d\n", i / 512);
}
sys_yield();
}
}
We can see that both Alice and Eve contain infinite loops, but the programs are being nice! By explicitly invoking the `sys_yield()` system call, they are "yielding" precious CPU time (i.e., letting another process run) to each other so that both of them can make progress.
If our DemoOS is as wonderful as we claimed, there is absolutely nothing Eve can do to prevent Alice from running, and vice versa. The two always get about equal share of CPU resources.
Attack 1: No yielding any more
If Eve stops calling `sys_yield()`, then Eve no longer actively yields the CPU to any other program, and she takes over the entire machine.
This attack succeeds because the timer interrupt is not yet properly configured. After we initialize the timer interrupt, Eve no longer gets to take over the machine. Alice gets a chance to run again, although Eve still visibly uses more of the machine's resources by not yielding. Strictly speaking this doesn't provide fairness, but it is a reasonable policy an OS may choose to implement.
Attack 2: Disabling interrupts
Eve disables interrupts by using the `cli` instruction. Alice again gets no CPU time after Eve successfully executes this instruction.
We fix this by not allowing processes to control interrupts. Now Eve is not able to execute `cli`; if she tries, she will crash.
When Eve attempts to execute the no-longer-allowed `cli` instruction, the hardware generates a fault and transfers control to the kernel. We made the kernel handle this fault as an exception, and it falls under case `INT_GPF`, or general protection fault. This is a catch-all default type of fault that the hardware throws when the reason for the error doesn't fall under any of the more specific types of fault.
Attack 3: I crash, you crash (divide by zero)
Eve then changes her program to contain a divide-by-zero error. When that instruction executes, DemoOS crashes and nobody gets to run.
As it turns out, a divide-by-zero error triggers another hardware exception that we did not handle. We don't really want to list all the exceptions one by one and write code to handle them all, because for many of them we do the exact same thing: kill the faulting process. We may be tempted to just add this code in the `default` case: for all unexpected exceptions, just kill the process.
We need to be careful here, though: we should really only do this if an exception stems from problems in user code. If an exception arises while the kernel itself is running, it usually indicates a serious bug in the kernel, and it's a bad idea to carry on with life as if it never happened. This can be done by adding a simple check under the `default` case:
default:
    if ((regs->reg_cs & 3) != 0) {
        current->state = P_BROKEN;
    } else {
        panic("Unexpected exception %d!\n", regs->reg_intno);
    }
    break;
The least significant 2 bits of the `%cs` register store the privilege level the processor was running at before the fault occurred.
We can also have more fun with Eve. Imagine that instead of crashing Eve's program when a divide-by-zero occurs, we confuse her. We can do this by handling the divide-by-zero exception this way:
case INT_DIVIDE:
    current->regs.reg_rax = 61;
    current->regs.reg_rip += 2;
    break;
As per the x86-64 specification, this should be enough to convince Eve that anything divided by zero is always 61. (The specification of the `idiv` instruction says that the quotient of the division is stored in `%rax`.) We also incremented Eve's `%rip` because divide-by-zero is a fault, and the `%rip` saved by the kernel will point to the faulting instruction, which is the division instruction. Without changing `%rip`, Eve would re-execute the `idiv` instruction once she resumes. We move past the divide instruction by adding 2 to `%rip` (this `idiv` instruction is 2 bytes long). This shows the control and power the kernel has over a process.
Attack 4: Jump-to-kernel
Eve now examines the kernel assembly and finds out that the `syscall` entry point is located at address 0x40ac6. She then makes her program write two magical bytes to that location: `0xeb 0xfe`. The two bytes form an evil instruction that jumps to itself: another infinite loop attack! Now whenever a system call is made, the kernel enters an infinite loop, and the machine hangs.
Infinite loops in the kernel are particularly disastrous because the kernel usually runs with interrupts disabled.
This attack can succeed because DemoOS doesn't properly implement kernel memory isolation -- a user process can access (read and write) any kernel memory that's mapped in the process's address space. We isolate the kernel by setting the proper permissions for kernel memory:
for (vmiter it(kernel_pagetable, 0);
     it.va() < MEMSIZE_PHYSICAL;
     it += PAGESIZE) {
    // Don't set the U bit, except for the console page
    if (it.va() != (uintptr_t) console) {
        it.map(it.pa(), PTE_P | PTE_W);
    }
}
After adding protection for kernel memory, Eve crashes after attempting to overwrite the `syscall` entry point, and Alice gets to run until the end.
Epilogue
Process isolation is still far from achieved in DemoOS. There are several kinds of attacks Eve can still perform against Alice. Just to name a couple:
- Eve can clobber Alice's memory
- Fork bomb...
Thanks to William from MIT for playing Eve.
Confused deputy attack
A confused deputy attack occurs when an attacker with low privilege convinces a privileged "deputy" to complete the attack on its behalf.
In the context of operating systems, a process is unprivileged, while the kernel has full privilege and acts as a privileged deputy by handling system calls. A confused deputy attack may occur if the process, by invoking system calls, can somehow convince the kernel to carry out a privileged attack on its behalf.
Certain system calls are vulnerable to such attacks, but others are not. Very simple system calls like `sys_getpid` usually aren't susceptible to these attacks because they don't change the state of the kernel or other processes at all -- `sys_getpid` simply copies the current process's `pid` value to its `%rax` register, and that's it. It's so simple that there is little room for bad things to happen.
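For comparison, the handler for `sys_getpid` is essentially a one-liner like the following (a sketch of a WeensyOS-style handler; the exact handout code may differ slightly):

case SYSCALL_GETPID:
    // No user-supplied pointers, no kernel state changed: just report the pid,
    // which the syscall return path places into the process's %rax.
    return current->pid;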
Recall that at the end of the last lecture, Eve tried to overwrite the `syscall` entry point in kernel memory with an infinite loop. We fixed this by denying access to kernel memory from the user level. Now Eve cannot just directly write to kernel memory any more, but is it possible for Eve to convince the kernel to write malicious data or code to kernel memory on her behalf?
Let's take a look at `sys_getsysname`. Could it be used by Eve to perpetrate a confused deputy attack? What if Eve simply does the following:
char* syscall_entry_addr = (char*)0x40ac6;
sys_getsysname(syscall_entry_addr);
Eve actually manages to crash the entire OS by adding just these two lines of code! The kernel did not perform any sanity checks before writing the string containing the OS name to the user-supplied buffer. In this case the buffer happens to point to the `syscall` entry point in kernel memory, and the kernel happily overwrote it. A successful confused deputy attack!
It's worth pointing out that Eve never directly wrote to any kernel memory. The kernel overwrote part of its own memory because the buffer passed by Eve points to kernel memory.
You may wonder why a buffer in Eve's address space would point to a critical component in the kernel's address space. It is true that when the kernel performs the string copy, it uses Eve's page table to perform address translation (for the destination of the copy). However, note that in DemoOS (and WeensyOS), the part of the page table mapping below `PROC_START_ADDR` (where the `syscall` entry point is located) is shared among all processes as well as the kernel. So the problematic address Eve passes to the kernel will translate into the same physical address, in kernel memory, regardless of which page table (including the kernel's page table) is used for address translation. Additionally, because processes and the kernel also share the same physical pages for mappings below `PROC_START_ADDR`, overwriting instructions there has a global effect, meaning that every process's (and the kernel's) `syscall` entry point code will get corrupted.
Eve is still not satisfied, because this attack blows away all processes in the system, including Eve itself. Eve would like a more targeted attack against only Alice. Eve could try, via a confused deputy attack, to corrupt Alice's register file in its process descriptor.
With information about the kernel's memory layout and Alice's PID, it is possible for Eve to figure out where Alice's process descriptor is located in memory. Eve then once again uses the `sys_getsysname()` system call to corrupt Alice's process descriptor; after the system call executes, Alice crashes and only Eve gets to run on the system.
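Such an attack might look like the following two lines in Eve's program. The address below is made up purely for illustration; the real location depends on the kernel's memory layout:

// Hypothetical: suppose Eve has worked out that Alice's process descriptor
// lives at this address. The kernel will then scribble "DemoOS 61.61" over
// Alice's saved registers on Eve's behalf.
char* alice_descriptor = (char*) 0x43210;   // made-up address for illustration
sys_getsysname(alice_descriptor);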
Confused deputy attacks can be far more devious and do far more damage than demonstrated here. For example, if a malicious process can somehow convince the kernel to turn on certain bits in its page table, it could suddenly gain access to kernel memory and even inspect or change other processes' memory. Instead of simply causing the victim to crash, it could passively monitor and steal information from the victim, or do worse.
To prevent such attacks, the kernel should always perform checks on user-supplied inputs before acting on them. In this example of `sys_getsysname()`, we should make sure that the buffer address supplied by the user is mapped as user-accessible in the process's page table. A safer version of the `sys_getsysname()` handler should look like the following:
case SYSCALL_GETSYSNAME: {
    const char* osname = "DemoOS 61.61";
    char* buf = (char*) current->regs.reg_rdi;
    // Check that the entire span of the buffer is mapped
    // as user-accessible
    size_t len = strlen(osname) + 1;
    size_t i = 0;
    for (vmiter it(current, (uintptr_t) buf);
         i < len;
         it += 1, ++i) {
        if (!it.user()) {
            return -1;
        } else {
            // Performs the actual copy
            // Note that the destination of the copy is
            // address-translated using the process's
            // page table.
            *((char*) it.pa()) = osname[i];
        }
    }
    return 0;
}
We briefly note here that `sys_page_alloc()` has a similar vulnerability. Eve could request to map a new physical page at the virtual page containing the `syscall` entry point instructions in her address space. This only affects Eve's page table and has no global effect, but it grants Eve control over her own `syscall` entry point, and Eve can put arbitrary code there. The next time Eve executes a system call, the inserted code within the newly mapped physical page will run in privileged mode, and it can easily compromise the entire machine. Note that in this case no direct corruption of kernel memory occurred, but a confused deputy attack can still take place.
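A sketch of the kind of check that closes this hole appears below: before mapping anything, the handler should verify that the requested address is page-aligned and lies entirely within the process-accessible region. The constant names follow WeensyOS conventions (`PAGESIZE`, `PROC_START_ADDR`, `MEMSIZE_VIRTUAL`), and `syscall_page_alloc` is a placeholder for the real allocation work; treat the details as assumptions about the handout code:

case SYSCALL_PAGE_ALLOC: {
    uintptr_t addr = current->regs.reg_rdi;
    // Reject addresses that are not page-aligned, that fall below
    // PROC_START_ADDR (and could therefore cover kernel memory, such as
    // the syscall entry point), or that lie beyond the virtual address space.
    if (addr % PAGESIZE != 0
        || addr < PROC_START_ADDR
        || addr >= MEMSIZE_VIRTUAL) {
        return -1;
    }
    // ... allocate a fresh physical page and map it at `addr` ...
    return syscall_page_alloc(addr);
}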