System calls and protected control transfer
Once a process invokes a system call, the following actions occur, some performed by the process and some by the kernel:
- The process sets up arguments of the system call in registers, according to the system call's calling convention.
- The process invokes the syscall instruction, which initiates a protected control transfer to the kernel.
- The kernel starts executing from a pre-defined entry point, and executes a handler of the system call.
- The kernel finishes processing the system call, and it picks a process to resume (often, but not necessarily, the one that made the call).
In user space, a process has a system call "wrapper": a stub function that sets up the necessary state for executing a system call, and then actually invokes the syscall instruction to transfer control to the kernel.
Let's take a look at one system call, sys_getsysname. This is its wrapper in user space, defined in process.hh:
inline int sys_getsysname(char* buf) {
    register uintptr_t rax asm("rax") = SYSCALL_GETSYSNAME;
    asm volatile ("syscall"
                  : "+a" (rax), "+D" (buf)
                  :
                  : "cc", "rcx", "rdx", "rsi",
                    "r8", "r9", "r10", "r11", "memory");
    return rax;
}
The clobber list at the very end of the inline assembly (cc, rcx, ...) tells the compiler what the system call may destroy: the listed registers, the condition codes (cc), and the contents of memory (memory). Because a system call is different from a normal function call, the compiler needs to be explicitly informed about its calling convention.
The kernel's handler of this system call is located in kernel.cc, within the syscall() function. The relevant part is shown below:
case SYSCALL_GETSYSNAME: {
    const char* osname = "DemoOS 61.61";
    char* buf = (char*) current->regs.reg_rdi;
    strcpy(buf, osname);
    return 0;
}
It appears that the kernel gets the argument of the system call, which is the buffer in which to store the system name, from current->regs.reg_rdi.
Let's break down how this works:
- Each process has a process descriptor structure, maintained by the kernel.
- When a syscall instruction is invoked, the kernel takes over, and it can access the process's process descriptor via the pointer current.
- The process descriptor maintains a lot of information about the process, including the register state of the process right before the protected control transfer occurred. The kernel does some work to copy all that information to the process descriptor (in this case, all register values are copied to a struct called regs within the process descriptor); more on that later.
- Since register %rdi points to the buffer before syscall is executed (normal x86 calling convention), the kernel can access its value using current->regs.reg_rdi.
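The hand-off described in the list above can be modelled in ordinary user-space C++. This is only a sketch: the two-field register struct, the names ending in _sketch, and the system call number are made up for illustration, and the "protected control transfer" is replaced by a plain function call.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical, heavily simplified stand-ins for the kernel's structures.
struct regstate_sketch {
    std::uint64_t reg_rax;   // system call number on entry
    std::uint64_t reg_rdi;   // first argument
};

struct proc_sketch {
    regstate_sketch regs;    // register state saved at the control transfer
};

constexpr std::uint64_t SYSCALL_GETSYSNAME_SKETCH = 1;  // made-up number

proc_sketch* current_sketch = nullptr;  // process that entered the "kernel"

// Kernel side: copy the saved registers into the process descriptor, then
// read the argument back out of the descriptor, as the real handler does.
std::uint64_t syscall_sketch(const regstate_sketch& saved) {
    current_sketch->regs = saved;
    switch (current_sketch->regs.reg_rax) {
    case SYSCALL_GETSYSNAME_SKETCH: {
        char* buf = reinterpret_cast<char*>(current_sketch->regs.reg_rdi);
        std::strcpy(buf, "DemoOS 61.61");
        return 0;
    }
    default:
        return static_cast<std::uint64_t>(-1);  // unknown system call
    }
}

// Process side: put the arguments where the handler expects to find them.
int getsysname_sketch(proc_sketch& p, char* buf) {
    current_sketch = &p;
    regstate_sketch regs{SYSCALL_GETSYSNAME_SKETCH,
                         reinterpret_cast<std::uint64_t>(buf)};
    return static_cast<int>(syscall_sketch(regs));
}
```

Calling getsysname_sketch(p, buf) with a 128-byte buf leaves the system name in buf, and p.regs.reg_rdi holds the buffer's address, which is the value the handler read.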
Let's explain in detail how the kernel saves all this information to the process descriptor after a syscall instruction is invoked. The kernel doesn't start executing the syscall() function directly once the protected control transfer occurs. The actual entry point for syscall instructions is defined in k-exception.S, line 135:
syscall_entry:
movq %rsp, KERNEL_STACK_TOP - 16 // save entry %rsp to kernel stack
movq $KERNEL_STACK_TOP, %rsp // change to kernel stack
// structure used by `iret`:
pushq $(SEGSEL_APP_DATA + 3) // %ss
subq $8, %rsp // skip saved %rsp
pushq %r11 // %rflags
pushq $(SEGSEL_APP_CODE + 3) // %cs
pushq %rcx // %rip
// other registers:
subq $8, %rsp // error code unused
pushq $-1 // reg_intno
pushq %gs
pushq %fs
pushq %r15 // callee saved
pushq %r14 // callee saved
pushq %r13 // callee saved
pushq %r12 // callee saved
subq $8, %rsp // %r11 clobbered by `syscall`
pushq %r10
pushq %r9
pushq %r8
pushq %rdi
pushq %rsi
pushq %rbp // callee saved
pushq %rbx // callee saved
pushq %rdx
subq $8, %rsp // %rcx clobbered by `syscall`
pushq %rax
// load kernel page table
movq $kernel_pagetable, %rax
movq %rax, %cr3
// call syscall()
movq %rsp, %rdi
call _Z7syscallP8regstate
// load process page table
movq current, %rcx
movq (%rcx), %rcx
movq %rcx, %cr3
// skip over other registers
addq $(8 * 19), %rsp
// return to process
iretq
We can see that the kernel eventually calls the syscall() function before returning to the process, but a lot of setup work is done before the function call. Here is a high-level summary of that setup:
- Save and set %rsp so that the kernel starts using its own stack, instead of the process's stack.
- Set up a structure, on the kernel stack, used by the iretq instruction to return to the process after the system call finishes.
- Push all other user registers to the kernel stack.
- Up until this point the kernel has been running on the process's page table; switch the hardware (by setting %cr3) to use the kernel page table.
- Call the C++ function syscall(), using the register structure we just saved on the stack as its argument.
- Switch back to the process's page table, and return to the process.
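The register structure built by the entry code can be written down as a C++ struct whose fields appear in increasing address order, that is, in the reverse of the push order. The sketch below is reconstructed from the assembly listing only. Apart from reg_rax, reg_rdi, reg_rip, reg_cs and reg_intno, which appear elsewhere in these notes, the struct name and field names are guesses. The static assertions check that the `addq $(8 * 19), %rsp` before iretq skips exactly the slots below the saved %rip.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed layout; lowest address first (the last value pushed).
struct regstate_layout {
    std::uint64_t reg_rax;      // pushed last
    std::uint64_t reg_rcx;      // slot skipped: %rcx clobbered by `syscall`
    std::uint64_t reg_rdx;
    std::uint64_t reg_rbx;
    std::uint64_t reg_rbp;
    std::uint64_t reg_rsi;
    std::uint64_t reg_rdi;
    std::uint64_t reg_r8;
    std::uint64_t reg_r9;
    std::uint64_t reg_r10;
    std::uint64_t reg_r11;      // slot skipped: %r11 clobbered by `syscall`
    std::uint64_t reg_r12;
    std::uint64_t reg_r13;
    std::uint64_t reg_r14;
    std::uint64_t reg_r15;
    std::uint64_t reg_fs;
    std::uint64_t reg_gs;
    std::uint64_t reg_intno;    // -1 for system calls
    std::uint64_t reg_errcode;  // unused for system calls
    // Structure consumed by `iretq`:
    std::uint64_t reg_rip;      // taken from %rcx
    std::uint64_t reg_cs;
    std::uint64_t reg_rflags;   // taken from %r11
    std::uint64_t reg_rsp;
    std::uint64_t reg_ss;       // pushed first, highest address
};

// 19 eight-byte slots sit below the iretq structure.
static_assert(offsetof(regstate_layout, reg_rip) == 8 * 19,
              "iretq frame must start 19 slots above the saved %rax");
static_assert(sizeof(regstate_layout) == 8 * 24,
              "24 eight-byte slots in total");

// Number of bytes to add to %rsp so that it points at the saved %rip.
constexpr std::size_t iret_frame_offset() {
    return offsetof(regstate_layout, reg_rip);
}
```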
In the syscall() function we copy the register structure passed in to the regs field in the process descriptor.
When a system call "returns", you can think of it as all register values in current->regs getting copied (or restored) to the actual processor's registers before the process resumes execution.
System call handling is a bit like programming with exceptions. If you have experience with exceptions in another programming language, it may help you understand system calls. Every time a new system call occurs and the handler gets executed, the handler is not aware of any prior invocations of the same handler, unless mediated by some other state explicitly managed by the handler. The entire kernel stack gets "blown away" once a system call "returns". You can also think of it as an event-driven programming model.
Types of protected control transfers
- Interrupts: caused by non-CPU hardware (e.g. timer)
- Traps: caused by software intentionally (syscall)
- Faults: caused by software error
It's worth pointing out that all these control transfers are administered by the x86 CPU.
When a process "returns" from a protected control transfer, the kernel restores its %rip register to point to:
- Interrupts: the next instruction that hasn't been executed yet.
- Traps: the next instruction following the trap.
- Faults: the problematic instruction causing the fault.
TL;DR: If a process enters the kernel because of an interrupt or trap, then once it resumes execution it will pick up from where it left off. If a process enters the kernel because of a fault, then it will retry the faulty instruction if it resumes.
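The rule for the saved %rip can be stated as a small function. This is an illustrative model, not kernel code; the enum and function names are made up. The syscall instruction itself is 2 bytes long (0x0f 0x05), so a process that traps at address A resumes at A + 2.

```cpp
#include <cstdint>

enum class transfer_kind { interrupt, trap, fault };

// `rip` is the address of the instruction the event is associated with:
// the trapping or faulting instruction, or, for an interrupt, the next
// instruction that has not started executing yet.
// `length` is the size of that instruction in bytes.
constexpr std::uint64_t saved_rip(transfer_kind kind, std::uint64_t rip,
                                  unsigned length) {
    switch (kind) {
    case transfer_kind::trap:
        return rip + length;   // resume after the trapping instruction
    case transfer_kind::interrupt:
    case transfer_kind::fault:
        return rip;            // run (or re-run) the instruction at rip
    }
    return rip;
}
```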
Q: How can the kernel, in the system call handler, simply overwrite register values from the process? Won't that mess up the process's state and cause it to crash?
Answer: System calls are traps (also called synchronous events), which means they are explicitly invoked by the process, and the process expects a control transfer to occur and respects the calling convention of such control transfers. This is in contrast to interrupts and faults, where such control transfers occur without the process's knowledge. In the case of system calls, the calling convention designates certain registers to be overwritten by the kernel to convey information regarding results of the system call (%rax, for example, holds the return value of the system call), so the kernel is free to overwrite these registers. In fact, without using these registers, it becomes rather difficult to convey results of a system call unless the kernel exposes parts of its own memory to the process. It is worth noting though that the kernel can't just overwrite process registers arbitrarily.
Let's take a look at an example of an interrupt next, the timer interrupt (kernel.cc:241):
case INT_TIMER:
    ++ticks;
    schedule();
    break; /* will not be reached */
The code above is located in an exception handler, which is similar to the system call handler in terms of how it saves the process's state to the process descriptor. We see it simply increments a ticks variable, and calls the schedule() function, which picks another process to execute on the processor. Note that in this interrupt handling code no modifications were made to the process's register state. The process does not expect a timer interrupt to occur, so we had better make the interrupt transparent to the process. By not modifying any registers we achieve this goal.
The sys_yield system call, which is similar to the timer interrupt, has the following relevant code in the system call handler:
case SYSCALL_YIELD:
    current->regs.reg_rax = 0;
    schedule(); // does not return
Here we can modify the process's %rax state because again, it is a system call, and the process expects the occurrence of a control transfer and the overwriting of the value in %rax by the kernel.
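The difference between the two handlers can be checked in a small user-space model. The struct and function names below are hypothetical and the saved state is reduced to three registers. The timer path leaves the saved registers untouched, while the yield path changes only the register that the calling convention reserves for the result.

```cpp
#include <cstdint>

// Hypothetical reduced register state saved in a process descriptor.
struct saved_regs {
    std::uint64_t reg_rax;
    std::uint64_t reg_rdi;
    std::uint64_t reg_rip;
};

inline bool same_regs(const saved_regs& a, const saved_regs& b) {
    return a.reg_rax == b.reg_rax && a.reg_rdi == b.reg_rdi
        && a.reg_rip == b.reg_rip;
}

// Timer interrupt: the process did not ask for it, so its saved registers
// must come back exactly as they were. Only kernel state (ticks) changes.
inline void on_timer(saved_regs& regs, unsigned long& ticks) {
    (void) regs;   // deliberately not modified
    ++ticks;
}

// Yield system call: the process expects %rax to carry the result.
inline void on_yield(saved_regs& regs) {
    regs.reg_rax = 0;
}
```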
So, we have a super well-isolated operating system, called DemoOS, and there is absolutely nothing a program can do to take over the machine.
Absolutely nothing.
Or, is there?
Alice and Eve
We now look at two programs written by two rivals, Alice and Eve. They will be running on DemoOS.
This is Alice, in p-alice.cc:
#include "process.hh"
#include "lib.hh"
void process_main() {
    char buf[128];
    sys_getsysname(buf);
    app_printf(1, "Welcome to %s\n", buf);
    unsigned i = 0;
    while (1) {
        ++i;
        if (i % 512 == 0) {
            app_printf(1, "Hi, I'm Alice! #%d\n", i / 512);
        }
        sys_yield();
    }
}
And this is Eve, in p-eve.cc:
#include "process.hh"
#include "lib.hh"
void process_main() {
    unsigned i = 0;
    while (1) {
        ++i;
        if (i % 512 == 0) {
            app_printf(0, "Hi, I'm Eve! #%d\n", i / 512);
        }
        sys_yield();
    }
}
We can see that both Alice and Eve contain infinite loops, but the programs are being nice! By explicitly invoking the sys_yield() system call, they are "yielding" precious CPU time to each other (i.e. each lets the other process run) so that both of them can make progress.
If our DemoOS is as wonderful as we claimed, there is absolutely nothing Eve can do to prevent Alice from running, and vice versa. The two always get about equal shares of CPU resources.
Attack 1: No yielding any more
If Eve stops calling sys_yield(), then Eve no longer actively yields the CPU to any other program, and it takes over the entire machine.
This problem occurs because the timer interrupt is not properly configured. After we initialize the timer interrupt, Eve no longer gets to take over the machine. Alice gets a chance to run again, although Eve still visibly uses more of the machine's resources by not yielding. Strictly speaking this doesn't provide fairness, but it is a reasonable policy an OS may choose to implement.
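A toy simulation makes the effect of the timer concrete. It is a sketch under simplifying assumptions, none of which come from DemoOS itself: time advances in abstract units, Alice yields after every unit, the timer preempts a process once it has run timer_period units in a row, and a period of 0 means the timer is not configured.

```cpp
struct cpu_share {
    unsigned alice;
    unsigned eve;
};

// Simulates two processes sharing one CPU for `total_units` time units.
// Alice always yields after one unit; Eve yields only if `eve_yields`.
// A nonzero `timer_period` preempts whoever has run that many units in a row.
inline cpu_share simulate(unsigned total_units, unsigned timer_period,
                          bool eve_yields) {
    cpu_share share{0, 0};
    bool alice_running = true;   // Alice is scheduled first
    unsigned slice = 0;          // units run since the last switch
    for (unsigned t = 0; t < total_units; ++t) {
        if (alice_running) {
            ++share.alice;
        } else {
            ++share.eve;
        }
        ++slice;
        bool yields = alice_running || eve_yields;
        bool preempted = timer_period != 0 && slice >= timer_period;
        if (yields || preempted) {
            alice_running = !alice_running;
            slice = 0;
        }
    }
    return share;
}
```

With 1000 units, simulate(1000, 0, true) splits the CPU evenly, simulate(1000, 0, false) gives Alice a single unit before Eve takes over for good, and simulate(1000, 10, false) lets Alice run again after each of Eve's timer-limited slices, although Eve still gets most of the CPU.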
Attack 2: Disabling interrupts
Eve disables interrupts by using the cli instruction. Alice again gets no CPU time after Eve successfully executes this instruction.
We fix it by disallowing processes to control interrupts. Now Eve is not able to execute cli, or it will crash.
When Eve attempts to execute the no-longer-allowed cli instruction, the hardware generates a fault and transfers control to the kernel. We made the kernel handle this fault as an exception, and it falls under case INT_GPF, or general protection fault. This is a catch-all default type of fault the hardware throws when the reason for the error doesn't fall under any of the more specific types of fault.
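These notes do not say how processes are stopped from controlling interrupts. On x86-64, one standard mechanism is the I/O privilege level (IOPL) field in bits 12 and 13 of %rflags: outside the virtual-interrupt modes, cli raises a general protection fault when the current privilege level (the low 2 bits of %cs) is numerically greater than IOPL. The sketch below models that rule, assuming DemoOS relies on this mechanism. The function names and the selector values in the usage note are made up.

```cpp
#include <cstdint>

// IOPL occupies bits 12-13 of %rflags.
constexpr unsigned iopl_of(std::uint64_t rflags) {
    return static_cast<unsigned>((rflags >> 12) & 3);
}

// The current privilege level is the low 2 bits of %cs.
constexpr unsigned cpl_of(std::uint64_t cs) {
    return static_cast<unsigned>(cs & 3);
}

// True if executing `cli` would raise a general protection fault.
constexpr bool cli_faults(std::uint64_t rflags, std::uint64_t cs) {
    return cpl_of(cs) > iopl_of(rflags);
}
```

For example, a process running at privilege level 3 (say with %cs = 0x1b) can execute cli when its %rflags has IOPL set to 3 (0x3000), and faults when IOPL is 0. Kernel code at privilege level 0 never faults on this check.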
Attack 3: I crash, you crash (divide by zero)
Eve then changes her program to contain a divide-by-zero error. When that instruction executes, DemoOS crashes and nobody gets to run.
As it turns out, a divide-by-zero error triggers another hardware exception that we did not handle. We don't really want to list all the exceptions one by one and write code to handle them all, because for many of them we do the exact same thing: kill the faulting process. We may be tempted to just add this code in the default case: for all unexpected exceptions, just kill the process.
We need to be careful here though, as we should really only do this if an exception stems from problems in user code. If the kernel throws an exception, it usually indicates a serious bug in the kernel and it's a bad idea to carry on with life as if it never happened. This can be done by adding a simple check under the default case:
default:
    // `!=` binds tighter than `&`, so the mask needs its own parentheses
    if ((regs->reg_cs & 3) != 0) {
        current->state = P_BROKEN;
    } else {
        panic("Unexpected exception %d!\n", regs->reg_intno);
    }
    break;
The least significant 2 bits of the %cs register store the privilege level the processor was running at before the fault occurred.
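The privilege check can be sketched as a helper function. In C++, `!=` binds more tightly than `&`, so `cs & 3 != 0` parses as `cs & (3 != 0)`, which tests only bit 0. That gives the right answer for privilege levels 0 and 3 but the wrong one for level 2. The function names and selector values below are illustrative only.

```cpp
#include <cstdint>

// True if the saved %cs shows the exception came from non-kernel code.
constexpr bool from_user_mode(std::uint64_t cs) {
    return (cs & 3) != 0;
}

// What the unparenthesized expression `cs & 3 != 0` actually computes.
constexpr bool without_parentheses(std::uint64_t cs) {
    return (cs & static_cast<std::uint64_t>(3 != 0)) != 0;
}
```

With a hypothetical selector 0x12 (privilege level 2), from_user_mode returns true while without_parentheses returns false.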
We can also have more fun with Eve. Imagine that, instead of crashing Eve's program when a divide-by-zero occurs, we try to confuse her. We can do this by handling the divide-by-zero exception this way:
case INT_DIVIDE:
    current->regs.reg_rax = 61;
    current->regs.reg_rip += 2;
    break;
As per the x86 specification, this should be enough to convince Eve that anything divided by zero is always 61. (The specification of the idiv instruction says that the quotient of the division is stored in %rax.) We also incremented Eve's %rip because divide by zero is a fault, and the %rip saved by the kernel will point to the faulty instruction, which is the division instruction. Without changing %rip, Eve would re-execute the idiv instruction once it resumes. We move past the divide instruction by adding 2 to %rip (the idiv instruction in Eve's program is 2 bytes long; other encodings of idiv, for example with a 64-bit or memory operand, are longer). This shows the control and power the kernel has over a process.
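Adding a fixed 2 to %rip only works when the faulting idiv happens to be 2 bytes long. A more careful handler could inspect the instruction bytes first. The sketch below recognizes only the register-operand forms (opcode 0xf6 or 0xf7 with ModRM reg field 7 and mod field 3, optionally preceded by a REX prefix). It ignores other prefixes and memory operands, and the function name is made up.

```cpp
#include <cstddef>
#include <cstdint>

// Returns the length in bytes of an `idiv` with a register operand that
// starts at `code`, or 0 if the bytes are not such an instruction.
// `avail` is the number of readable bytes at `code`.
inline std::size_t idiv_reg_length(const std::uint8_t* code,
                                   std::size_t avail) {
    std::size_t i = 0;
    if (i < avail && (code[i] & 0xF0) == 0x40) {
        ++i;                                   // optional REX prefix
    }
    if (i + 1 >= avail) {
        return 0;                              // need opcode and ModRM bytes
    }
    bool opcode_ok = code[i] == 0xF6 || code[i] == 0xF7;
    std::uint8_t modrm = code[i + 1];
    bool is_idiv = ((modrm >> 3) & 7) == 7;    // /7 selects idiv
    bool reg_operand = (modrm >> 6) == 3;      // mod == 3: register operand
    return (opcode_ok && is_idiv && reg_operand) ? i + 2 : 0;
}
```

For the bytes 0xf7 0xf9 (idiv %ecx) this returns 2, matching the constant used above. For 0x48 0xf7 0xf9 (idiv %rcx) it returns 3, a case where adding 2 would land in the middle of the instruction.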
Attack 4: Jump-to-kernel
Eve now examines the kernel assembly and finds out that the syscall entry point is located at address 0x40ac6. She then makes her program write two magical bytes to that location: 0xeb 0xfe. The two bytes form an evil instruction that jumps to itself: another infinite loop attack! Now whenever a system call is made, the kernel enters an infinite loop, and the machine hangs.
Infinite loops in the kernel are particularly disastrous because the kernel usually runs with interrupts disabled.
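To see why 0xeb 0xfe jumps to itself: 0xeb is the opcode of a short jump whose second byte is a signed 8-bit displacement, measured from the end of the 2-byte instruction. 0xfe is -2, so the target is the instruction's own address. A small sketch of that arithmetic (the function name is illustrative):

```cpp
#include <cstdint>

// Target of a 2-byte short jump (0xeb rel8) located at address `addr`.
constexpr std::uint64_t short_jmp_target(std::uint64_t addr,
                                         std::uint8_t rel8) {
    // Sign-extend the 8-bit displacement by hand.
    return addr + 2
        + static_cast<std::uint64_t>(
              rel8 >= 0x80 ? static_cast<std::int64_t>(rel8) - 256
                           : static_cast<std::int64_t>(rel8));
}

static_assert(short_jmp_target(0x40ac6, 0xfe) == 0x40ac6,
              "0xeb 0xfe jumps to its own address");
```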
This attack can succeed because DemoOS doesn't properly implement kernel memory isolation -- a user process can access (read and write) any kernel memory that's mapped in the process's address space. We isolate the kernel by setting the proper permissions for kernel memory:
for (vmiter it(kernel_pagetable, 0);
     it.va() < MEMSIZE_PHYSICAL;
     it += PAGESIZE) {
    // Don't set the U bit, except for the console page
    if (it.va() != (uintptr_t) console) {
        it.map(it.pa(), PTE_P | PTE_W);
    }
}
After adding protection for kernel memory, Eve crashes after attempting to overwrite the syscall entry point, and Alice gets to run till the end.
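The effect of leaving out the U bit can be modelled with the x86-64 page table flag values (present = 1, writable = 2, user-accessible = 4). This is a simplified sketch: it looks only at the final combined permissions of a mapping and ignores the multi-level structure of real page tables. The helper function names are made up.

```cpp
#include <cstdint>

constexpr std::uint64_t PTE_P = 1;   // present
constexpr std::uint64_t PTE_W = 2;   // writable
constexpr std::uint64_t PTE_U = 4;   // accessible from user mode

// A user-mode read needs the page to be present and user-accessible.
constexpr bool user_can_read(std::uint64_t perm) {
    return (perm & (PTE_P | PTE_U)) == (PTE_P | PTE_U);
}

// A user-mode write additionally needs the writable bit.
constexpr bool user_can_write(std::uint64_t perm) {
    return (perm & (PTE_P | PTE_W | PTE_U)) == (PTE_P | PTE_W | PTE_U);
}
```

A page mapped with PTE_P | PTE_W, as in the loop above, fails both checks, so Eve's write to the syscall entry point faults. The console page, which keeps the U bit, remains writable by processes.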
Epilogue
Process isolation is still far from achieved in DemoOS. There are several kinds of attacks Eve can still perform against Alice. Just to name a couple:
- Eve can clobber Alice's memory
- Fork bomb...
Thanks William for playing Eve from MIT.