Kernel 4: Protection and isolation

System calls and protected control transfer

Once a process invokes a system call, the following actions occur, performed either by the process or by the kernel:

The process sets up arguments of the system call in registers, according to the system call's calling convention.
The process invokes the syscall instruction, which initiates a protected control transfer to the kernel.
The kernel starts executing from a pre-defined entry point, and executes a handler of the system call.
The kernel finishes processing the system call, then picks a process (possibly the same one) to resume execution.

In user space, each system call has a "wrapper": a stub function that sets up the state the system call needs and then invokes the syscall instruction to transfer control to the kernel. Let's take a look at one system call, sys_getsysname. Its user-space wrapper is defined in process.hh:

inline int sys_getsysname(char* buf) {
    register uintptr_t rax asm("rax") = SYSCALL_GETSYSNAME;
    asm volatile ("syscall"
                  : "+a" (rax), "+D" (buf)
                  :
                  : "cc", "rcx", "rdx", "rsi",
                    "r8", "r9", "r10", "r11", "memory");
    return rax;
}

The list cc, rcx, ... at the very end of the inline assembly is the clobber list. It tells the compiler that these registers (and, through cc and memory, the condition flags and memory) may be modified by the system call. Because a system call is different from a normal function call, the compiler needs to be told explicitly about its calling convention.

The kernel's handler of this system call is located in kernel.cc, within the syscall() function. The relevant part is shown below:

case SYSCALL_GETSYSNAME: {
    const char* osname = "DemoOS 61.61";
    char* buf = (char*) current->regs.reg_rdi;
    strcpy(buf, osname);
    return 0;
}

It appears that the kernel gets the system call's argument, the buffer in which to store the system name, from current->regs.reg_rdi. Let's break down how this works:

Each process has a process descriptor structure, maintained by the kernel.
When a syscall instruction is invoked, the kernel takes over, and it can access the process's process descriptor via pointer current.
The process descriptor holds a lot of information about the process, including its register state right before the protected control transfer occurred. The kernel does some work to copy that information into the process descriptor (here, all register values are copied to a struct called regs within the process descriptor); more on that later.
Since register %rdi holds the address of the buffer before syscall is executed (%rdi carries the first argument in the normal x86-64 calling convention), the kernel can read that address from current->regs.reg_rdi.
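The steps above can be mimicked in an ordinary user-space program. The following is only a sketch: the two-field regstate, the proc struct, the syscall number, and the helper fake_getsysname are all hypothetical stand-ins, and a plain function call takes the place of the real protected control transfer.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical, trimmed-down register state: only the two registers
// this example needs.
struct regstate {
    uint64_t reg_rax;   // system call number on entry
    uint64_t reg_rdi;   // first argument on entry
};

// Hypothetical process descriptor holding the saved registers.
struct proc {
    regstate regs;
};

proc* current = nullptr;            // descriptor of the process that trapped

const int SYSCALL_GETSYSNAME = 1;   // illustrative number, not DemoOS's

// Mimics the kernel side: save the entry registers into the process
// descriptor, then read the argument back out of the saved copy.
uint64_t handle_syscall(regstate* entry_regs) {
    assert(current != nullptr);
    current->regs = *entry_regs;
    switch (current->regs.reg_rax) {
    case SYSCALL_GETSYSNAME: {
        char* buf = reinterpret_cast<char*>(
            static_cast<uintptr_t>(current->regs.reg_rdi));
        std::strcpy(buf, "DemoOS 61.61");
        return 0;
    }
    default:
        return static_cast<uint64_t>(-1);
    }
}

// Mimics the user side: put the system call number in rax and the
// buffer address in rdi, then "trap" by calling the handler directly.
int fake_getsysname(proc* p, char* buf) {
    current = p;
    regstate entry{};
    entry.reg_rax = SYSCALL_GETSYSNAME;
    entry.reg_rdi = reinterpret_cast<uintptr_t>(buf);
    return static_cast<int>(handle_syscall(&entry));
}
```

Calling fake_getsysname(&p, buf) with a sufficiently large buf fills it with the system name, and p.regs.reg_rdi ends up holding the address of buf, just as current->regs.reg_rdi does in the kernel.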

Let's explain in detail how the kernel saves all this information to the process descriptor after a syscall instruction is invoked. The kernel doesn't start executing the syscall() function directly once the protected control transfer occurs. The actual entry point for the syscall instruction is defined in k-exception.S, line 135:

syscall_entry:
    movq %rsp, KERNEL_STACK_TOP - 16 // save entry %rsp to kernel stack
    movq $KERNEL_STACK_TOP, %rsp     // change to kernel stack

    // structure used by `iret`:
    pushq $(SEGSEL_APP_DATA + 3)   // %ss
    subq $8, %rsp                  // skip saved %rsp
    pushq %r11                     // %rflags
    pushq $(SEGSEL_APP_CODE + 3)   // %cs
    pushq %rcx                     // %rip

    // other registers:
    subq $8, %rsp                  // error code unused
    pushq $-1                      // reg_intno
    pushq %gs
    pushq %fs
    pushq %r15 // callee saved
    pushq %r14 // callee saved
    pushq %r13 // callee saved
    pushq %r12 // callee saved
    subq $8, %rsp                  // %r11 clobbered by `syscall`
    pushq %r10
    pushq %r9
    pushq %r8
    pushq %rdi
    pushq %rsi
    pushq %rbp // callee saved
    pushq %rbx // callee saved
    pushq %rdx
    subq $8, %rsp                  // %rcx clobbered by `syscall`
    pushq %rax

    // load kernel page table
    movq $kernel_pagetable, %rax
    movq %rax, %cr3

    // call syscall()
    movq %rsp, %rdi
    call _Z7syscallP8regstate

    // load process page table
    movq current, %rcx
    movq (%rcx), %rcx
    movq %rcx, %cr3

    // skip over other registers
    addq $(8 * 19), %rsp

    // return to process
    iretq

We can see that the kernel eventually calls the syscall() function before returning to the process, but a lot of setup work is done before the function call. Here is a high-level summary of that setup:

Save and set %rsp so that the kernel starts using its own stack instead of the process's stack.
Set up a structure, on the kernel stack, used by the iretq instruction to return to the process after the system call finishes.
Push all other user registers to the kernel stack.
Up until this point the kernel has been running on the process's page table. Switch the hardware (by setting %cr3) to use the kernel page table.
Call the C++ function syscall(), using the register structure we just saved on the stack as argument.
Restore the process's page table, and return to the process.

In the syscall() function we copy the register structure passed in to the regs field in the process descriptor.
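One way to picture the register structure that the entry code builds is as a C++ struct whose fields appear in the reverse of the push order, since the last value pushed (%rax) ends up at the lowest address. The sketch below assumes every slot is 8 bytes wide. The field names reg_rax, reg_rdi, reg_intno, reg_cs, and reg_rip appear in the kernel code shown in these notes; the remaining names are guesses that follow the same pattern.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed layout of the structure built by syscall_entry, lowest
// address first.
struct regstate {
    uint64_t reg_rax;
    uint64_t reg_rcx;      // slot skipped: %rcx clobbered by `syscall`
    uint64_t reg_rdx;
    uint64_t reg_rbx;
    uint64_t reg_rbp;
    uint64_t reg_rsi;
    uint64_t reg_rdi;
    uint64_t reg_r8;
    uint64_t reg_r9;
    uint64_t reg_r10;
    uint64_t reg_r11;      // slot skipped: %r11 clobbered by `syscall`
    uint64_t reg_r12;
    uint64_t reg_r13;
    uint64_t reg_r14;
    uint64_t reg_r15;
    uint64_t reg_fs;
    uint64_t reg_gs;
    uint64_t reg_intno;
    uint64_t reg_errcode;  // slot skipped: unused for system calls
    // structure consumed by `iretq`:
    uint64_t reg_rip;
    uint64_t reg_cs;
    uint64_t reg_rflags;
    uint64_t reg_rsp;
    uint64_t reg_ss;
};

// The 19 slots below reg_rip are what `addq $(8 * 19), %rsp` skips.
static_assert(offsetof(regstate, reg_rip) == 8 * 19,
              "iretq structure must start after 19 eight-byte slots");
static_assert(sizeof(regstate) == 8 * 24,
              "19 general slots plus 5 iretq slots");
```

Under this layout, the pointer passed to syscall() in %rdi (the value of %rsp after the last push) points at reg_rax, and the amount added to %rsp before iretq is exactly the offset of reg_rip.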

When a system call "returns", you can think of it as all register values in current->regs being copied (restored) to the processor's actual registers before the process resumes execution.

System call handling is a bit like programming with exceptions. If you have experience with exceptions in another programming language, it may help you understand system calls. Every time a new system call occurs and the handler gets executed, the handler is not aware of any prior invocations of the same handler, unless mediated by some other state explicitly managed by the handler. The entire kernel stack gets "blown away" once a system call "returns". You can also think of it as an event-driven programming model.

Types of protected control transfers

Interrupts: caused by non-CPU hardware (e.g. timer)
Traps: caused by software intentionally (syscall)
Faults: caused by software error

It's worth pointing out that all these control transfers are administered by the x86 CPU.

When a process "returns" from a protected control transfer, the kernel restores its %rip register to point to:

Interrupts: the next instruction that hasn't been executed yet.
Traps: the next instruction following the trap.
Faults: the problematic instruction causing the fault.

TL;DR: If a process enters the kernel because of an interrupt or trap, then once it resumes execution it will pick up from where it left off. If a process enters the kernel because of a fault, then it will retry the faulty instruction if it resumes.
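The rule can be written down as a tiny function. This is a sketch only: the transfer enum and the resume_rip helper are made up for illustration, and for an interrupt insn_addr is taken to mean the first instruction that had not yet executed when the interrupt arrived.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical names for the three kinds of protected control transfer.
enum class transfer { interrupt, trap, fault };

// Address at which the process resumes, given the instruction involved
// in the transfer and that instruction's length in bytes.
uintptr_t resume_rip(transfer kind, uintptr_t insn_addr, std::size_t insn_len) {
    switch (kind) {
    case transfer::trap:
        return insn_addr + insn_len;   // continue after the trapping instruction
    case transfer::fault:
        return insn_addr;              // retry the faulting instruction
    case transfer::interrupt:
        return insn_addr;              // run the instruction that was about to execute
    }
    return insn_addr;
}
```

For example, the syscall instruction is 2 bytes long, so a process that executes syscall at address 0x1000 resumes at 0x1002, while a faulting instruction at 0x1000 resumes at 0x1000.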

Q: How can the kernel, in the system call handler, simply overwrite register values from the process? Won't that mess up the process's state and cause it to crash?

Answer: System calls are traps (also called synchronous events), which means they are explicitly invoked by the process; the process expects a control transfer to occur and respects the calling convention of such transfers. This is in contrast to interrupts and faults, where control transfers occur without the process's knowledge. For system calls, the calling convention designates certain registers that the kernel may overwrite to convey the results of the system call (%rax, for example, holds the return value), so the kernel is free to overwrite these registers. In fact, without these registers it would be rather difficult to convey the results of a system call unless the kernel exposed parts of its own memory to the process. It is worth noting, though, that the kernel can't just overwrite process registers arbitrarily.

Let's take a look at an example of an interrupt next, the timer interrupt (kernel.cc:241):

case INT_TIMER:
    ++ticks;
    schedule();
    break;      /* will not be reached */

The code above is located in an exception handler, which is similar to the system call handler in how it saves the process's state to the process descriptor. It simply increments a ticks variable and calls the schedule() function, which picks another process to execute on the processor. Note that this interrupt handling code makes no modifications to the process's register state. The process does not expect a timer interrupt to occur, so we had better make the interrupt transparent to the process. By not modifying any registers we achieve this goal.

The sys_yield system call, which is similar to the timer interrupt, has the following relevant code in the system call handler:

case SYSCALL_YIELD:
    current->regs.reg_rax = 0;
    schedule();             // does not return

Here we can modify the process's saved %rax because, again, this is a system call: the process expects a control transfer to occur and expects the kernel to overwrite the value in %rax.

So, we have a super well-isolated operating system, called DemoOS, and there is absolutely nothing a program can do to take over the machine.

Absolutely nothing.

Or, is there?

Alice and Eve

We now look at two programs written by two rivals, Alice and Eve. They will be running on DemoOS.

This is Alice, in p-alice.cc:

#include "process.hh"
#include "lib.hh"

void process_main() {
    char buf[128];
    sys_getsysname(buf);
    app_printf(1, "Welcome to %s\n", buf);

    unsigned i = 0;
    while (1) {
        ++i;
        if (i % 512 == 0) {
            app_printf(1, "Hi, I'm Alice! #%d\n", i / 512);
        }
        sys_yield();
    }
}

And this is Eve, in p-eve.cc:

#include "process.hh"
#include "lib.hh"

void process_main() {
    unsigned i = 0;

    while (1) {
        ++i;
        if (i % 512 == 0) {
            app_printf(0, "Hi, I'm Eve! #%d\n", i / 512);
        }
        sys_yield();
    }
}

We can see that both Alice and Eve contain infinite loops, but the programs are being nice! By explicitly invoking the sys_yield() system call, they "yield" precious CPU time to each other (i.e. let another process run) so that both of them can make progress.

If our DemoOS is as wonderful as we claimed, there is absolutely nothing Eve can do to prevent Alice from running, and vice versa. The two always get roughly equal shares of CPU resources.

Attack 1: No yielding any more

If Eve stops calling sys_yield(), then Eve no longer actively yields the CPU to any other program, and she takes over the entire machine.

This problem occurs because the timer interrupt is not properly configured. Once the timer interrupt is initialized, Eve can no longer take over the machine: Alice gets a chance to run again, although Eve still visibly uses more of the machine's resources by not yielding. Strictly speaking this doesn't provide fairness, but it is a reasonable policy for an OS to implement.
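The effect of the timer can be seen in a toy round-robin simulation. This is a rough sketch, not DemoOS's scheduler: sim_proc, simulate, and the quantum parameter are invented for illustration, and one loop iteration stands for one unit of CPU time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a process: it either yields after every
// step of work (like Alice) or never yields (like Eve in this attack).
struct sim_proc {
    const char* name;
    bool yields;
    int steps_run;
};

// Run for total_steps units of CPU time. The running process keeps the
// CPU until it yields or, when the timer is enabled, until it has run
// for `quantum` consecutive steps.
void simulate(std::vector<sim_proc>& procs, int total_steps,
              bool timer_enabled, int quantum) {
    assert(!procs.empty() && quantum > 0);
    std::size_t cur = 0;
    int since_switch = 0;
    for (int t = 0; t < total_steps; ++t) {
        ++procs[cur].steps_run;
        ++since_switch;
        bool timer_fired = timer_enabled && since_switch >= quantum;
        if (procs[cur].yields || timer_fired) {
            cur = (cur + 1) % procs.size();   // round-robin choice
            since_switch = 0;
        }
    }
}
```

With a yielding Alice and a non-yielding Eve, running 1000 steps without the timer gives Alice a single step and Eve the other 999. With the timer enabled and a quantum of 10, Alice keeps getting turns, although Eve still receives most of the CPU time because she uses her whole quantum every time.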

Attack 2: Disabling interrupts

Eve disables interrupts by using the cli instruction. Alice again gets no CPU time after Eve successfully executes this instruction.

We fix this by not allowing processes to control interrupts. Now if Eve tries to execute cli, she crashes.

When Eve attempts to execute the no-longer-allowed cli instruction, the hardware generates a fault and transfers control to the kernel. We made the kernel handle this fault as an exception; it falls under case INT_GPF, the general protection fault. This is a catch-all type of fault that the hardware raises when the cause of the error doesn't fall under any of the more specific fault types.
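On x86, whether cli is allowed depends on the I/O privilege level (IOPL) stored in bits 12 and 13 of the flags register: the instruction faults when the current privilege level is numerically greater than IOPL. The sketch below models only that comparison. The helper names are hypothetical, and it is an assumption that DemoOS enforces the restriction through IOPL.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t EFLAGS_IF = 1ULL << 9;            // interrupt enable flag
constexpr unsigned EFLAGS_IOPL_SHIFT = 12;           // IOPL occupies bits 12-13
constexpr uint64_t EFLAGS_IOPL_MASK = 3ULL << EFLAGS_IOPL_SHIFT;

// Extract the I/O privilege level from a flags value.
unsigned iopl(uint64_t rflags) {
    return static_cast<unsigned>((rflags & EFLAGS_IOPL_MASK) >> EFLAGS_IOPL_SHIFT);
}

// True if executing `cli` at privilege level cpl would raise a general
// protection fault instead of clearing the interrupt flag.
bool cli_faults(unsigned cpl, uint64_t rflags) {
    assert(cpl <= 3);
    return cpl > iopl(rflags);
}
```

A process running at privilege level 3 with IOPL 0 faults on cli, the same process with IOPL 3 does not, and the kernel at privilege level 0 never faults.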

Attack 3: I crash, you crash (divide by zero)

Eve then changes her program to contain a divide-by-zero error. When that instruction executes, DemoOS crashes and nobody gets to run.

As it turns out, a divide-by-zero error triggers another hardware exception that we did not handle. We don't really want to list all the exceptions one by one and write code to handle them all, because for many of them we do the exact same thing: kill the faulting process. We may be tempted to just add this code in the default case: for all unexpected exceptions, just kill the process.

We need to be careful here though, as we should really only do this if an exception stems from problems in user code. If the kernel throws an exception, it usually indicates a serious bug in the kernel and it's a bad idea to carry on with life as if it never happened. This can be done by adding a simple check under the default case:

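default:
    // Parentheses are required: `!=` binds more tightly than `&`.
    if ((regs->reg_cs & 3) != 0) {
        // exception came from user code: kill the faulting process
        current->state = P_BROKEN;
    } else {
        // exception came from the kernel itself: a kernel bug
        panic("Unexpected exception %d!\n", regs->reg_intno);
    }
    break;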

The least significant 2 bits of the saved %cs register store the privilege level the processor was running at when the fault occurred.
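A small sketch of the mask follows. The selector values used in the usage note are hypothetical, and the last helper spells out how an unparenthesized `cs & 3 != 0` is parsed in C++, where `!=` binds more tightly than `&`.

```cpp
#include <cstdint>

// Privilege level recorded in the low two bits of a saved %cs value.
unsigned privilege_level(uint16_t cs) {
    return cs & 3;
}

// True if the saved %cs says the processor was not in kernel mode (level 0).
bool came_from_user(uint16_t cs) {
    return (cs & 3) != 0;
}

// How `cs & 3 != 0` is actually parsed: `3 != 0` evaluates to 1 first,
// so only the lowest bit of cs is tested.
bool unparenthesized_check(uint16_t cs) {
    return (cs & (3 != 0)) != 0;
}
```

For a selector such as 0x08 (low bits 0) both checks report kernel mode, and for 0x1B (low bits 3) both report user mode. For a selector with low bits 2, such as 0x12, only the parenthesized check reports a nonzero privilege level.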

We can also have more fun with Eve. Imagine that instead of crashing Eve's program when a divide-by-zero occurs, we try to confuse her. We can do this by handling the divide-by-zero exception this way:

case INT_DIVIDE:
    current->regs.reg_rax = 61;
    current->regs.reg_rip += 2;
    break;

As per the x86 specification, this should be enough to convince Eve that anything divided by zero is always 61. (The specification of the idiv instruction says that the quotient of the division is stored in %rax.) We also increment Eve's %rip because divide by zero is a fault, so the %rip saved by the kernel points to the faulting instruction, which is the division instruction. Without changing %rip, Eve would re-execute the idiv instruction once she resumes. We move past the divide instruction by adding 2 to %rip (the idiv instruction in Eve's program is 2 bytes long; other encodings of idiv can be longer). This shows the control and power the kernel has over a process.
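The handler in these notes hard-codes the length 2. As a hedged illustration of where that number comes from, the sketch below recognizes only the simplest encoding, idiv with a register operand: opcode 0xF7, a ModRM byte whose reg field is 7 and whose mod field is 3, optionally preceded by a one-byte REX prefix. The two-field regstate and both function names are hypothetical, and memory-operand or 8-bit forms are deliberately not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, trimmed-down saved register state.
struct regstate {
    uint64_t reg_rax;
    uint64_t reg_rip;
};

// Length in bytes of a register-operand `idiv` starting at code, or 0
// if the bytes are not that form. Reads at most three bytes.
std::size_t idiv_reg_length(const uint8_t* code) {
    assert(code != nullptr);
    std::size_t prefix = 0;
    if ((code[0] & 0xF0) == 0x40) {      // REX prefix (0x40..0x4F)
        prefix = 1;
    }
    uint8_t opcode = code[prefix];
    uint8_t modrm = code[prefix + 1];
    bool is_idiv = opcode == 0xF7 && ((modrm >> 3) & 7) == 7;
    bool reg_operand = (modrm >> 6) == 3;
    return (is_idiv && reg_operand) ? prefix + 2 : 0;
}

// Pretend the division produced 61 and skip the faulting instruction.
// Returns false, leaving regs untouched, for unrecognized encodings.
bool handle_divide(regstate& regs, const uint8_t* code) {
    std::size_t len = idiv_reg_length(code);
    if (len == 0) {
        return false;
    }
    regs.reg_rax = 61;
    regs.reg_rip += len;
    return true;
}
```

The bytes F7 F9 (idiv %ecx) are recognized as a 2-byte instruction, matching the constant in the handler above, while 48 F7 F9 (idiv %rcx) is 3 bytes, so a fixed increment of 2 would land in the middle of that instruction.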

Attack 4: Jump-to-kernel

Eve now examines the kernel assembly and finds that the syscall entry point is located at address 0x40ac6. She then makes her program write two magic bytes to that location: 0xeb 0xfe. The two bytes form an evil instruction that jumps to itself: another infinite loop attack! Now whenever a system call is made, the kernel enters an infinite loop, and the machine hangs.

An infinite loop in the kernel is particularly disastrous because the kernel usually runs with interrupts disabled.
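To see why 0xeb 0xfe jumps to itself: 0xeb is the opcode of a short jump, and the byte after it is a signed 8-bit displacement measured from the end of the 2-byte instruction. The helper below is a hypothetical name used only to spell out that arithmetic.

```cpp
#include <cstdint>

// Target of a short jump (opcode 0xEB) located at addr whose second
// byte is rel8. The displacement is relative to the next instruction,
// which starts 2 bytes after addr.
uintptr_t short_jmp_target(uintptr_t addr, uint8_t rel8) {
    intptr_t displacement = static_cast<int8_t>(rel8);   // 0xfe becomes -2
    return addr + 2 + static_cast<uintptr_t>(displacement);
}
```

With addr = 0x40ac6 and rel8 = 0xfe the displacement is -2, so the target is 0x40ac6 + 2 - 2 = 0x40ac6, the address of the jump itself.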

This attack can succeed because DemoOS doesn't properly implement kernel memory isolation -- a user process can access (read and write) any kernel memory that's mapped in the process's address space. We isolate the kernel by setting the proper permissions for kernel memory:

for (vmiter it(kernel_pagetable, 0);
     it.va() < MEMSIZE_PHYSICAL;
     it += PAGESIZE) {
    // Don't set the U bit, except for the console page
    if (it.va() != (uintptr_t) console) {
        it.map(it.pa(), PTE_P | PTE_W);
    }
}

After adding protection for kernel memory, Eve crashes after attempting to overwrite the syscall entry point, and Alice gets to run till the end.
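The effect of leaving out the U bit can be sketched as a permission check on a single page table entry. The constants below use the x86 bit positions for the present, writable, and user flags. The helper names are hypothetical, and the sketch ignores the flags in higher-level page table entries, which the hardware also takes into account.

```cpp
#include <cstdint>

constexpr uint64_t PTE_P = 1;   // present
constexpr uint64_t PTE_W = 2;   // writable
constexpr uint64_t PTE_U = 4;   // accessible from user mode

// True if unprivileged code may read a page mapped with these flags.
bool user_can_read(uint64_t pte) {
    return (pte & PTE_P) != 0 && (pte & PTE_U) != 0;
}

// True if unprivileged code may write a page mapped with these flags.
bool user_can_write(uint64_t pte) {
    return user_can_read(pte) && (pte & PTE_W) != 0;
}
```

A page mapped with PTE_P | PTE_W, as in the loop above, can be neither read nor written by a process, so Eve's write to the syscall entry point faults. A page that keeps PTE_P | PTE_W | PTE_U, like the console page, stays writable by processes.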

Epilogue

Process isolation is still far from achieved in DemoOS. There are several kinds of attacks Eve can still perform against Alice. Just to name a couple:

Eve can clobber Alice's memory
Fork bomb...

Thanks William for playing Eve from MIT.