Kernel 2: Kernel isolation, protected control transfer

Overview

In this lecture, we investigate our own kernel, discuss kernel isolation, and walk through a protected control transfer.


Our kernel: WeensyOS

Boot files: boot*
Kernel files: kernel.cc, kernel.hh, k-*
User-mode files: u-*, p-*
Library files (shared): lib*, x86-64.h, types.h

WeensyOS commands

make run
make kill
make run-gdb + gdb -ix build/weensyos.gdb

Emulation

WeensyOS runs inside a machine emulator called QEMU
QEMU is a software program that can behave like a complete x86-64 computer system
- Interprets instructions, translates input/output to a more convenient format
- For example, instead of generating electrical signals that could be sent to a television (like the original IBM Personal Computer), QEMU’s emulated display hardware can produce commands understood by the terminal
QEMU has many other uses too
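
To make "interprets instructions" concrete, here is a minimal sketch of an interpreter loop for a toy machine. Everything in it is invented for illustration (the `toy_op` instruction set, `toy_machine`, and the `terminal` string standing in for an emulated display). It is not how QEMU is implemented, only the general fetch-decode-execute idea at a very small scale.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy instruction set, invented for this sketch (not x86-64).
enum toy_op : uint8_t { TOY_HALT = 0, TOY_LOAD = 1, TOY_ADD = 2, TOY_OUT = 3 };

struct toy_machine {
    uint8_t acc = 0;          // single accumulator register
    std::size_t pc = 0;       // program counter (index into the program)
    std::string terminal;     // where emulated "display" output ends up

    // Interpret `program` until TOY_HALT, an invalid instruction, or the end.
    void run(const std::vector<uint8_t>& program) {
        while (pc < program.size()) {
            uint8_t op = program[pc++];                       // fetch
            if (op == TOY_HALT) {
                return;
            } else if (op == TOY_OUT) {
                // "Device output": translate to a convenient host format.
                terminal.push_back(static_cast<char>(acc));
            } else if ((op == TOY_LOAD || op == TOY_ADD) && pc < program.size()) {
                uint8_t operand = program[pc++];              // one-byte operand
                acc = (op == TOY_LOAD) ? operand
                                       : static_cast<uint8_t>(acc + operand);
            } else {
                return;       // unknown or truncated instruction: stop
            }
        }
    }
};
```

A program such as `{TOY_LOAD, 'H', TOY_OUT, TOY_ADD, 1, TOY_OUT, TOY_HALT}` loads a character, emits it, adds one, and emits the result; the host sees the output as ordinary text in `terminal`.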

Booting: How a computer starts up

Computer turns on
Built-in hardware initializes the system
Built-in hardware loads a small program called the boot loader from a fixed location on attached storage (Flash memory, disk)
- Operating in a very constrained environment
Boot loader initializes the processor and loads the kernel

Watching the boot sequence

make run-gdb STOP=1

Kernel isolation

The kernel always retains full privilege over all machine operations
Unprivileged processes cannot stop the kernel from running
Unprivileged processes cannot corrupt the kernel

Alice and Eve

p-alice: Prints Hi, I’m Alice! and yields
p-eve: Prints Hi, I’m Eve! and yields

Why protected control transfer?

To yield (or initiate any system interaction), a process must communicate with the kernel
- Processes are isolated
- Kernel mediates communication according to policy
Kernel communication is a security risk
- Unprivileged process must not simply gain privilege! (Could misuse that privilege)
- Unprivileged process must not execute arbitrary kernel code! (Could crash the machine)
Protected control transfer: transfer control across a privilege boundary
- “Control” refers to control flow—the order instructions execute

System call in depth

User calls wrapper function for system call
Wrapper function prepares registers, executes syscall instruction
Processor performs protected control transfer to kernel
- Switches privilege, starts executing kernel code at pre-configured address
Kernel entry point saves processor registers so process can be restarted
Kernel syscall function handles system call

Yielding in depth

Protected control transfer

Yielding in depth 1

1. p-alice calls sys_yield

kernel1/p-alice.cc:10

void process_main() {
    unsigned n = 0;
    while (true) {
        ++n;
        if (n % 1024 == 0) {
            console_printf(0x0F00, "Hi, I'm Alice! #%u\n", n);
        }
        sys_yield();    // <- ********
    }
}

Yielding in depth 2

2. sys_yield prepares registers, executes syscall instruction

kernel1/u-lib.cc:12

__noinline int sys_yield() {
    return make_syscall(SYSCALL_YIELD);
}

kernel1/u-lib.hh:18

__always_inline uintptr_t make_syscall(int syscallno) {
    register uintptr_t rax asm("rax") = syscallno;
    asm volatile ("syscall"
            : "+a" (rax)
            : /* all input registers are also output registers */
            : "cc", "memory", "rcx", "rdx", "rsi", "rdi", "r8", "r9",
              "r10", "r11");
    return rax;
}

obj/p-alice.asm

0000000000100ba0 <sys_yield()>:
  100ba0: f3 0f 1e fa           endbr64 
  100ba4: b8 02 00 00 00        mov    $0x2,%eax       ; `SYSCALL_YIELD` defined in `lib.hh`
  100ba9: 0f 05                 syscall 
  100bab: c3                    retq

Yielding in depth 3

3. Processor performs protected control transfer

Kernel configures processor with entry point for syscall instruction during boot
When a process executes the syscall instruction, the processor switches to privileged mode and the kernel takes over at that entry point

Why does `syscall` work this way?

Pre-configured entry point
- Kernel can harden that entry point
- Check all system call arguments carefully
- Like a city with thick walls and one fortified gate

Yielding in depth 4

4. Kernel entry point saves processor state, changes stack pointer to kernel stack

kernel1/k-exception.S:142

_Z13syscall_entryv:
        movq %rsp, KERNEL_STACK_TOP - 16 // save entry %rsp to kernel stack
        movq $KERNEL_STACK_TOP, %rsp     // change to kernel stack

        // structure used by `iret`:
        pushq $(SEGSEL_APP_DATA + 3)   // %ss
        subq $8, %rsp                  // skip saved %rsp
        pushq %r11                     // %rflags
        ...

        // call syscall()
        movq %rsp, %rdi
        call _Z7syscallP8regstate
        ...

Why switch to kernel stack?

Kernel memory is isolated from process memory
Kernel has its own call stack, functions, local variables
Worse, unprivileged process might have garbage stack pointer!
- Kernel cannot depend on process being well-behaved
- Kernel relies on memory it controls
- All accesses to process-controlled memory are checked
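
One way such a check might look is a range test that the kernel applies before touching any address supplied by a process. This is a sketch, assuming each process owns one contiguous region; `model_user_region` and `model_user_range_ok` are hypothetical names, not WeensyOS functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of the memory one process is allowed to touch.
struct model_user_region {
    uintptr_t start;   // first valid address
    uintptr_t end;     // one past the last valid address
};

// Return true iff [addr, addr + len) lies entirely inside `r`.
// The comparison is written to avoid computing `addr + len`, which could
// wrap around for a maliciously large `len`.
inline bool model_user_range_ok(const model_user_region& r,
                                uintptr_t addr, std::size_t len) {
    assert(r.start <= r.end);
    if (addr < r.start || addr > r.end) {
        return false;
    }
    return len <= r.end - addr;
}
```

The kernel would reject the system call (or the process) when the check fails, instead of dereferencing the pointer.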

Yielding in depth 5

5. syscall function in kernel runs; its argument, regs, contains a copy of all processor registers at the time of the system call

kernel1/kernel.cc:218

uintptr_t syscall(regstate* regs) {
    // Copy the saved registers into the `current` process descriptor.
    current->regs = *regs;
    regs = &current->regs;
    ...
    switch (regs->reg_rax) {
    case SYSCALL_YIELD:
        current->regs.reg_rax = 0;
        schedule();

Returning from a protected control transfer

Each system call has the same “callee”: the kernel
But the kernel may not return from the system call right away
- It might need to wait
- It might run another process first
- Not like a simple function call!
Kernel must save process state to kernel memory
- Saved state allows restarting process later

Process state

kernel1/kernel.hh:25

struct proc {
    x86_64_pagetable* pagetable;        // process's page table
    pid_t pid;                          // process ID
    int state;                          // process state (see above)
    regstate regs;                      // process's current registers
    // The first 4 members of `proc` must not change, but you can add more.
};

extern proc ptable[16];

To run process I, call run(&ptable[I])
current is a pointer to the proc that most recently ran
schedule runs a process that is not current
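
A round-robin choice of the next process could be sketched as follows. The names `model_proc`, `model_pick_next`, and the state constants are hypothetical stand-ins, and the policy is an assumption: search forward from the current process, and fall back to the current process only if nothing else is runnable.

```cpp
#include <cassert>

// Hypothetical process states; the values are assumptions for this sketch.
enum model_pstate { M_FREE = 0, M_RUNNABLE = 1, M_BLOCKED = 2 };

struct model_proc {
    int pid = 0;
    int state = M_FREE;
};

constexpr int MODEL_NPROC = 16;   // same size as `ptable` above

// Return the index of the next runnable process after `current`, searching
// round-robin, or -1 if nothing is runnable. `current` is considered last.
inline int model_pick_next(const model_proc (&table)[MODEL_NPROC], int current) {
    assert(current >= 0 && current < MODEL_NPROC);
    for (int step = 1; step <= MODEL_NPROC; ++step) {
        int i = (current + step) % MODEL_NPROC;
        if (table[i].state == M_RUNNABLE) {
            return i;
        }
    }
    return -1;
}
```

A real scheduler would then resume the chosen process from its saved registers rather than return an index.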

Kernel state note

In WeensyOS, every protected control transfer resets the kernel stack
- Local variables do not persist
The run and schedule functions effectively do not return
- Instead, the kernel eventually regains control from another system call or exception

Eve attacks

kernel1/p-eve.cc:11

        if (n % 1024 == 0) {
            console_printf(0x0E00, "Hi, I'm Eve! #%u\n", n);
            while (true) {}
        }

obj/p-eve.asm

  14004e: be 6d 0c 14 00        mov    $0x140c6d,%esi
  140053: bf 00 0e 00 00        mov    $0xe00,%edi
  140058: b8 00 00 00 00        mov    $0x0,%eax
  14005d: e8 d1 0a 00 00        callq  140b33 <console_printf(int, char const*, ...)>
  140062: eb fe                 jmp    140062 <process_main()+0x62>    ; ****
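
The bytes `eb fe` in the last line are a two-byte short jump: opcode `eb` followed by an 8-bit signed displacement, measured from the address of the next instruction. The arithmetic below shows why this jumps to itself. `short_jmp_target` is a hypothetical helper written for this note.

```cpp
#include <cassert>
#include <cstdint>

// Compute the target of a two-byte x86 short jump (`eb` plus an 8-bit signed
// displacement) located at `addr`. The displacement is relative to the
// address of the following instruction, which is `addr + 2`.
inline uint64_t short_jmp_target(uint64_t addr, uint8_t disp8) {
    const uint64_t next = addr + 2;   // the jump itself is 2 bytes long
    // Reinterpret the displacement byte as signed, then sign-extend it.
    const int64_t disp = static_cast<int8_t>(disp8);
    return next + static_cast<uint64_t>(disp);
}
```

With `disp8 = 0xfe` the displacement is -2, so the target is `addr + 2 - 2 = addr`: the instruction at `0x140062` jumps back to `0x140062` forever, which is the compiled form of `while (true) {}`.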

Defending against processor time attack

Eve is monopolizing processor time
Needs hardware defense
- Implementing software defense too expensive
Timer interrupt
- An “alarm clock” that goes off N times a second
- Passes control to kernel via a protected control transfer
- Only affects processor in unprivileged mode
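
The effect of a timer can be seen in a small simulation. This is a simplified model, not kernel code: `model_process` and `model_run` are hypothetical, each step stands for one instruction, and the uncooperative process never yields (a simplification of Eve, who yields until her first message).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: a process executes one instruction per step and
// reports whether that instruction was a voluntary yield.
struct model_process {
    bool cooperative;          // true: yields after every step (like Alice)
    unsigned long steps = 0;   // instructions executed so far
    explicit model_process(bool c) : cooperative(c) {}
    bool step() { ++steps; return cooperative; }
};

// Run `total_steps` instructions. If `timer_period` is nonzero, a timer
// "interrupt" forces a switch once a process has run that many steps in a row.
inline void model_run(std::vector<model_process>& procs,
                      unsigned long total_steps, unsigned long timer_period) {
    assert(!procs.empty());
    std::size_t cur = 0;
    unsigned long since_switch = 0;
    for (unsigned long t = 0; t < total_steps; ++t) {
        bool yielded = procs[cur].step();
        ++since_switch;
        bool timer_fired = timer_period != 0 && since_switch >= timer_period;
        if (yielded || timer_fired) {
            cur = (cur + 1) % procs.size();   // kernel picks the next process
            since_switch = 0;
        }
    }
}
```

With `timer_period` set to 0, the non-yielding process receives every step after the first switch. With a nonzero period, both processes keep making progress, although the non-yielding one still gets the larger share.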

Voluntary vs. involuntary protected control transfer

syscall: Voluntary control transfer to kernel
- Process code can save some registers
- Suitable for calling convention
- Kernel can modify some registers before returning (e.g., system call return value)
Timer interrupt: Involuntary control transfer to kernel
- Can happen after any instruction whatsoever
- Process code cannot delay or prevent interrupt
- No calling convention possible
- Kernel must save and restore all processor registers accessible to processes
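
The difference in what the kernel may change can be stated as two tiny functions. This is only an illustration: `model_regfile` is a hypothetical 16-entry register file in which index 0 plays the role of %rax.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical register file: index 0 plays the role of %rax.
using model_regfile = std::array<uint64_t, 16>;

// Voluntary transfer: the process asked for the call, so it expects %rax to
// be replaced by the system call's return value.
inline model_regfile model_resume_after_syscall(model_regfile saved,
                                                uint64_t retval) {
    saved[0] = retval;
    return saved;
}

// Involuntary transfer: the process did not ask for anything, so every
// register must come back exactly as it was when the interrupt happened.
inline model_regfile model_resume_after_interrupt(const model_regfile& saved) {
    return saved;
}
```

In the first case the process's own wrapper code has already accounted for registers that may change; in the second, any difference at all would look to the process like a register changing value between two instructions.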

Setting up timer interrupts in kernel

kernel1/kernel.cc:50

void kernel_start(const char* command) {
    // initialize hardware
    init_hardware();
    init_timer(100);    // 100 Hz ***

kernel1/kernel.cc:151

void exception(regstate* regs) {
    ...
    switch (regs->reg_intno) {
    case INT_IRQ + IRQ_TIMER:
        // handle timer interrupt
        lapicstate::get().ack();    // reset timer
        schedule();                 // run a different process
    ...
    }
}

Eve attacks kernel memory

uint8_t* ip = (uint8_t*) 0x4103c;   // address of `syscall` from `obj/kernel.sym`
ip[0] = 0xeb;
ip[1] = 0xfe;
(void) sys_getpid();