Overview
In lecture, we investigate our own kernel, discuss kernel isolation, and walk through a protected control transfer.
Our kernel: WeensyOS
- Boot files: boot*
- Kernel files: kernel.cc, kernel.hh, k-*
- User-mode files: u-*, p-*
- Library files (shared): lib*, x86-64.h, types.h
WeensyOS commands
make run
make kill
make run-gdb + gdb -ix build/weensyos.gdb
Emulation
- WeensyOS runs inside a machine emulator called QEMU
- QEMU is a software program that can behave like a complete x86-64 computer system
- Interprets instructions, translates input/output to a more convenient format
- For example, instead of generating electrical signals that could be sent to a television (like the original IBM Personal Computer), QEMU’s emulated display hardware can produce commands understood by the terminal
- QEMU has many other uses too
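The "interprets instructions" idea can be sketched with a toy interpreter. This is a minimal illustration, not QEMU code: the four-opcode instruction set, `ToyOp`, and `toy_emulate` are all invented for this example. The `TOY_OUT` instruction stands in for display hardware, and the emulator translates it into characters appended to a string, loosely analogous to QEMU turning display output into terminal commands.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical toy instruction set (not x86-64).
enum ToyOp : uint8_t { TOY_HALT = 0, TOY_LOAD = 1, TOY_ADD = 2, TOY_OUT = 3 };

// Interpret `program` one instruction at a time and return everything the
// program "displayed". TOY_LOAD and TOY_ADD take a one-byte operand.
std::string toy_emulate(const std::vector<uint8_t>& program) {
    std::string terminal;   // stands in for the host terminal
    uint8_t acc = 0;        // the toy machine's only register
    size_t pc = 0;          // index of the next instruction
    while (pc < program.size()) {
        uint8_t op = program[pc++];
        if (op == TOY_HALT) {
            break;
        } else if (op == TOY_OUT) {
            // Emulated "display hardware": translate to terminal output.
            terminal.push_back(char(acc));
        } else {
            assert(op == TOY_LOAD || op == TOY_ADD);
            assert(pc < program.size());
            uint8_t imm = program[pc++];
            acc = (op == TOY_LOAD) ? imm : uint8_t(acc + imm);
        }
    }
    return terminal;
}
```

For instance, the program `{TOY_LOAD, 'H', TOY_OUT, TOY_ADD, 1, TOY_OUT, TOY_HALT}` produces the string `HI`.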
Booting: How a computer starts up
- Computer turns on
- Built-in hardware initializes the system
- Built-in hardware loads a small program called the boot loader from a fixed location on attached storage (Flash memory, disk)
- Operating in a very constrained environment
- Boot loader initializes the processor and loads the kernel
Watching the boot sequence
make run-gdb STOP=1
Kernel isolation
- The kernel always retains full privilege over all machine operations
- Unprivileged processes cannot stop the kernel from running
- Unprivileged processes cannot corrupt the kernel
Alice and Eve
- p-alice: Prints “Hi, I’m Alice!” and yields
- p-eve: Prints “Hi, I’m Eve!” and yields
Why protected control transfer?
- To yield (or initiate any system interaction), a process must communicate with the kernel
- Processes are isolated
- Kernel mediates communication according to policy
- Kernel communication is a security risk
- Unprivileged process must not simply gain privilege! (Could misuse that privilege)
- Unprivileged process must not execute arbitrary kernel code! (Could crash the machine)
- Protected control transfer: transfer control across a privilege boundary
- “Control” refers to control flow—the order instructions execute
System call in depth
- User calls wrapper function for system call
- Wrapper function prepares registers, executes syscall instruction
- Processor performs protected control transfer to kernel
- Switches privilege, starts executing kernel code at pre-configured address
- Kernel entry point saves processor registers so process can be restarted
- Kernel syscall function handles system call
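The steps above can be modeled in ordinary user-space C++. This is a rough sketch under stated assumptions, not WeensyOS code: `ToyMachine`, `ToyRegs`, the `TOY_SYS_*` numbers, and the fixed process ID are all hypothetical, and the model returns to the process immediately, which a real kernel need not do.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the registers a process can see.
struct ToyRegs {
    uint64_t rax = 0;   // system call number on entry, return value on exit
    uint64_t rdi = 0;   // first argument (unused in this sketch)
};

enum : uint64_t { TOY_SYS_GETPID = 1, TOY_SYS_YIELD = 2 };  // illustrative

struct ToyMachine {
    using Handler = uint64_t (*)(ToyMachine&, const ToyRegs&);

    bool privileged = true;     // the machine boots in privileged mode
    Handler entry = nullptr;    // pre-configured kernel entry point
    ToyRegs saved;              // kernel-owned copy of process registers
    uint64_t current_pid = 7;   // made-up process ID

    // Only privileged code may configure the entry point.
    bool set_entry(Handler h) {
        if (!privileged) {
            return false;
        }
        entry = h;
        return true;
    }

    // Models the `syscall` instruction: the process chooses *when* to enter
    // the kernel, but not *where* kernel execution starts.
    void syscall_instruction(ToyRegs& regs) {
        assert(entry != nullptr);
        privileged = true;                  // switch privilege
        saved = regs;                       // entry point saves registers
        saved.rax = entry(*this, saved);    // kernel handles the call
        regs = saved;                       // resume process, result in rax
        privileged = false;                 // drop privilege again
    }
};

// Toy kernel `syscall` function: dispatch on the saved system call number.
uint64_t toy_kernel_syscall(ToyMachine& m, const ToyRegs& regs) {
    switch (regs.rax) {
    case TOY_SYS_GETPID:
        return m.current_pid;
    case TOY_SYS_YIELD:
        return 0;
    default:
        return uint64_t(-1);    // unknown system call
    }
}

// Toy user-level wrapper: prepare registers, execute the "instruction".
uint64_t toy_sys_getpid(ToyMachine& m) {
    ToyRegs regs;
    regs.rax = TOY_SYS_GETPID;
    m.syscall_instruction(regs);
    return regs.rax;
}
```

In this model, the "kernel" calls `set_entry(toy_kernel_syscall)` once during boot; after the machine drops to unprivileged mode, a later `set_entry` attempt fails, while `toy_sys_getpid` still reaches the kernel through the configured entry point.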
Yielding in depth
Yielding in depth 1
1. p-alice calls sys_yield
void process_main() {
    unsigned n = 0;
    while (true) {
        ++n;
        if (n % 1024 == 0) {
            console_printf(0x0F00, "Hi, I'm Alice! #%u\n", n);
        }
        sys_yield(); // <- ********
    }
}
Yielding in depth 2
2. sys_yield prepares registers, executes syscall instruction
__noinline int sys_yield() {
    return make_syscall(SYSCALL_YIELD);
}
__always_inline uintptr_t make_syscall(int syscallno) {
    register uintptr_t rax asm("rax") = syscallno;
    asm volatile ("syscall"
            : "+a" (rax)
            : /* all input registers are also output registers */
            : "cc", "memory", "rcx", "rdx", "rsi", "rdi", "r8", "r9",
              "r10", "r11");
    return rax;
}
obj/p-alice.asm
0000000000100ba0 <sys_yield()>:
100ba0: f3 0f 1e fa endbr64
100ba4: b8 02 00 00 00 mov $0x2,%eax ; `SYSCALL_YIELD` defined in `lib.hh`
100ba9: 0f 05 syscall
100bab: c3 retq
Yielding in depth 3
3. Processor performs protected control transfer
- Kernel configures processor with entry point for syscall instruction during boot
- When syscall instruction happens, machine switches privilege modes, kernel takes over
Why does syscall work this way?
- Pre-configured entry point
- Kernel can harden that entry point
- Check all system call arguments carefully
- Like a city with thick walls and one fortified gate
Yielding in depth 4
4. Kernel entry point saves processor state, changes stack pointer to kernel stack
_Z13syscall_entryv:
movq %rsp, KERNEL_STACK_TOP - 16 // save entry %rsp to kernel stack
movq $KERNEL_STACK_TOP, %rsp // change to kernel stack
// structure used by `iret`:
pushq $(SEGSEL_APP_DATA + 3) // %ss
subq $8, %rsp // skip saved %rsp
pushq %r11 // %rflags
...
// call syscall()
movq %rsp, %rdi
call _Z7syscallP8regstate
...
Why switch to kernel stack?
- Kernel memory is isolated from process memory
- Kernel has its own call stack, functions, local variables
- Worse, unprivileged process might have garbage stack pointer!
- Kernel cannot depend on process being well-behaved
- Kernel relies on memory it controls
- All accesses to process-controlled memory are checked
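A checked access might look roughly like the following. This is a sketch only: the address bounds and the names `toy_user_range_ok` and `toy_sys_write` are assumptions made for illustration and do not describe WeensyOS's real memory layout or system calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical bounds of process-accessible memory (illustration only).
constexpr uintptr_t TOY_USER_START = 0x100000;
constexpr uintptr_t TOY_USER_END = 0x200000;

// Return true if [addr, addr + len) lies entirely inside the user range.
bool toy_user_range_ok(uintptr_t addr, size_t len) {
    if (addr < TOY_USER_START || addr > TOY_USER_END) {
        return false;
    }
    // Compare against the remaining space instead of computing
    // `addr + len`, which could wrap around for a huge `len`.
    return len <= TOY_USER_END - addr;
}

// Example kernel-side handler: validate first, and reject the request
// rather than trust the process.
int toy_sys_write(uintptr_t buf, size_t len) {
    if (!toy_user_range_ok(buf, len)) {
        return -1;
    }
    // ... a real kernel would now read `len` bytes starting at `buf` ...
    return 0;
}

// Usage example with addresses that appear in these notes.
void toy_spot_checks() {
    assert(toy_user_range_ok(0x100ba0, 4));   // inside p-alice's code
    assert(!toy_user_range_ok(0x4103c, 2));   // a kernel address
}
```

The overflow-safe comparison matters because the length, like the pointer, comes from the process and may be chosen maliciously.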
Yielding in depth 5
5. syscall function in kernel runs; its argument, regs, contains a copy of all processor registers at the time of the system call
uintptr_t syscall(regstate* regs) {
    // Copy the saved registers into the `current` process descriptor.
    current->regs = *regs;
    regs = &current->regs;
    ...
    switch (regs->reg_rax) {
    case SYSCALL_YIELD:
        current->regs.reg_rax = 0;
        schedule();
Returning from a protected control transfer
- Each system call has the same “callee”: the kernel
- But the kernel may not return from the system call right away
- It might need to wait
- It might run another process first
- Not like a simple function call!
- Kernel must save process state to kernel memory
- Saved state allows restarting process later
Process state
struct proc {
    x86_64_pagetable* pagetable;   // process's page table
    pid_t pid;                     // process ID
    int state;                     // process state (see above)
    regstate regs;                 // process's current registers
    // The first 4 members of `proc` must not change, but you can add more.
};
extern proc ptable[16];
- To run process I, call run(&ptable[I])
- current is a pointer to the proc that most recently ran
- schedule runs a process that is not current
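One way a scheduler could choose the next process is a round-robin scan of the process table. The sketch below is an assumption-laden illustration, not the WeensyOS `schedule` function: `ToyProc`, the state constants, and `toy_pick_next` are hypothetical names, and falling back to the current process when nothing else is runnable is a design choice made here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical process states and descriptor (illustration only).
enum ToyState { TOY_FREE = 0, TOY_RUNNABLE = 1 };
struct ToyProc {
    int pid;
    int state;
};
constexpr size_t TOY_NPROC = 16;

// Scan the table round-robin, starting just after `current`. Return the
// index of the first runnable process found; `current` itself is checked
// last. Return -1 if no process is runnable.
int toy_pick_next(const ToyProc (&table)[TOY_NPROC], size_t current) {
    assert(current < TOY_NPROC);
    for (size_t step = 1; step <= TOY_NPROC; ++step) {
        size_t i = (current + step) % TOY_NPROC;
        if (table[i].state == TOY_RUNNABLE) {
            return int(i);
        }
    }
    return -1;
}
```

With slots 1 and 2 runnable, `toy_pick_next(table, 1)` returns 2 and `toy_pick_next(table, 2)` returns 1, so the two processes alternate, much as Alice and Eve do when both yield.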
Kernel state note
- In WeensyOS, every protected control transfer resets the kernel stack
- Local variables do not persist
- The run and schedule functions effectively do not return
- Instead, the kernel eventually regains control from another system call or exception
Eve attacks
if (n % 1024 == 0) {
    console_printf(0x0E00, "Hi, I'm Eve! #%u\n", n);
    while (true) {}
}
obj/p-eve.asm
14004e: be 6d 0c 14 00 mov $0x140c6d,%esi
140053: bf 00 0e 00 00 mov $0xe00,%edi
140058: b8 00 00 00 00 mov $0x0,%eax
14005d: e8 d1 0a 00 00 callq 140b33 <console_printf(int, char const*, ...)>
140062: eb fe jmp 140062 <process_main()+0x62> ; ****
Defending against processor time attack
- Eve is monopolizing processor time
- Needs hardware defense
- Implementing software defense too expensive
- Timer interrupt
- An “alarm clock” that goes off N times a second
- Passes control to kernel via a protected control transfer
- Only affects processor in unprivileged mode
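The effect of the timer can be seen in a small simulation. This is a toy model under stated assumptions, not kernel code: `ToyProcess` and `toy_run` are hypothetical, one loop iteration stands for one unit of processor time, and a "polite" process yields after every step while a "greedy" one never yields.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy process: polite ones yield after each step.
struct ToyProcess {
    bool yields;
    unsigned long steps_run;
};

// Run `total_steps` units of processor time. A timer fires every
// `timer_period` steps and forces a switch; a period of 0 means no timer.
void toy_run(std::vector<ToyProcess>& procs, unsigned long total_steps,
             unsigned long timer_period) {
    assert(!procs.empty());
    size_t current = 0;
    unsigned long since_tick = 0;
    for (unsigned long t = 0; t < total_steps; ++t) {
        ++procs[current].steps_run;
        ++since_tick;
        bool timer_fired = timer_period != 0 && since_tick >= timer_period;
        if (procs[current].yields || timer_fired) {
            // Voluntary yield or involuntary preemption: run someone else.
            current = (current + 1) % procs.size();
        }
        if (timer_fired) {
            since_tick = 0;     // acknowledge the timer
        }
    }
}
```

With one polite and one greedy process and no timer, the polite process runs a single step and never runs again. With a timer, the greedy process is preempted regularly and the polite one keeps making progress.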
Voluntary vs. involuntary protected control transfer
- syscall: Voluntary control transfer to kernel
- Process code can save some registers
- Suitable for calling convention
- Kernel can modify some registers before returning (e.g., system call return value)
- Timer interrupt: Involuntary control transfer to kernel
- Can happen after any instruction whatsoever
- Process code cannot delay or prevent interrupt
- No calling convention possible
- Kernel must save and restore all processor registers accessible to processes
Setting up timer interrupts in kernel
void kernel_start(const char* command) {
    // initialize hardware
    init_hardware();
    init_timer(100); // 100 Hz ***
    ...
}

void exception(regstate* regs) {
    ...
    switch (regs->reg_intno) {
    case INT_IRQ + IRQ_TIMER:
        // handle timer interrupt
        lapicstate::get().ack(); // reset timer
        schedule(); // run a different process
    }
    ...
}
Eve attacks kernel memory
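uint8_t* ip = (uint8_t*) 0x4103c; // address of `syscall` from `obj/kernel.sym`
ip[0] = 0xeb; // opcode of a short `jmp`
ip[1] = 0xfe; // displacement -2: the `jmp` targets itself
(void) sys_getpid(); // enter the kernel, which now runs the overwritten code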
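The two bytes Eve writes are the same `eb fe` encoding that `obj/p-eve.asm` shows for her own infinite loop. The arithmetic can be checked with a small helper; `short_jmp_target` is a hypothetical name introduced here, and it handles only the two-byte short jump form.

```cpp
#include <cassert>
#include <cstdint>

// Compute the target of an x86-64 short jump: opcode 0xEB followed by a
// signed 8-bit displacement, relative to the address of the *next*
// instruction (the jump itself is 2 bytes long).
uint64_t short_jmp_target(uint64_t insn_addr, uint8_t opcode, uint8_t disp) {
    assert(opcode == 0xEB);
    uint64_t next = insn_addr + 2;
    // Sign-extend the displacement: 0xFE becomes -2.
    return next + uint64_t(int64_t(int8_t(disp)));
}
```

For the instruction at `0x140062` in `p-eve`, the target is `0x140062 + 2 - 2 = 0x140062`, the instruction itself. The same bytes written at `0x4103c` would make the start of the kernel's `syscall` function jump to itself.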