Kernel 2: Process isolation and virtual memory

Process isolation

General principle: processes interact only as allowed by OS policy.

What it means:

No memory clobbering!
Fair* sharing of machine resources!
Protection of processes from each other, and of the kernel from processes.

Recall that the kernel is the part of the OS that runs with full machine privilege. There exists a hardened interface between the user and kernel.

Things in the user land: shell, emacs, browser
Things in the kernel land: the buffer cache, access to hardware like disk and keyboard

But processes can certainly use this hardware and the buffer cache. The kernel provides such access through system calls.

The kernel also provides mechanisms other than system calls to enable controlled resource sharing with user-level processes. For example, memory, probably one of the most important resources in computing, is not shared through system calls -- a process need not make a system call to access memory. A process appears to have direct access to memory.

Virtualization and abstraction

Virtualization and abstraction are ways to provide a protected version of the interface.

In virtualization, the protected and unprotected (unprivileged and privileged) interfaces look the same.
Example: memory. Typically virtualization requires specialized hardware support.

In abstraction, the protected interface looks different from the privileged interface.
Example: the file system. Implementing abstractions usually does not require special hardware support.

Q: How to choose between virtualization and abstraction when designing an interface?

Answer: For interfaces that are accessed very frequently, like memory, prefer virtualization, because specialized hardware support makes them efficient. For higher-level interfaces that are used less frequently, abstraction is better because it doesn't need specialized hardware, and one implementation usually works on many different kinds of hardware.

Interrupts

Recall the evil program from the last lecture, the infinite loop attack. The infinite loop attack monopolizes one resource, time, and we need to limit how much of that resource a process can use in order to defeat the attack.

Resource: Time
Need: Limit time in a process
Solution: Add a timeout!

We need a clock in the computer, and we have one. It's called a timer. The timer hardware is configured to generate an "alarm" every millisecond. But what happens when an alarm goes off? Who should control the policy? The kernel! This means that when the alarm goes off, the kernel needs to take control. So whatever program the machine was running before, once the alarm goes off, that program gets interrupted and the kernel gets to run. This kind of control transfer from a user process to the kernel is called an interrupt.

An interrupt is a hardware-initiated event that suspends normal processing, saves the current processor state, and transfers control to the kernel. The interrupt saves enough state that the suspended execution can be resumed from right where it left off, as if it had never been suspended.

The kernel then has the freedom to choose the next step to take. It can resume execution of the suspended program, or it can pick another process to run. This is how the kernel implements the time sharing policy.
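
The time sharing policy described above can be sketched as a toy simulation. This is an illustrative Python model, not kernel code: the names run_with_timer, infinite_loop, finite_job, and TICKS_PER_SLICE are hypothetical, and the round-robin policy is an assumption (the notes only say the kernel may resume the interrupted process or pick another one).

```python
from collections import deque

# Hypothetical toy model: each "process" is a generator that yields once
# per timer tick of work; the loop below plays the role of the kernel.
TICKS_PER_SLICE = 3  # assumed time-slice length, in timer ticks


def infinite_loop():
    """The 'evil' process: never finishes on its own."""
    while True:
        yield


def finite_job(steps):
    """A well-behaved process that needs `steps` ticks of work."""
    for _ in range(steps):
        yield


def run_with_timer(processes, total_ticks):
    """Round-robin over `processes`, preempting after TICKS_PER_SLICE ticks.

    Returns a dict mapping each process name to the ticks it received.
    """
    ready = deque(processes.items())
    usage = {name: 0 for name in processes}
    ticks = 0
    while ready and ticks < total_ticks:
        name, proc = ready.popleft()
        finished = False
        for _ in range(TICKS_PER_SLICE):
            if ticks == total_ticks:
                break
            try:
                next(proc)  # the process runs until the next timer tick
            except StopIteration:
                finished = True  # the process exited by itself
                break
            usage[name] += 1
            ticks += 1
        if not finished:
            # Timer "interrupt": the kernel regains control and moves the
            # suspended process to the back of the ready queue.
            ready.append((name, proc))
    return usage


usage = run_with_timer({"evil": infinite_loop(), "job": finite_job(5)}, 20)
print(usage)  # {'evil': 15, 'job': 5}
```

Even though the evil process never gives up the CPU voluntarily, the well-behaved job still receives all 5 ticks it needs, because the simulated kernel takes control back on every timer alarm.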

Interrupts can be disabled using the cli instruction. A process that could disable interrupts could ignore the timer and keep the processor forever, so this instruction is kernel-only and cannot be executed by a user-level process. There are many such "dangerous" instructions that can only be executed by the kernel, and these restrictions are enforced by the processor hardware. But how does the processor know whether or not it was the kernel that executed a dangerous instruction? How does it know what's currently executing, the kernel or a user process?

On x86, the current privilege level (CPL) at which the processor operates is stored in the %cs register. The least significant 2 bits of the %cs register represent the privilege level:

0:     Kernel (privileged)
1,2,3: Unprivileged

In most cases only levels 0 and 3 are used. The dangerous instructions mentioned above can only be executed at level 0.
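
As a small illustration of the encoding above, the privilege level can be extracted from a %cs value with a 2-bit mask. This is a Python sketch of the check, not how the hardware is programmed; the function names and the two example %cs values are assumptions chosen for illustration.

```python
KERNEL_CPL = 0   # privilege level 0: kernel
CPL_MASK = 0b11  # the two least significant bits of %cs


def current_privilege_level(cs):
    """Return the privilege level encoded in a %cs value."""
    return cs & CPL_MASK


def may_execute_dangerous(cs):
    """Model the hardware check: only level 0 may run, e.g., cli."""
    return current_privilege_level(cs) == KERNEL_CPL


# Example %cs values (assumed for illustration only).
kernel_cs = 0x08  # low bits 00 -> level 0
user_cs = 0x1B    # low bits 11 -> level 3

print(current_privilege_level(kernel_cs), may_execute_dangerous(kernel_cs))
print(current_privilege_level(user_cs), may_execute_dangerous(user_cs))
```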

The %cs register can be read like other registers (for example, with a mov instruction), but it cannot be written freely: its value changes only through control-transfer instructions such as far jumps, far returns, and iret. Changing %cs can change the privilege level, so the processor checks these instructions and does not let unprivileged code use them to gain privilege.

When the system starts up, the processor is at privilege level 0. The kernel then runs and starts user-level processes in privilege level 3. The kernel's special privilege is established by the fact that it was the first program that gets to run once the hardware boots up.

Virtual memory

Virtual memory is a hardware mechanism used to isolate memory: it protects the kernel's memory from processes, and each process's memory from other processes. This means that an instruction in one process that attempts to access kernel memory or another process's memory should not succeed.

Most modern architectures provide this guarantee through a mechanism called the page table.

Page table

A page table can be thought of as a filter through which a process accesses memory. Every process has its own page table, and all of the process's memory accesses go through it. We will study a simple version of a page table before introducing the real x86 page table.

Basic requirements of a page table:

Distinguish access by privilege
- U (Unprivileged) bit: this part of the filter is OK for unprivileged access
Distinguish writes from reads
- W (Writable) bit: this part of the filter is OK for writes
It's also useful to simply have some part of memory be untouchable (not present)
- P (Present) bit: this part of the filter is OK to access
Allows arbitrary rearrangement (or aliasing) of memory

Our first page table assumes a 6-bit architecture, where pages are 8 bytes long. The following is a memory address in this architecture:

3 bits	3 bits
index	offset

Every address is divided into two parts: page index and page offset. The architecture has only 8 pages in total, each identified by an index. Within a page, there are 8 addresses, and each address can be referred to using the offset.
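
The index/offset split can be written as a shift and a mask. The following Python sketch assumes the 6-bit, 8-byte-page layout described above; the names split_address and join_address are hypothetical.

```python
PAGE_BITS = 3               # offset width: pages are 8 bytes long
PAGE_SIZE = 1 << PAGE_BITS  # 8


def split_address(addr):
    """Split a 6-bit address into (page index, page offset)."""
    assert 0 <= addr < 64, "addresses are 6 bits wide"
    return addr >> PAGE_BITS, addr & (PAGE_SIZE - 1)


def join_address(index, offset):
    """Rebuild an address from a page index and a page offset."""
    return (index << PAGE_BITS) | offset


print(split_address(0b101110))  # (5, 6): index 0b101, offset 0b110
```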

At a high level, the page table can be thought of as implementing the following mapping:

PT(virtual address, access type) --> physical address OR fault

The hardware performs the lookup in this mapping every time a memory access occurs.

In our 6-bit architecture, the lookup proceeds as follows (with virtual address va and access type at):

Start from physical address %cr3 (location of the page table)
Access physical memory at %cr3[va >> 3] (va >> 3 is the page index)
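Check access type, maybe fault, or return (%cr3[va >> 3] & ~7) | (va & 7) as the physical address (the low 3 bits of the entry hold permission bits, so they are masked off before the offset is added)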

The page table, which is just a single page in this architecture, stores page table entries (PTEs). Each page table entry contains information about the physical address and permission bits we mentioned above:

3 bits	1 bit	1 bit	1 bit
higher 3 bits of physical address	U	W	P
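
Putting the lookup steps and the entry layout together, the translation can be sketched as follows. This is a Python model under stated assumptions, not the hardware's actual implementation: the page table is a plain list of 8 entries, the names lookup and PageFault are hypothetical, and the W bit is assumed to apply to all writes regardless of privilege.

```python
# Flag bits, following the layout above (P is the least significant bit).
PTE_P = 0b001  # present
PTE_W = 0b010  # writable
PTE_U = 0b100  # unprivileged access allowed
PTE_FLAGS = PTE_P | PTE_W | PTE_U


class PageFault(Exception):
    """Raised when the page table does not allow the access."""


def lookup(page_table, va, write=False, user=True):
    """Translate virtual address `va`, or raise PageFault."""
    pte = page_table[va >> 3]  # va >> 3 is the page index
    if not pte & PTE_P:
        raise PageFault("page not present")
    if write and not pte & PTE_W:
        raise PageFault("page not writable")
    if user and not pte & PTE_U:
        raise PageFault("page not accessible to unprivileged code")
    # Drop the flag bits, then add the page offset.
    return (pte & ~PTE_FLAGS) | (va & 7)


# Example table (made up for illustration).
page_table = [0] * 8                                  # nothing present
page_table[1] = 0b101_000 | PTE_P | PTE_W | PTE_U     # page 1 -> page 5
page_table[2] = 0b011_000 | PTE_P | PTE_U             # read-only for users
page_table[7] = 0b000_000 | PTE_P | PTE_W             # kernel-only

print(lookup(page_table, 0b001_010))  # 42, i.e. 0b101_010
```

With this table, an unprivileged write to virtual page 2, any unprivileged access to virtual page 7, and any access to virtual page 0 all raise PageFault.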

When a memory access is issued by a different process, the %cr3 register will hold a different value, and the lookup will start from a different location (with a different page table). Accesses to the same virtual address in different processes can end up at different physical addresses because the mapping is different.
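
The effect of switching %cr3 can be illustrated with a small Python model in which the page tables themselves live in simulated physical memory. The layout below (process A's page table in physical page 6, process B's in physical page 7) and the names translate, CR3_A, and CR3_B are assumptions made for the example; only the present bit is checked, to keep the sketch short.

```python
PTE_P, PTE_W, PTE_U = 0b001, 0b010, 0b100

phys = bytearray(64)  # all of physical memory: 8 pages of 8 bytes


def translate(cr3, va):
    """Translate `va` using the page table at physical address `cr3`."""
    pte = phys[cr3 + (va >> 3)]  # one-byte entry per virtual page
    if not pte & PTE_P:
        raise LookupError(f"fault at virtual address {va:#04x}")
    return (pte & 0b111000) | (va & 7)


# Hypothetical layout: one page table per process.
CR3_A, CR3_B = 6 * 8, 7 * 8
flags = PTE_P | PTE_W | PTE_U
phys[CR3_A + 1] = (2 << 3) | flags  # A: virtual page 1 -> physical page 2
phys[CR3_B + 1] = (3 << 3) | flags  # B: virtual page 1 -> physical page 3

va = 0b001_100  # virtual page 1, offset 4
pa_a = translate(CR3_A, va)
pa_b = translate(CR3_B, va)
print(pa_a, pa_b)  # 20 28
```

The same virtual address reaches physical address 20 in process A and 28 in process B; the only thing that changed between the two lookups is the simulated %cr3 value.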