Kernel 3: Real VM

Remember that the goal of virtual memory is to give each process an independent and isolated view of memory.

In abstract terms, virtual memory introduces a layer of indirection between the addresses that programs use, and the addresses wired in to the machine’s physical memory chips. You can model this layer of indirection as a function:

vmlookup(virtual_address, access_type) -> physical_address OR fault;

The processor executes this function every time it accesses memory. This is not done in software, it’s done in hardware. So given an instruction like movl (%rsp), %rax, the processor first executes the equivalent of vmlookup(%rsp, ACCESS_IS_READ) to find the actual physical address of the memory to load. All pointer variables in valid C programs on x86-64 contain virtual addresses, and all addresses stored in machine registers are virtual addresses (with one exception; see below).

The mechanism used by many modern processors, including x86, x86-64, and ARM, to implement virtual memory is hardware page tables. A hardware page table is a structure somewhat like on a radix tree, stored in physical memory, and used by the processor’s memory management unit to translate virtual addresses to physical addresses. Software sets up a page table in memory and loads the page table’s physical address into a special-purpose machine register, %cr3. Then, after that, the hardware walks the page table to translate all future virtual address lookups to physical addresses.

(Note that processors have some modes where virtual memory is turned off. x86-64 chips, for example, start off with no memory protection; it’s the job of the boot loader, and/or initial parts of the kernel, to turn on virtual memory by supplying an initial page table.)

Paged virtual memory in x86-64

(This material is discussed in depth in CS:APP3e Chapter 9, especially §9.6-9.7.)

x86-64 uses a multi-level page table structure to translate virtual to physical addresses. Specifically, it uses a four-level page table. Every vmlookup operation requires walking over up to 4 different aligned pages of memory to find the final address.

An x86-64 virtual address contains 48 non-redundant bits of address data. (Bits 48 through 63—the top 16 bits—must hold copies of bit 47. That is, the valid x86-64 addresses, considered as signed numbers, range from -140737488355328 through 140737488355327, inclusive; or in hex, 0xFFFF800000000000 to 0x00007FFFFFFFFFFF, inclusive.) The x86-64 VM system divides this data into five regions, four indexes and an offset, as follows.

      47     39 38     30 29     21 20     12 11         0
...--+---------+---------+---------+---------+------------+
     |wwwwwwwww|xxxxxxxxx|yyyyyyyyy|zzzzzzzzz|aaaaaaaaaaaa|
...--+---------+---------+---------+---------+------------+
        9 bits    9 bits    9 bits    9 bits     12 bits
          |         |         |         |           |
          |         |         |         |           \--------> Offset         (PAGEOFFSET(va))
          |         |         |         |
          |         |         |         \--------------------> Level 4 index  (L4PAGEINDEX(va))
          |         |         |
          |         |         \------------------------------> Level 3 index  (L3PAGEINDEX(va))
          |         |
          |         \----------------------------------------> Level 2 index  (L2PAGEINDEX(va))
          |
          \--------------------------------------------------> Level 1 index  (L1PAGEINDEX(va))

To implement vmlookup translation, the processor uses these indexes in turn, from most-significant to least-significant, to look up page table entries in a series of page table pages. The page table as a whole is the collection of page table pages referenced from the top-level, level-1 page table page. Each page table page consists of an array of entries (on x86-64, a page table page comprises 512 entries). And each entry specifies a physical address saying where to go next. Level-1 entries specify the physical addresses of level-2 page table pages; level-2 entries specify the physical addresses of level-3 page table pages; level-3 entries specify the physical addresses of level-4 page table pages; and level-4 entries specify the physical addresses of destination physical pages, where the actual memory resides.

Here’s how the processor implements a virtual address lookup.

First, the processor finds CURPT, the physical address of the current level-1 page table page, by reading the value of %cr3. CURPT (and therefore %cr3) is a physical address that’s aligned to a multiple of 4096 bytes: its lower-order 12 bits are always zero.
The processor treats the memory at physical address CURPT as an array of 512 8-byte units. If the virtual address’s level-1 index is I1, it will look up entry CURPT[I1]. Call this entry E.
The processor checks if entry E allows this access. If it does not allow it, the processor faults. Otherwise, E contains a physical address of the level-2 page table page, which the processor stores in CURPT.
The processor repeats these steps for levels 2 through 4.
1. It sets E := CURPT[I2], where I2 is the level-2 index.
2. If E disallows the access, it faults; otherwise it sets CURPT to the level-3 page table page physical address.
3. It sets E := CURPT[I3], where I3 is the level-3 index.
4. If E disallows the access, it faults; otherwise it sets CURPT to the level-4 page table page physical address.
5. It sets E := CURPT[I4], where I4 is the level-4 index.
6. If E disallows the access, it faults; otherwise it sets CURPT to the destination physical page physical address.
Now CURPT is an address of the destination physical page, which is treated as an array of bytes (not an array of 8-byte words). The processor adds the offset—the least-significant 12 bits of the virtual address—to this address to obtain the final physical address of the memory in question.

Sizes and alignments

So why these particular sizes and alignments? Why are indexes 9 bits and offsets 12?

Virtual memory architectures tend to fit together like puzzles. The key to the puzzle is the page size: the fundamental unit in which memory is allocated. On x86-64, like x86-32, the page size is 4096, or 2¹², bytes. Everything else follows from this choice.

Since the page size is 2¹² bytes, the offset is 12 bits long.
Each page table page must be large enough to fit one entry per possible index; and additionally, it is extremely convenient for page table pages to have the same size as normal pages. The page size is 4096 bytes, and the pointer size (which is the same as the physical address size) is 8 bytes. That means a page can hold 4096/8 = 512 = 2⁹ uncompressed entries. Thus, each index should be at most 9 bits long.

Page table entries must also support flags that help the VM system determine whether an entry allows an access. Since all memory pages are aligned—every physical page address is a multiple of 2¹²—the least-significant 12 bits of each page address must be zero. Rather than require the user to store zero explicitly, though, the virtual memory system implicitly sets the bits to zero when loading them, and allows entries to use that space for flags. The upper 16 bits (bits 48–63) are also available for flags.

Flags

The x86-64 uses flag bits in each entry to control whether an access is allowed. If an access is disallowed by flags, then the hardware will fault.

The important bits are:

PTE_P (value 1): This entry is Present. If this bit is absent, then every access is disallowed.
PTE_W (value 2): This entry is Writable. If this bit is absent, then write accesses are disallowed.
PTE_U (value 4): This entry is Unprivileged. If this bit is absent, then accesses by unprivileged code are disallowed (the processor must have CPL 0 to access this address).
PTE_NX (value 9223372036854775808 = 2⁶³): This entry is Non-Executable. If this bit is absent, then instructions located at these addresses cannot be executed. (The book calls this XD.)

There are others; look at the book, or the x86-64 architecture manual, to learn more.

This also tells us what information is used by the memory system for the “access_type” argument to the abstract vmlookup function; specifically: (1) Is the access a read, a write, or an instruction fetch? and (2) Was the access made in unprivileged mode or privileged mode?

Bitwise arithmetic

From the diagram above, you should be able to figure out how to implement the macros PAGEOFFSET, L1PAGEINDEX, etc. that extract components from a virtual address. Here's how:

L1PAGEINDEX(addr) == ((addr >> 39) & 0x1FF)
L2PAGEINDEX(addr) == ((addr >> 30) & 0x1FF)
L3PAGEINDEX(addr) == ((addr >> 21) & 0x1FF)
L4PAGEINDEX(addr) == ((addr >> 12) & 0x1FF)
PAGEOFFSET(addr)  == (addr & 0xFFF)

Page table sizes

Page tables are generally trees, but the processor does not require this. The processor implements the algorithm above, walking down the page table level by level, without regard to the “type” of page (it’s all just memory). For instance, consider the following level-1 page table page, stored at physical address 0:

%cr3      +---------------------+
0x0 ----> | 0x0 + PTE_P + PTE_W | (index 0)
          | 0x0 + PTE_P + PTE_W | (index 1)
          | 0x0 + PTE_P + PTE_W | (index 2)
          | 0x0 + PTE_P + PTE_W | (index 3)
          | 0x0 + PTE_P + PTE_W |
          | 0x0 + PTE_P + PTE_W |
          |         ...         |
          | 0x0 + PTE_P + PTE_W |
          | 0x0 + PTE_P + PTE_W | (index 511)
          +---------------------+

This page table page produces a valid page table, even though every entry points back at the page table itself! When there is an access, the processor will successively treat the page at physical address 0 as a level-1 PTP (i.e., page table page), a level-2 PTP, a level-3 PTP, a level-4 PTP, and finally a destination physical page. This page table takes a single page of physical memory to represent, and it allows access to only a single page of physical memory.

Typically, however, we do not see level-2, level-3, and level-4 PTPs that are also used as other levels; each PTP has a natural level and it is accessed only as that level. That means that the realistic minimum size of an x86-64 page table that allows access to at least one page of memory is 4 pages, one each for each level.

It is typical for PTPs to be treated as both PTPs and as destination physical pages. The kernel must read and write PTPs in order to set up page tables! So, typically, every page table will allow access to itself—i.e., given a page table PT containing a page table page X, there exists a virtual address V so that PT maps V to the physical address of X.

The maximum size for a page table is when no two PTPs are shared and every PTP entry exists. Such a page table will have 1 level-1 PTP, 1×2⁹ = 512 level-2 PTPs, 512×2⁹ = 262144 level-3 PTPs, and 262144×2⁹ = 134217728 level-4 PTPs, for a total of 134480385 PTPs, or 550831656960 bytes (0x8040201000 bytes) of memory. That’s a lot of memory.

But very few page tables actually need to reach this size. Several techniques can be used to build a page table that allows access to a lot of memory without eating a lot of memory itself.

When a contiguous range of virtual addresses is inaccessible, the relevant level-1 (or level-2, or level-3) entries can be marked as not present (e.g., set to 0, without PTE_P). There is no need to construct level-2-4 page tables underneath not-present entries. For example, there’s a lot of vacant, inaccessible address space between global data and the heap, or between the heap and the stack; and that space doesn’t require page tables.
Often several page tables can share level-2-4 PTPs for the kernel. The kernel must exist in every address space so it can handle exceptions. In modern OSes, this is implemented by storing the entire kernel in an address range that’s shared by all processes. Since the address mappings are the same for all processes (which is safe since the mappings are kernel-only: PTE_U is off), the relevant page table pages can be shared as well. (In 32-bit architectures, there was sometimes a need for a kernel that could “jump around” from virtual address to virtual address, but in 64-bit architectures there’s enough address space that this is not necessary.)
The hardware itself offers superpage mechanisms that can convert a level-3 entry into a 2MB “destination physical superpage”, or a level-2 entry into a 1GB “destination physical megapage”, using the PTE_PS bit. (The “superpage” terminology is standard; “megapage” is not.) Using megapages, 512GB of memory can be accessed using just 2 PTPs (one level-1 and one level-2).

We can see there’s a big difference between the minimum size for a realistic page table (i.e., 4 PTPs) and the maximum size for a page table (134480385 PTPs). This is actually an advantage, and it explains why real VM systems often use this kind of a structure: typical address space layouts for processes can be expressed using relatively little memory for PTPs, but there’s enough flexibility to implement any structure.

Contrast this with a more limited page table design—say, one where offsets are 26 bits and there’s only one 22-bit index. This single-level page table has maximum size 2²⁶ = 67108864 = 0x4000000 bytes, far smaller than x86-64. But it also has minimum size 2²⁶!

Page table isolation

The big advantage of virtual memory for processes is that it lets us implement process isolation, where each process is restricted to accessing only memory it owns. We can see, though, that x86-64 hardware page tables offer a very limited permission structure: an access is either privileged or unprivileged, that’s it. But we can still implement very sophisticated virtual memory isolation using page tables, by giving different page tables to different processes! These page tables can share some substructure—and on real operating systems they will (they’ll share kernel-only page tables, for example)—but at least the level-1 PTPs must always differ.

Caching

Virtual memory lookup is complicated. To perform a translation, the hardware must perform at least 4 memory lookups! This sounds slow and it is. The solution is…a cache. The machine’s memory management unit caches virtual-to-physical mappings in a structure called the translation lookaside buffer, or TLB. The TLB is hardware-managed, but unlike the processor cache, it is not coherent: a value in the TLB can be older than the most up-to-date value from slower storage. For instance, this can happen if the memory representing the active page table changes. The lcr3 instruction can be used to flush the TLB and refresh its mappings.