2017/Section8

From CS61
Jump to: navigation, search

Introduction

Today’s section dives into a version of the weensy OS we use in class and in your problem set 4. We’ll explore:

  1. How an operating system boots on (virtual) hardware
  2. Processes, the core abstraction in operating systems, and how the operating system leverages hardware features to manage processes
  3. Inline assembly in our C code, which lets us switch away from the C abstract machine and directly write instructions
  4. Virtual memory as a mechanism with which the operating system isolates processes from each other

Note: this will not likely work except on Linux (either a host or a Linux virtual machine).

Code

Update your cs61-sections repository with git pull origin master and cd into the s8 directory. This code contains our operating system Quotes OS. The job of the Quotes OS is to run a single process that prints out quotes.

There are several ways to run the OS:

  • make run: Build the OS and pop up a QEMU window to run it. Close the QEMU window to exit the OS.
  • make run-console: Build the OS and run QEMU in the current terminal window. Press Control-C in the terminal to exit the OS.
  • make run-gdb: Build the OS and run it in a QEMU window under GDB. A QEMU window pops up and the OS starts running, but the virtual machine stops for debugging when gdb connects. You can set breakpoints on OS or process functions (for instance, try b kernel to break at the kernel's entry point) and then type c to run the virtual machine until a breakpoint hits. Use si, x, info reg, etc. to view machine state. We compile the kernel with debugging symbols; this means even commands like next will work, and you can see the source lines corresponding to individual instructions. Close the QEMU window to exit the OS, and enter q in gdb to exit gdb.
  • make run-gdb-console: Same as make run-gdb, but runs QEMU in the current terminal window. You must run gdb yourself in a different terminal window. Run gdb -x .gdbinit from this directory.

In all of these run modes, QEMU also creates a file named log.txt. The code we hand out doesn't actually log anything yet, but you may find it useful to add your own calls to log_printf from the kernel.

Finally, run make clean to clean up your directory.

OS organization

There are many types of operating system architectures but in general, operating systems are divided into three logically distinct parts.

  1. Though typically not considered part of the operating system, the boot loader plays a short but critical part in your computer. It is a tiny bit of code (less than 510 bytes!!) that runs when a machine first boots. The boot loader is loaded in from a specific location (address) in some persistent storage like a hard disk. Its job is to set the machine to the right mode and then load the rest of the kernel, which can be a lot bigger.

    There can be multiple boot loaders stored in different locations. When a computer first turns on, it executes some hard-wired instructions in the “BIOS”—the Basic Input/Output System—that run some self-checks and initialize basic hardware. Then the BIOS examines many of the storage devices connected to the machine, in order. For each disk, it loads the first 512 bytes, then checks to see if those bytes contain a boot loader. It runs the first one it finds.

  2. The kernel is the core of the operating system. Its main goal is to enforce fast, fair, safe sharing of machine resources. The main mechanism it uses is the system call abstractions on which all other programs rely.
  3. Applications that do not run in kernel mode run in unprivileged user mode. All the programs you run are user-mode applications.

In the problem set you will mostly be modifying kernel code.

The weensy OS

The boot loader consists of two files:

  • bootstart.S: This assembly file carries the machine forward through the decades. When an x86-compatible machine first boots, it is in “real mode,” which corresponds to the ancient 8086 architecture—16-bit registers and addresses! For reasons of backward compatibility, x86-compatible machines still boot in this mode, just in case you have a really old OS you’d like to run. The bootstart.S file executes many magic instructions that put the machine into normal x86-64 mode, with 64-bit registers and addresses.
  • boot.c: This C file loads the kernel into memory starting at address 0x40000, then jumps to that address. It uses magical instructions that communicate with the disk as an I/O device.

The kernel consists of four main source files:

  • kernel.c, kernel.h: These files contain the core of the kernel. The most important functions are kernel(), which starts the kernel, and exception(), which handles exceptions (interrupts, faults, and system calls).
  • k-loader.c: This code loads application programs into memory. It’s not important for our OS today, but you will modify this file in Pset 4.
  • k-hardware.c, k-interrupt.S: These files are grotty x86-64-specific code for initializing hardware and connecting interrupts to the kernel. You probably don’t need to look at them.

The application consists of one file:

  • p-quotes.c: The quotes program.

There are some other files, too, that the kernel and the application share.

  • lib.c, lib.h: This is a tiny C library! Since we’re writing our own OS there is no standard C library that provides convenience functions like printf. In lieu, we must provide our own. Take a look; it’s fun to write your own C library, even though you should basically never ever do it (the system’s C library is better). Many of the functions here are just like those defined in C, so memset and snprintf, for example, act like they do in the standard C library. We included a few extras like MIN, MAX, ROUNDUP, and ROUNDDOWN.
  • x86-64.h: Structures, constants, and inline functions for interacting with x86-64 processor and hardware.

Stepping through the Quotes OS start-up

Let’s watch Quotes OS boot, up close and personal! Run make run-gdb.

Note: You may need to give GDB permission to read your s08/.gdbinit file. When we first run make run-gdb, we see:
warning: File "/home/kohler/cs61-sections/s08/.gdbinit" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/home/kohler/cs61-psets/weensyos64lab/.gdbinit".
To enable execution of this file add
	add-auto-load-safe-path /home/kohler/cs61-sections/s08/.gdbinit
line to your configuration file "/home/kohler/.gdbinit".

If you see this, follow the instructions and add a similar line to your ~/.gdbinit file. For instance:

echo add-auto-load-safe-path /home/kohler/cs61-sections/s08/.gdbinit >> ~/.gdbinit
Then make run-gdb again.

This boots up the OS, but the emulated machine stops dead when gdb connects. Your terminal window is now running gdb, and the emulated machine (QEMU) is letting you set breakpoints and examine memory before it continues to boot! Hardware emulators are just amazing. (To learn more about the ways to run the OS, check out README.md.)

In this OS (unlike your problem set OS), we change the bootloader to tight-loop until gdb connects. Here’s how to get things going:

(gdb) bootc
(gdb) si

Every time you run make run-gdb for OS08, you must run bootc and then si or c (perhaps after setting breakpoints) to get the OS to proceed.

The first command sets a memory location that allows the bootloader to proceed. The second command single-steps to the next instruction! Now you can use commands like s (step) and n (next) to move one line of C code at a time, or si to step a single instruction at a time.

Step into the boot_readseg function (or set a breakpoint there), and step through several instructions. Eventually, you’ll see some weird ones like in and out.

  1. What do the outb and insl function calls do in Linux (google Linux outb for example)?
  2. We said that we don't have a standard C library so we had to write these functions ourselves. Where are they declared/defined?

These are examples of programmed I/O instructions that talk directly to attached hardware. Generally, only the kernel can execute these special instructions. System calls for reading and writing disks, or performing other I/O, often cause the kernel to eventually execute out and in instructions. I/O programming is full of mysterious hardware definitions given to us by electrical engineers and weird constants with inscrutable meaning. It’s fun. Check out boot.c for an explanation of these particular instructions.

Now set a breakpoint in the kernel itself.

(gdb) break kernel
(gdb) continue

Congratulations, you’ve now reached the kernel! Step through a couple more functions; you’ll see the kernel set up some hardware, and then go back to the disk to load the process’s code into memory (functions in k-loader.c).

Stepping through a system call

Let’s see how a process runs. Set a breakpoint at process_main and continue until it is reached:

(gdb) break process_main
(gdb) continue

Step through carefully until you’re at the call to sys_write or set a break point at sys_write. This is the wrapper for the system call that causes a message to be printed to the screen. It's analogous to the "write" system call we use in Linux. It's a very thin wrapper:

   static inline void sys_write(const char* message, size_t sz) {
       asm volatile ("int %0" : /* no result */
                     : "i" (INT_SYS_WRITE), "D" (message), "S" (sz)
                     : "cc", "memory");
   }

We'll be talking about inline assembly in a few minutes so hang on tight. Meanwhile, just know that there's a bunch of inline assembly in this and other OS kernels.

Switch to single-instruction stepping using si. What is the value of INT_SYS_WRITE (it's a macro so you're gonna have to find it)? Continue single-instruction stepping. What happened?? All of a sudden you’re in the kernel, in something called sys_int49_handler!!

Where is sys_int49_handler defined? How did we hop there in the first place?

(Most x86-64 operating systems use a syscall instruction instead of int [interrupt_number], but either works.)

How system calls work

The kernel can’t allow a process to just call it at arbitrary locations. Instead, the kernel defines specific entry points and special instructions are used to prevent user-level applications (the third logically distinct part of an OS) from accessing the kernel in any other way. On x86_64, many operating systems use the int instruction to implement system calls. That’s not for integer, it’s for interrupt.

Here’s how system calls happen in our OS. Real OSes are similar.

  1. The application code sets up its registers and stack according to the system call’s calling convention. Our OS uses the convention that %rdi holds the system call’s first argument and %rsi holds its second argument—just like the normal calling convention. This register setup is performed by the inlined assembly code we showed above.
  2. The application executes the int instruction for the system call it wants. In our example, this is int %1 where %1 is syscall_number, the value 49.
  3. The processor hardware takes control to manage this exceptional control transfer. It saves the application’s most important special-purpose registers, namely its instruction and stack pointers (%rip and %rsp) and some associated state, especially %rflags and %cs (which defines the application’s privilege level), then it loads the kernel’s instruction and stack pointers and %cs so that the kernel can run at full privilege. The application’s state is saved on a special kernel stack. Finally, it transfers control to the location the kernel defined for this interrupt/exception. All this happens at once, as an integral part of the int instruction.
  4. Now the kernel takes control, with code in k-exception.S. First, the sys49_int_handler saves the fact that exception 49 occurred and jumps to generic_exception_handler. The generic_exception_handler saves all the rest of the machine’s registers, which hold the values set by the application, to the special kernel stack.
  5. Eventually that code transfers control to the exception function in kernel.c. This function immediately saves the process’s registers into its process descriptor, a structure in memory that holds all the kernel’s information about the process.
  6. The kernel processes the system call. This may involve changing the process’s saved registers! Our convention says that system call results are returned in %rax, just like normal function call results. But system calls are not normal function calls, so we must explicitly place the result into the process descriptor slot that holds the process’s %rax. Check out the exception function in kernel.c. What does it do?
  7. When it’s done, the kernel restarts the paused process. This involves another assembly function in k-exception.S. This function loads all the application’s normal registers and finally executes another dangerous instruction, iretq, which reloads the process’s %rip and %rsp and restarts the process where it left off!

Challenges

Step through the kernel until you’re comfortable with make run-gdb and the kernel code. Try modifying p-quotes.c. Then it’s challenge time!

  • Implement a new system call, sys_getpid, that returns the current process’s ID. You will need to change lib.h (to allocate the new system call number), process.h (to implement the process’s system call), and kernel.c (to handle the new system call). Test your system call by changing the messages that get printed.
  • Change the process code to access illegal memory, such as 0xF000000000000. This will cause a segmentation fault—but on our OS, we must handle segmentation faults ourselves. Step through: What happens? Can you change the kernel to handle the fault in some way??


GCC Inline Assembly

This is heavily sourced from: https://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html

At this point, we've seen a bunch of functions where we "inline" assembly code in our C code. What does this do and why do we do this? In the context of our operating system and in particular, our system calls, we need to be explicit about which registers to put our arguments in!

Basic Inline

The basic way to write inline assembly is simply by:

   asm("your assembly code");

For example:

   asm("movl %ebx %eax\n\t"
       "addl %eax $1);

The big safety problem here is that you are changing the values of registers! GCC has no idea that you changed these register values and this could lead to disaster!

Extended Inline Assembly

In the extended inline assembly, we are able to specify operands in our assembly code so that we mitigate the dangers of this register clobbering by specifying our output register, our input register, and our list of modified registers.

The format is:

   asm ( "assembly" 
        : output
        : input
        : list of modified registers
       );

For example:

   int increment_0(int arg1)
   {
       int b=0;
       asm("movl %1, %%eax;
            addl $1, %%eax;
            movl %%eax, %0;"
           :"=r"(b)           /* output */
           :"r"(arg1)         /* input */
           :"%eax"            /* this register changed! */
          );
       return b;
   }

This function returns arg1 + 1 but in a rather unintuitive manner. Let's first look at each of the fields. Can we explain this without fully understanding the code?

Assembler Template

the first part is called the assembler template:

   "movl %1, %%eax;"
   "addl $1, %%eax;"
   "movl %%eax, %0;"

It contains the inline assembly to be inserted. The operands corresponding to various C expressions is represented by %0 and %1.

Operands

An operand takes the form of "constraint"(C expression). In our example, "r"(arg1) is one of them. An output operand takes an additional = modifier as in "=r"(b). If there are multiple operands, for example, in the input operands, we separate them with a ,.

In the assembly template, operands refer to as numbers in the order they appear, starting with the output operand. So %0 refers to the operand "=r"(b), %1 refers to the operand "r"(arg1). If there were a second output operand and a second input operand, then the two output operands would be referred to as %0 and %1, and the two input operands would be referred to as %2 and %3.

Here are a few examples reinforcing the concept of operands. We simplify are increment instructions above:

   int increment_1(int arg1)
   {
       int b=0;
       asm("addl $1, %1;"
           "movl %1, %0;"
           :"=r"(b)           /* output */
           :"r"(arg1)         /* input */
          );
       return b;
   }

The "r" constraint of the "r"(arg1) operand indicates that we dont care what register GCC uses as an input, so long as it has the value of arg1. We'll talk more about constraints shortly. Similarly, the "=r" constraint of the "=r"(b) operand indicates that we dont care what register GCC uses as an output, so long as it has the value in this register is b's value. We refer to the register in the output as %0, and the register in the input as %1.

We dont have a list of registers that were modified by the assembly code because we've let GCC take care of selecting those registers for us. However, if we explicitly modify registers, as we did with eax in increment_0, we need to make a note of it.

Clobbered registers

The list of clobbered registers informs GCC that these registers are modified by the programmer so that GCC does not assume the values in these registers are valid. We don't add registers specified in the input and output operands because GCC already knows about them. We do include any other register used in the assembly code though.

For example, what registers are the inputs, and what registers are modified in the inline assembly code below? What operand does %0 refer to? and what operand does %1 refer to?:

       asm ("movl %0,%%eax;
             movl %1,%%ecx;
             call _foo"
            : /* no outputs */
            : "g" (from), "g" (to)
            : "eax", "ecx"
            );

In addition, from the above examples, we can specify, "cc" to tell the compiler that the flags register and "memory" to say that the assembly may perform read and write operations beyond what items are listed in the input and output operands.

Volatile

We sometimes see the volatile` keyword. It is used to force our assembly statement to execute where we put it. For example, it won't be moved from a loop as an optimization.

Constraints

The constraint in an operand is surrounded by quotes as in "r"(arg1). There are many types of constraints and so for the sake of brevity and simplicty, we will focus on register constraints. Register constraints allow you to tell which registers GCC should put data into. Here are the register constraints:


   +---+--------------------+
   | r |    Register(s)     |
   +---+--------------------+
   | a |   %eax, %ax, %al   |
   | b |   %ebx, %bx, %bl   |
   | c |   %ecx, %cx, %cl   |
   | d |   %edx, %dx, %dl   |
   | S |   %esi, %si        |
   | D |   %edi, %di        |
   +---+--------------------+

Given what the table above, what do you expect the inline assembly compiles to? Use objdump to investigate:

   int increment_register(int arg1){
       int b = 0;
       asm("addl $1, %1;\n\t"
           "movl %1, %0;\n\t"
           :"=a"(b)           /* output */
           :"b"(arg1)         /* input */
          );
       return b;
   }

Constraint modifiers provide additional control to the use of the operand. We've seen one of them: = indicates that this operand is write-only for this instruction; the previous value is discarded and replaced by output data. Another is & which indicates that this operand gets clobbered BEFORE all inputs have been used.

Challenges

  1. Write a version of strncpy using inline assembly.
  2. The exceptional control transfer expects that the arguments to system calls are found in the following registers: eax, ecx, and edx. Write a function that uses inline assembly to organize and call the interrupt. Include the function signature as well.
  3. Write a version of the write Linux system call. The function you write should conform to the function signature of the write system call. You should use inline assembly to arrange arguments for the system call.

Virtual Memory

x86_64 has a 64-bit address space. From an intuitive standpoint, it should be able to access 2^64 bits of memory - 16 exabytes to be exact. That's about 16 x 1 million terabytes. But none of our computers have that much memory, and if each process has a 64-bit address space, how can multiple processes share the same resources? If there are multiple processes, how do so many processes' data section exist in the SAME memory address?? Enter virtual memory!

In reality, the addresses you see in gdb are not the actual addresses your computer stores data in. These are virtual addresses that are mapped to physical addresses where data are actually stored. This mapping is done using a page table and is performed by the memory management unit of your CPU. This is a short introduction to these concepts.


Virtual Addresses

Although x86_64 has a 64-bit address, it only uses 48 of those bits. These bits are further grouped into 5 parts:

  • Bits 0 through 11 (the least significant 12 bits) are the page offset. We’ll say PAGEOFFSET for short.
  • Bits 12 through 20 (the next most significant 9 bits) are the level-4 page table index. We’ll say L4PAGEINDEX for short.
  • Bits 21 through 29 (the next significant 9 bits) are the level-3 page table index. We’ll say L3PAGEINDEX for short.
  • Bits 30 through 38 (the next most significant 9 bits) are the level-2 page table index. We’ll say L2PAGEINDEX for short.
  • Bits 39 through 47 (the next most significant 9 bits) are the level-1 page table index. We’ll say L1PAGEINDEX for short.

Below is an illustration of the bits in a 64 bit virtual address:

 63             48 47     39 38     30 29     21 20     12 11         0
 +----------------+---------+---------+---------+---------+------------+
 |                | LEVEL 1 | LEVEL 2 | LEVEL 3 | LEVEL 4 | PAGE OFFSET|
 |                | PAGE    | PAGE    | PAGE    | PAGE    |            |
 |                | INDEX   | INDEX   | INDEX   | INDEX   |            |
 +----------------+---------+---------+---------+---------+------------+

The C code below shows how to calculate each value from a 64-bit memory address.

#define PAGESIZE (1 << 12)

uint64_t PAGEOFFSET(uint64_t va) {
   return va & (PAGESIZE - 1);
}

uint64_t L4PAGEINDEX(uint64_t va) {
   return (va >> 12) & 0x1FF;
}

uint64_t L3PAGEINDEX(uint64_t va) {
   return (va >> 21) & 0x1FF;
}

uint64_t L2PAGEINDEX(uint64_t va) {
   return (va >> 30) & 0x1FF;
}

uint64_t L1PAGEINDEX(uint64_t va) {
   return (va >> 39) & 0x1FF;
}

If you have an address in hexadecimal 0xDEFGHIJKLMNOPQRST (where each capital letter represents a hexadecimal digit), then:

  • PAGEOFFSET(0xDEFGHIJKLMNOPQRST) == 0xRST. (the lower 3 hex digits.)
  • L4PAGEINDEX(0xDEFGHIJKLMNOPQRST) == (0xOPQ & 0x1FF). (the next 3 hex digits, except that you need to mask off the top three bits, to get a number between 0x000 and 0x1FF)
  • L3PAGEINDEX(0xDEFGHIJKLMNOPQRST) == (0xMNO & 0x3FE). (the next 3 hex digits, but you mask off the bottom bits because it is used by the L4 and the top two bits for a total of 9)
  • L2PAGEINDEX(0xDEFGHIJKLMNOPQRST) == (0xKLM & 0x7FC). (mask off the bottom 2 bits already used by the L3 page table and the top bit.
  • L1PAGEINDEX(0xDEFGHIJKLMNOPQRST) == (0xIJK & 0xFF8). (mask off the bottom 3 bits used by the L2 page table).

Questions

Part A. What is:

  1. PAGEOFFSET(0)?
  2. L2PAGEINDEX(0)?
  3. L1PAGEINDEX(0x9000)?
  4. PAGEOFFSET(1023)?
  5. L4PAGEINDEX(0x0000000070000000)?
  6. L4PAGEINDEX(0x0000000070000FFF)?
  7. L3PAGEINDEX(0x0000000070000FFF)?
  8. L2PAGEINDEX(0x0000000070000FFF)?
  9. PAGEOFFSET(0x0000000000801000)?
  10. L4PAGEINDEX(0x00F0089A00801000)?
  11. L1PAGEINDEX(0x00F0089A00801000)?
  12. PAGEOFFSET(0x0080100000F0089A)?
  13. L3PAGEINDEX(0x0080100000F0089A)?
  14. L2PAGEINDEX(0x0080100000F0089A)?

Part B. For each question, give a virtual address that satisfies the constraints.

  1. PAGEOFFSET = 0
  2. PAGEOFFSET = 0 and L1PAGEINDEX = 12
  3. L1PAGEINDEX = 5, L2PAGEINDEX=64, L3PAGEINDEX=32 L4PAGEINDEX = 128, PAGEOFFSET = 256

Part C. Your problem set defines the following function. Can you define it as a single expression?

int PAGEINDEX(uint64_t va, int level) {
   // Returns the level page index of va.
   // level==0 returns L1PAGEINDEX, level==1 returns L2PAGEINDEX, etc.
   return YOUR CODE HERE;
}

Pages

 +----------------+---------+---------+---------+---------+------------+
 |                | LEVEL 1 | LEVEL 2 | LEVEL 3 | LEVEL 4 | PAGE OFFSET|
 |                | PAGE    | PAGE    | PAGE    | PAGE    |            |
 |                | INDEX   | INDEX   | INDEX   | INDEX   |            |
 +----------------+--+------+---------+---------+---------+------------+
                     |
                     | +-------+
                     | | L1    |
                     | | PAGE  |
                     | | TABLE |
                     | +-------+
                     +->       |
                       +-------+
                       |       |
                       +-------+

The value of the L1 page index identifies an entry in the L1 Page Table. There are 512 entries in L1 Page Table. Each entry is a 8 bytes. Given this, the L1 page table is 512 entries * 8 bytes per entry = 4096 bytes large. Each of this page table's contains the address to another page table. This next page table is the Level 2 page table. Other L2 page tables may also exist but for this particular address, only the L2 page table pointed to by the L1 page table identified by the L1 index is relevant.

 +----------------+---------+---------+---------+---------+------------+
 |                | LEVEL 1 | LEVEL 2 | LEVEL 3 | LEVEL 4 | PAGE OFFSET|
 |                | PAGE    | PAGE    | PAGE    | PAGE    |            |
 |                | INDEX   | INDEX   | INDEX   | INDEX   |            |
 +----------------+--+------+---------+---------+---------+------------+
                     |
                     | +-------+
                     | | L1    |
                     | | PAGE  |
                     | | TABLE |
                     | +-------+
                     +->       +------->+------+
                       +-------+        |      |
                       |       +----+   |      |
                       +-------+    |   |L2    |
                       |       |    |   |PAGE  |
                       +----+--+    |   |TABLE |
                            |       |   |      |
                          +-v-+   +-v-+ |      |
                          |   |   |   | |      |
                          |   |   |   | |      |
                          |   |   |   | +------+
                          +---+   +---+
 

Just as the L1 Page Index is the index into the L1 page table containing the page table entry (PTE) that points to the appropriate L2 page table for this address, the L2 page index (bits 30-38 in the address) contains the index of L2 page table that points to the appropriate L3 page table! And the same thing goes on until we reach the L4 page table.

 +----------------+---------+---------+---------+---------+------------+
 |                | LEVEL 1 | LEVEL 2 | LEVEL 3 | LEVEL 4 | PAGE OFFSET|
 |                | PAGE    | PAGE    | PAGE    | PAGE    |            |
 |                | INDEX   | INDEX   | INDEX   | INDEX   |            |
 +----------------+---------+--+------+---------+---------+------------+
                               |
                               |
                               | +-------+
                               | | L2    |
                               | | PAGE  |
                               | | TABLE |
                               | +-------+
                               +->       +------->+------+
                                 +-------+        |      |
                                 |       +----+   |      |
                                 +-------+    |   |L3    |
                                 |       |    |   |PAGE  |
                                 +----+--+    |   |TABLE |
                                      |       |   |      |
                                    +-v-+   +-v-+ |      |
                                    |   |   |   | |      |
                                    |   |   |   | |      |
                                    |   |   |   | +------+
                                    +---+   +---+
 

In summary, the data structured used to move from table to table is a radix tree and it has a fan-out of 512. The L1 page table is the root node of this tree and can have up to 512 child nodes consisting of L2 page tables. Each L2 page table has up-to 512 child nodes consisting of L3 page tables. Each L3 page table has up-to 512 child nodes consisting of L4 page tables. All of these page tables occupy a single physical memory page and all page tables are arrays of 512 8-byte addresses.

With this in mind, there are a few important points to remember:

  • The x86_64 page size is 4096 bytes = 0x1000 bytes.
  • A page is 4096 bytes of contiguous memory whose first address is a multiple of 4096. For instance, we might refer to “physical page 0x1000.” This means the physical page of memory comprising addresses 0x1000 through 0x1FFF.
  • Not every 4096 bytes of contiguous memory is a page. For example, consider 4096 bytes of memory starting at physical address 0x1800. The byte range is 0x1800 through 0x27FF. This overlaps with two physical pages: the portion from 0x1800–0x1FFF is part of physical page 0x1000, and the portion from 0x2000-0x27FF is part of physical page 0x2000.
  • Any single byte belongs to exactly one page. Two contiguous bytes may belong to 1 or 2 physical pages, it depends how they’re aligned.
  • Each CPU has a special register called the Current Page Table Register. It holds the physical address of the currently active level-1 page table (referred to as the Page Map Level 4). This register is called %cr3.

Virtual Address Translation

To translate a 64-bit virtual address to a physical address, the following steps occur. Note that we are ignoring page table entry flags for now.

  1. Split the virtual address into its components (PAGEOFFSET, L4PAGEINDEX, L3PAGEINDEX, L2PAGEINDEX, L1PAGEINDEX).
  2. Start at the root page table L1, the address of the L2 page table is at L1[L1PAGEINDEX].
  3. Moving to L2, the address of the L3 page table is at L2[L2PAGEINDEX]
  4. Moving to L3, the address of the L4 page table is at L3[L3PAGEINDEX]
  5. Moving to L4, the Physical Page Number (PPN) is at L4[L4PAGEINDEX]. This points to the physical page where the data can actually be found.
  6. The starting physical byte where the data is actually stored is at address is PPN + PAGEOFFSET.

In C pseudocode, we might write it this way. Again, this version is incomplete because we are ignoring flags.

typedef struct x86_64_pagetable {
   x86_64_pageentry_t entry[512];
} x86_64_pagetable;

typedef struct vamapping {
   uintptr_t pa;
} vamapping;

vamapping virtual_memory_lookup(x86_64_pagetable* pagetable, uintptr_t va) {
   unsigned l1idx = L1PAGEINDEX(va);
   x86_64_pageentry_t l1pte = pagetable->entry[l1idx];   // physical lookup

   unsigned l2idx = L2PAGEINDEX(va);
   x86_64_pageentry_t l2pte = ((x86_64_pagetable*) l1pte)->entry[l2idx];   // physical lookup

   unsigned l3idx = L3PAGEINDEX(va);
   x86_64_pageentry_t l3pte = ((x86_64_pagetable*) l2pte)->entry[l3idx];   // physical lookup

   unsigned l4idx = L4PAGEINDEX(va);
   x86_64_pageentry_t PPN = ((x86_64_pagetable*) l3pte)->entry[l4idx];   // physical lookup

   return (vamapping) { pa: PPN + PAGEOFFSET(va) };
}

The full picture of the process of virtual address translation can be visualized as:

Virtual address translation.PNG

Page Table Entries and Flags

We mentioned that we ignored page table entry flags in the above examples. We'll discuss them briefly here. Each page table entry actually has a bunch of information in addition to the address of the page table. The L4 page table entries look like this (from our the 2nd version of our text book):

L4pte.PNG

The L1 through L3 page table entries look very similar. The lower 12 bits of every entry are used for flags. These flags determine whether pages exist and set access permissions. Of particular relevance to the class are the first three bits of the entry we usually call permission bits.

PTE_P == 1: Present. If the first bit is set, this virtual page is present in physical memory. If not set, the page doesn’t exist (and the rest of the entry isn’t used).

PTE_W == 2: Writable. If the second bit is set, this virtual page may be written. If not set, any attempt, by the kernel or by a process, to write to this virtual page will cause a page fault.

PTE_U == 4: Unprivileged (or User ). If the third bit set, this virtual page may be accessed by unprivileged user processes. If not set, any attempt by a process to read or write this virtual page will cause a page fault. The kernel can still use the page.

With flags, the lookup procedure within each page table entry is as follows:

  1. Split virtual adress into its components (PAGEOFFSET, L4PAGEINDEX, L3PAGEINDEX, L2PAGEINDEX, L1PAGEINDEX).
  2. Start at the root page table L1, the page table entry, L1PTE is at L1[L1PAGEINDEX].
  3. Examine L1PTE's permission bits:
    1. If the present bit is not set, induce a page fault exception.
    2. If the write bit is not set and the access is a write, induce a page fault exception.
    3. If the privilege bit is not set and the access is by an unprivileged process, induce a page fault exception.
  4. Otherwise, clear the lowest 12 bits of the page table entry and treat the result as the physical address of another page table (the L2).
  5. Repeat from step 2 for the child (L2 through L4) page tables.

The following is an example of code that would follow this process for a two level page table scheme on a 32-bit x86 machine. How would we modify this code for x86_64?

vamapping virtual_memory_lookup(x86_pagetable* pagetable,
                 bool iswrite, bool isunprivileged,
                 uintptr_t va) {
   uintptr_t l1pte = pagetable->entry[L1PAGEINDEX(va)];
   if (!(l1pte & PTE_P)
       || (iswrite && !(l1pte & PTE_W))
       || (isunprivileged && !(l1pte & PTE_U)))
       return (vamapping) { isfault: 1};
   l1pte &= ~(PAGESIZE - 1);

   uintptr_t l2pte = ((x86_pagetable*) l1pte)->entry[L2PAGEINDEX(va)];
   if (!(l2pte & PTE_P)
       || (iswrite && !(l2pte & PTE_W))
       || (isunprivileged && !(l2pte * PTE_U)))
       return (vamapping) { isfault: 1 };
   l2pte &= ~(PAGESIZE - 1);
   return (vamapping) { isfault: 0, pa: l2pte + PAGEOFFSET(va) };
}