CPSC 422 Spring 2010 Lab 3: Virtual Memory

Handed out Monday, February 15, 2010
Due Thursday, February 25, 2010

Introduction

In this lab, you will add paged virtual memory management to PIOS. So far PIOS's "processes" have all been running in the same address space as the kernel, with full ability to read or modify any part of physical memory, which of course makes the whole system vulnerable to bugs or misbehavior in any process even when processes are executing in user mode. We will now use the x86's page-based address translation facilities to give each process an independent user-level address space, providing remaining key ingredient for protection between processes by preventing processes from accessing either the kernel's or other processes' address spaces. We will also enhance PIOS's system call API to allow a process to copy data into and out of child processes, using copy-on-write optimization to minimize the cost of copying, These system call enhancements will allow the parent process not only to "fork" child processes with cloned address spaces as in Unix, but also — moving a step beyond typical Unix APIs — allow the parent to merge results that a child process computes directly back into the parent's own address space without having to communicate indirectly through pipes or files.

This lab contains the following implementation components:

  1. Page Table Management: Building paging structures for the kernel and for user processors, and enabling page translation.
  2. Loading and Running an ELF Executable: Loading an ELF executable image into the first user-space process and preparing it for execution.
  3. User Space Copyin/Copyout: Copying memory-based system call arguments out of or into user space while protecting against invalid arguments or traps.
  4. Memory Management with Copy-on-Write: Efficiently "copying" memory via read-only shared mappings, and lazily copying actual page content only on demand.
  5. Virtual Memory Merge: Merging memory changes made in one process since a memory snapshot into another process's address space.

Software Setup

In this lab you will build on the kernel you started in lab 2. Use the same procedure as in the previous lab to create a lab3 branch in your Git repository, fetch our skeleton source code for lab 3 from the master Git repository, and merge your lab2 solutions into it as your starting point for lab 3:

$ cd lab
$ git commit -am 'my solution to lab2'
$ git pull
$ git checkout -b lab3 origin/lab3
$ git merge lab2

Lab 3 contains the following new source files, which you should browse through and familiarize yourself with:

kern/pmap.{c,h} Page mapping and virtual memory code template
lib/entry.S Entrypoint code in assembly language for user-level processes
lib/debug.c Debugging support code for user-level processes

Part 1: Page Table Management

Before doing anything else, make sure you thoroughly understand the x86's protected-mode memory management architecture for both segmentation and page translation.

Exercise 1. Read chapters 3 of the IA-32 System Programming Guide, if you haven't done so already. Pay especially careful attention to the details in sections 3.6, 3.7, and 3.11.

Next, carefully review all the paging-related definitions in the PIOS source files inc/mmu.h and kern/pmap.h. The former in particular has a variety of macros and constants that will be extremely useful in this lab if used appropriately.

Setting Up the Kernel's Address Space

Once paging is enabled on an x86 processor, it is enabled all the time, whether the processor is executing in kernel or user mode. While some processor architectures give the kernel a "special" way to access physical memory directly while running in kernel mode, the x86 is not such an architecture. Once an x86 kernel enables paging, the only way any running code, including the kernel, can access physical memory or I/O devices is through the paged memory management system. This means that to use page memory management at all, the kernel must first initialize at least one page directory/page table structure with the mappings it will need in order to continue executing, and must load this paging structure into the processor before enabling paging.

In PIOS, the kernel expects to access physical memory and I/O devices at virtual addresses identical to their physical addresses. This means that before enabling paging we will have to create a set of identity mappings, which map a given virtual address to the same physical address.

Exercise 2. Replace the panic in pmap_init() with code to set up identity-mappings for the kernel portion of PIOS's virtual address space, which is all the virtual address space below VM_USERLO and above VM_USERHI. The easiest way to do this is to use 4MB "super-pages", as described in the IA-32 System Programming Guide. Also, since these page mappings will never need to change when the processor context switches among user processes, it is worthwhile to mark these mappings global, again as described in the System Programming Guide: that way, the processor knows not to flush these mappings when the kernel reloads the CR3 register to change address spaces.

You should initialize all page directory entries corresponding to the user-mode address space, between VM_USERLO and VM_USERHI, to the constant PTE_ZERO, defined earlier in pmap.c. Note that this constant is not the same as NULL: it is the (nonzero!) physical address of a particular page that the kernel will keep permanently set to all zeros. Using PTE_ZERO instead of NULL in unmapped page directory and page table entries will slightly simplify code later in this lab, since the kernel can safely enable read permission on PTE_ZERO mappings and allow user code to read (but not write!) this all-zero page.

Hint: Be sure to take full advantage of the many macros we have provided in inc/mmu.h to manipulate paging structures. Appropriate use of macros such as these will make your code much simpler and easier to understand.

We have included code further down in pmap_init() to enable the processor paging features that PIOS uses (namely 4MB pages and Global mappings), load the bootstrap page directory into the CR3 register, and turn on paging. Once you have correctly initialized the bootstrap page directory, you should be able to execute past the final lcr0 instruction. If something is wrong with your page directory, the processor is likely to triple-fault and reset at this instruction, because once paging is enabled with a bad page directory, the processor is unlikely to be able to do just about anything — including dispatch a page fault to the kernel's page fault handler.

Per-Process Page Directories

Now that the kernel can execute with paging enabled, it is time to start building the mapping structures we will need to give each user process its own independent address space. Recall that the x86's basic 32-bit paging system uses a two-level mapping structure: a page directory represents the entire 32-bit address space, and the page tables that the page directory refers to each contain page mappings for a 4MB (2^22 byte) address region.

In PIOS, each process will always have two page directories exclusively for the use of that process: a working page directory (proc.pdir) and a reference page directory (proc.rpdir). The working page directory is the one the kernel loads into the processor when executing that process. We will see the purpose of the reference page directory in part 5 of this lab; for now all you need to know is that both page directories need to be allocated when a new process is allocated, and both get initialized to a copy of the pmap_bootpdir: i.e., with the standard mappings for the kernel part of the address space and an empty user address space region.

Exercise 3. Add code to proc_alloc() to allocate and initialize a process's working and reference page directories when the process is created. We have pmap_newpdir() and pmap_freepdir() functions in pmap.c that may be useful.

Page Table Management

You will now implement the "boilerplate" code the kernel will need to create and manage full two-level page tables for user processes. This basic management code consists of the following three functions: Since both pmap_insert() and pmap_remove() can affect page mappings that have already been used by the processor and loaded into the processor's translation lookaside buffer (TLB), these functions also ensure that all the affected mappings are flushed from the TLB if the address space being modified is the address space of the currently running process.

In most multiprocessor operating systems, operations like pmap_insert() and pmap_remove() might have to flush not only the current processor's TLB, but the TLBs of other concurrently running processors. This procedure is known as TLB shootdown. Why do they have to do this? Think about what happens if multiple user-level threads share the same process's address space space. What characteristics of PIOS's process model make remote TLB shootdown unnecessary in PIOS's case?

Exercise 4. Implement the above three functions in kern/pmap.c. Be careful to maintain all reference counts properly, both when allocating and releasing page tables and when inserting and removing page mappings. Be careful to set all the permission bits correctly at both the page directory and page table levels. Also, when iterating through the address space in pmap_remove, be careful to handle the nontrivial cases properly: e.g., when given an address region that does not start or end on 4MB boundaries represented by particular page tables, but may nevertheless cover entire 4MB regions whose page tables must be unmapped and released.

For now, you can ignore the text in the comment for pmap_walk about copying read-shared page tables: you will deal with that, if necessary, in part 4 of the lab.

When you have completed this exercise correctly, you should be able to get through pmap_check() successfully.

Part 2: Loading and Running an ELF Executable

Now that you have code to set up paging structures, it's time to set up some paging structures: namely those required to run a "root" user-mode process in user address space and from an executable image separately linked from that of the kernel. We have provided a test program, user/testvm.c, which exercises and tests your virtual memory system, but you need to load that executable into the root process's address space and start it running.

You already encountered Executable and Linkable Format (ELF) files earlier while exploring the xv6 and PIOS boot process: the boot loader code in boot/main.c already loaded the kernel into physical memory from an ELF executable. Now you will write a very similar, and not much more complicated, loader in the kernel to load the first user-level program into memory. Now would be a good time to have a close look at the ELF specification and the definitions in inc/elf.h.

The main differences between what the boot loader did for the kernel and what your kernel code needs to do now are:

Besides describing the segments to be loaded into memory, the ELF header also indicates where the program should start executing: i.e., the user-level EIP of the first instruction in the program. In addition to loading the program itself as described by the ELF image, the kernel will need to give the root process a stack to execute on. A small one-page stack should be sufficient; the root process can later allocate a bigger stack for itself if it needs one. The high end of the user virtual memory area — the last page just before VM_USERHI — would probably be a suitable place for the root process's stack.

Exercise 5. Implement a root program loader in kern/init.c, by replacing your lab 2 code to start a root process executing at user() in the kernel's address space (which should no longer work now that the kernel's address space is inaccessible when the processor is running in user mode). The PIOS makefiles have linked directly into the kernel a binary copy of the ELF executable for the root process, which you can find using the "magic" linker-defined symbol _binary_obj_user_testvm_start referenced at the top of kern/init.c.

Hint: There are at least two general approaches to the loading process:

  1. For each virtual page affected by a program segment, allocate and map a physical page appropriately, and copy the correct data and/or zeros into that page using the page's physical address as the destination. This approach doesn't (yet) require the mappings to work, but slicing and dicing program segments into pages can require moderately complicated arithmetic.
  2. First allocate and map all the pages a program segment covers, then initialize the segment all at once by accessing it at its virtual address. This approach may make the loading arithmetic simpler, but you'll need to make sure the processor is using the correct page directory — and how do you write to the virtual mapping of a program segment that (eventually) needs to be read-only?

Hint 2: Testing your program loader is likely to reveal bugs in your mapping structure management code. If something is wrong with your mappings, the processor will probably take a page fault, so set a breakpoint at trap() to catch these. Also, make sure your process management code does something sensible when trying to "reflect" a trap that occurs in the root process: since the root process has no parent to reflect the trap to, you might want to dump the trapframe and panic, for example. When the root process "returns" gracefully via sys_ret(), however, your kernel should simply call done(), which will help the grade scripts know when the root process is finished with all tests.

Once you have the program loader working, you should be able to step through proc_run() and into the user mode process. GDB initially won't know where you are, because testvm was linked separately from the kernel and GDB only has symbols for the kernel. You can fix this problem by typing this command into GDB:

add-symbol-file obj/user/testvm 0x40000000

This will augment the debugging symbols GDB already has for the kernel with the debugging symbols contained in the ELF image obj/user/testvm. The 0x40000000 tells GDB where testvm is loaded in virtual memory. GDB requires this argument because this command is normally used to load symbol tables for shared libraries, which are normally position-independent code (PIC). PIOS's root process is not position-independent, so GDB technically doesn't need the load location argument in our case, but GDB apparently doesn't know that.

Challenge: Add proper support for page-level execute permissions when running on AMD processors, via AMD's extension for "No Execute" bits in page table entries. This isn't as easy as it may sound, unfortunately, because at the time AMD introduced this feature there were no more bits available in 32-bit page table entries, so the "No Execute" bit is available only in conjunction with Intel's "Page Address Extensions" to support 64-bit PTEs.

Background: A number of years ago, long before the introduction of true 64-bit x86 processors, the 32-bit physical address space introduced by the 80386 processor started getting cramped, and hardware vendors wanted to build PCs with more than 4GB of RAM — even though users were still running 32-bit operating systems. In response to this demand, Intel created the Page Address Extensions (PAE), which allow 32-bit operating systems to use 64GB (36 bits worth) of physical RAM. All this RAM obviously can't be mapped into the kernel's — or any single user process's — 32-bit address space at once, but it can be used if the kernel doesn't need to have all physical RAM mapped into its address space all the time (as PIOS does) and if this physical RAM is distributed among several user processes.

PAE works by rearranging the paging structures: it increases all page table entries from 32 to 64 bits in size, thus halving the number of entries per page table or page directory, while making room for more physical address bits and other features. But halving the number of entries meant one page table level could translate only 9 bits of virtual address rather than 10, thus necessiating a third (small) level of translation. This "level 3" page table, called the page directory pointer table, contains the four "page directory pointers" necessary to map a full 32-bit virtual address space with 64-bit PTEs. Thus, making use of PAE does not exactly represent a trivial change to the kernel's page table management code, although nothing has changed fundamentally.

AMD later enhanced PAE mode further by adding the ability to disable code execution at page granularity, via a new "No Execute" (NX) bit in each PTE. This was touted as a major security feature, because it makes it more difficult for viruses and other malware to exploit buffer overflow bugs by injecting code into the heap or stack and then causing that code to be executed (from the heap or stack). Both Intel and AMD now support execute-disable in the new 64-bit mode, although only AMD supports it in 32-bit PAE mode (see AMD's latest architecture manuals for details). So if you try this challenge problem, make sure you have an AMD processor to test on (or be prepared to rewrite your kernel to run in 64-bit mode)!

Part 3: User Space Copyin/Copyout

Now that virtual memory has provided us some hope of protecting the kernel's state from user-level processes and protecting processes from each other, we need to reconsider how the kernel interacts with user-level code during system calls. System calls generally have arguments, which need to be passed from user space to the kernel. PIOS system calls pass simple arguments in registers: the user-mode system call stubs in lib/syscall.c just load the arguments into the appropriate registers before executing the INT T_SYSCALL instruction, and the kernel's system call handling code in kern/syscall.c retrieves these arguments from the user-level trapframe that the trap entrypoint code pushed on the kernel stack.

Many system calls take arguments that don't fit in registers, however, such as the string argument to sys_cputs and the cpustate pointer to sys_put and sys_get. For such arguments, PIOS's user-space system call stub just leaves the argument data in user space and passes a pointer to the data in a register. The kernel then needs to read the contents of the argument data from user space — or write system call results into user space, in the case of output arguments such as the cpustate structure that sys_get fills in. But what if the user code passes an invalid pointer for such an argument? Consider what would happen in your current system call handling code if a user mode program:

The Confused Deputy Problem

This issue is one specific instance of a very general security issue called the confused deputy problem. In short, a trusted "deputy" (the kernel in this case) has multiple sets of authorities — namely the ability to access both kernel space and user virtual memory &mdash. The kernel has only one "namespace" of virtual addresses with which to access memory, however: kernel space and user process space are mixed into one 32-bit address space whose boundaries are defined only by the kernel programming conventions. Because the kernel needs to access memory under multiple authorities (i.e., wearing different "hats"), but has no way to associate the memory accesses it performs with the authority it intends to use when performing that access, without extreme care a user process can "confuse" the kernel into exercising its authority to access kernel space when it only intends to use its authority to access user space, thereby fatally compromising the kernel's security.

Security issues of this kind are a direct consequence of PIOS's use of page-level protection bits to distinguish kernel space from user space, because clearing the PTE_U in a PDE or PTE prevents user code from directly accessing privileged memory, but does nothing to prevent the kernel from accidentally accessing privileged memory when it is reading or writing data on behalf of the user using user-provided pointers.

In at least this one respect, the 80286's old segmentation-based virtual memory system may have actually provided a fundamentally cleaner and more secure design. When segmentation-based software passes pointers among system components, it typically does so using "far pointers", which encapsulate both a segment selector and an offset in a single "pointer object." The x86 processor's rules for handling segment registers ensure that user-level (ring 3) code can never load a segment associated with a higher privilege level. Unprivileged code cannot even "inherit" access to more-privileged segments from more privileged code that might accedentally leave privileged segments loaded in segment registers when returning to less privileged code, because the processor clears any such segment registers whenever it transfers control to a lower privilege level (via IRET for example). Therefore, if user code passes a far pointer to a kernel system call by passing the segment in a segment register and the offset in a normal register, the kernel may safely use that far pointer to access the user's segment, because the segment selector acts as a capability that binds the name of the memory to be accessed to the authority under which it should be accessed. In this design, user code cannot trick the kernel into accessing memory that the user code itself could not access because the user code must explicitly pass the authority with which the memory is to be accessed (the segment selector), it cannot pass an authority to which it does not itself have access (by virtue of being able to load the segment selector), and the kernel uses only the explicitly passed authority (the segment selector passed by the user) when accessing the memory named by that far pointer. See the Confused Deputy paper for other illuminating discussion of this issue.

Protecting PIOS System Calls

Since neither segmentation-based nor capability-based systems took off, however, we unfortunately live in a word in which kernels simply have to be extraordinarily careful whenever they access memory on behalf of a user process — or when any program with special privileges does anything with any kind of name supplied by a less-trusted program, for that matter. To fix PIOS's gaping security vulnerabilities, we need to do two things:

PIOS handles memory access errors caused by the user during system calls by behaving as if the INT T_SYSCALL instruction itself had caused the corresponding memory access error. For example, if the user issues a SYS_GET system call requesting that the kernel write a cpustate structure to an invalid or unmapped address, the user process will take a page fault exactly as if it had instead executed a MOV instruction that directly tried to write to the illegal address. This approach to handling memory access faults during system calls is merely PIOS's approach: Unix systems more commonly just cause the system call to return with an error code such as EFAULT in such a situation, which has advantages (it's a bit simpler for the user code to catch a simple error return than a simulated trap) and disadvantages (the user code is more likely just to ignore the return code, even though it likely indicates a serious program failure, and perhaps cause the error cause even more damage).

Exercise 6. Implement systrap, sysrecover, checkva, and usercopy in kern/syscall.c, to provide the logic necessary to access user memory safely using user-provided pointers. Then modify the system call handlers to use usercopy when copying system call argument data into or out of user space.

Part 4: Memory Management with Copy-on-Write

We have created a root process in a fully virtual address space, and finally taken proper measures to protect the kernel from user processes; what's still missing is a way for user processes themselves to change their own virtual address spaces, to set up virtual address spaces for their child processes, and to communicate data inputs and results with their children. Instead of adding more new system calls for this purpose, we will simply extend the existing GET and PUT system calls to enable the caller to perform memory management operations along with the existing functions of these system calls. This way, processes can easily combine several related operations into one GET or PUT system call for efficiency: e.g., a parent can set up both a child's register memory state and start the child running with one PUT system call, and can later synchronize with the child, retrieve its register state, and retrieve results from its virtual address space with one GET system call. The process specifies memory management operations to perform by ORing the following values into the system call command code:

Copy-on-Write and Nominal versus Actual Page Permissions

When the kernel performs a "bulk" virtual address space copy as requested by a SYS_COPY or SYS_SNAP operation, it uses copy-on-write to make the copy "appear" to happen almost instantaneously, merely after rearranging some virtual memory mappings, while deferring the actual "hard labor" of copying physical memory until the affected processes actually start writing to the copied pages. When the kernel performs a virtual copy, for each "copied" page the kernel effectively just:

  1. copies the page mapping rather than the page itself, so that the source and destination mappings actually refer to the same physical page in memory.
  2. increments the reference count on each page shared this way, to keep track of the number of references that exist to each page.
  3. sets both the source and destination mappings to read-only, to prevent either process from directly modifying the shared page and shattering the illusion that the two processes really have separate and independent copies of the page's content.
When either process tries to write to a page that was writable but is now read-only because of the kernel's copy-on-write optimization, the kernel needs handle this page fault transparently to the application. Specifically, the kernel allocates a new page, copies of the shared page's content into the new page, redirects the faulting mapping from the old, shared page to point to the new, exclusive page, and updates the pages' reference counts accordingly (releasing a reference to the old shared page and adding one to the new page). Since the new page is now owned exclusively by the process that took the page fault, the kernel can now safely reinstate write permission on the new page.

But how does the kernel keep track of which pages were writeable before it did a virtual copy, and which pages were never supposed to be writeable, e.g., because they are part of a read-only segment loaded from an ELF executable that contains program code or constant data? To make this distinction correctly we must associate two sets of permissions with each page: nominal and actual permissions.

The SYS_PERM operation described above affects the nominal page permissions of the destination memory region, which may sometimes be different from the actual page permissions of some or all of those virtual memory pages. The nominal page permissions of a range of virtual memory are the page permissions as specified and seen by user-level processes, whereas the actual page permissions are the "kernel-internal" permissions that the processor sees in the PTE_P and PTE_W bits of page table entries. A page's nominal and actual page permissions are often the same, but may differ when the kernel performs copy-on-write or other virtual memory tricks transparently to user mode code. For example, when the kernel uses copy-on-write to copy a memory page whose nominal permissions are SYS_READ | SYS_WRITE, the resulting two copies of the page mapping continue to have nominal permissions of SYS_READ | SYS_WRITE, but the kernel gives these mappings actual page permissions of only PTE_P but not PTE_W. When one of the user processes attempts to modify that page, the processor takes a page fault because the actual permissions do not allow writes. The kernel then notices that the mapping's nominal do allow writes, so it copies the shared page and raises the mapping's actual permissions once again to equal the nominal permissions.

If the user-specified nominal permissions on a page are only SYS_READ, however, then the kernel's page fault handler does not automatically raise the page's actual permissions on a write fault, since this nominal permission indicates that from the user's perspective the page should not be writable. A write to a page with nominal permissions of SYS_READ probably indicates a logical software bug — or at least means that user-level code does not want writes to happen to that page — so the kernel must reflect a write fault on that page to the parent process just as it would with other exceptions, rather than attempting to handle the fault transparently.

If you look at the definitions of SYS_READ and SYS_WRITE in inc/syscall.h and compare them with the PTE definitions in inc/mmu.h, you will notice that SYS_READ and SYS_WRITE both lie in the PTE_AVAIL part of the page table entry, which the processor ignores and leaves available for software use. This design is intentional: it makes it easy for the kernel to record each page's nominal permissions (as represented by the SYS_READ and SYS_WRITE bits) in the PTE_AVAIL portion of each page table entry, without interfering with the PTE's actual page permissions in the lower bits or with the physical address in bits 12-31.

Implementing PIOS Memory Management Operations

You now should have the information required to implement PIOS's memory management API. When implementing it, keep in mind that although it has a number of features described above, many of those features are fairly orthogonal, and their implementations can be kept relatively simple as long as you take advantage of their orthogonality. For example, even though a single PUT system call can potentially do a SYS_COPY, a SYS_PERM, and a SYS_SNAP (not to mention any or all of the other process management operations you implemented in lab 2), the kernel does these operations one at a time and independently. The implementation of one operation doesn't need to know if or how the other operations were or will be performed, as long as you simply perform them in the correct order as described above.

Exercise 7. Implement the pmap_copy() function in kern/pmap.c, which provides the basis for copy-on-write used by SYS_COPY (and SYS_SNAP, described below, but you don't have to worry about that while implementing pmap_copy()).

Hint: Keep in mind the API restriction stated above that both the source and destination memory regions are always aligned on 4MB boundaries. This API restriction makes the virtual copy operation much simpler: if you need more than about 50 lines of code, you are probably doing something wrong.

Hint: One design choice you will have to make is whether to perform copy-on-write optimization on page tables as well as on pages. Performing virtual copies of entire page tables makes the initial virtual copy operation even simpler and faster, but means that page tables may become shared read-only among multiple page directories, and you will have to modify pmap_walk() to copy and "un-share" a read-shared page table it encounters when called with the writing flag set to true. For simplicity, we recommend that you not share page tables in your initial implementation — just make pmap_copy directly copy all the individual page mappings from the source page table to the destination page table — and then perhaps try implementing copy-on-write for page tables once you have basic copy-on-write working.

Exercise 8. Now implement a kernel page fault handler to copy-on-write faults transparently to user mode code. We have provided a template function pmap_pagefault() in kern/pmap.c; you will need to fill it in and hook it into trap() at the appropriate point.

Hint: Think very carefully about exactly where the kernel's main trap handler should call pmap_pagefault(), especially in relation to the trap handler's recovery mechanism (the part that looks at the cpu_cur()->recover pointer). Suppose that a user-mode process asks the GET system call to deposit the child process's CPU register state into a memory buffer that has recently been copied via copy-on-write, and thus has nominal permissions of SYS_READ | SYS_WRITE but actual permissions of only PTE_P (but not PTE_W). How should the kernel handle the page fault that results during the system call - who is "at fault", so to speak, and who should handle the fault?

Exercise 9. Implement the memory operations SYS_ZERO and SYS_COPY, as described above. But be sure to check the memory region arguments for validity: e.g., they should not allow user processes to modify or copy data out of the kernel part of the address space. If user code attempts such evilness, the system call handler should call systrap() to issue a general protection fault (T_GPFLT).

Hint: After the above argument validity checking, SYS_ZERO is basically just a pmap_remove(), and SYS_COPY is basically just a pmap_copy().

Exercise 10. Finally, implement the SYS_PERM operation, which happens after the SYS_ZERO or SYS_COPY (if requested) in a GET or PUT system call.

Hint: Take full advantage of the fact that "empty" page table entries in new page tables allocated by pdir_walk(), as well as page table entries removed by pmap_remove(), are set to PTE_ZERO — which is not just NULL but the physical address of a page full of zeros. You can set the nominal permissions in such a PTE according to the caller's request, without actually having to allocate a page for that PTE. You can even set the actual permissions on such a PTE to PTE_P, allowing the user process to read this all-zero page. You cannot enable PTE_W in such a page mapping, of course, since that would allow the user process to scribble on this page that's only ever supposed to hold zeros. But if the user does try to write to a PTE_ZERO page with nominal permissions of SYS_READ | SYS_WRITE, just make sure your page fault handler knows how to make a new, exclusive copy of the zero page just as it would copy a read-shared page from a copy-on-write: the only real difference will be in the reference count handling, since PTE_ZERO mappings do not represent counted references. In effect, you are reusing your copy-on-write code to implement demand zero, or clearing "newly allocated" virtual pages on demand only as the user process actually starts writing to them.

Challenge: Enhance the memory operations above so that they all work on arbitrary 4KB page boundaries, not just 4MB page table boundaries as specified above. Also write some testing code to exercise all of these system calls at non-4MB boundaries, including testing "exotic" boundary cases, such as performing memory operations on a memory range that doesn't start or end on a 4MB boundary but is big enough to cover one or more complete 4MB page tables "in the middle" of the region.

Part 5: Virtual Memory Merge

You have seen how xv6 implements the Unix fork() system call, by simply copying the parent process's entire memory segment into a new memory segment for the child process. And now you have implemented, in the context of PIOS's rather different system call API, the same basic copy-on-write mechanism that all modern Unix kernels use to make their implementations of fork() efficient, especially in the common case where the child process actually writes to very few pages before it exec()s another program. Now we will take a step beyond the functionality Unix kernels offer and provide a mechanism to merge a child's results directly back into the parent's address space after executing independently for some (perhaps long) time period.

Recall that Unix's fork() effectively initializes the child process's state with a "one-time snapshot" of the parent process's state at the time of the fork(), but after that time the parent and child processes evolve separately and can communicate only via the small (typically 8-bit) return code that the child returns to the parent on exit(), or else via separate abstractions such as pipes, sockets, or files in the globally shared file system. In order to keep PIOS as simple as possible, however, we wish to minimize the number of abstractions the PIOS kernel needs to implement. Processes need some way to communicate with each other, and since processes need virtual memory in any case, PIOS uses virtual memory instead of pipes or files as the basic abstraction for inter-process communication.

The GET/PUT system call API you implemented above already provides a "coarse-grained" way for processes to communicate: a parent process can use SYS_PUT | SYS_COPY to copy itself or another program, together with suitable input data, into a child process's address space, run the child process, and then use SYS_GET | SYS_COPY to retrieve results the child left in its address space back into the parent's address space for further use. This is a coarse-grained mechanism, however, operating on a minimum granularity of 4MB address space regions (or 4KB pages even if you implement that challenge problem). And the parent has to "know" in exactly which contiguous region(s) the child will deposit its results and what region(s) in the parent's address space to copy those results to, and avoid clobbering any important data in the parent with the result data copied from the child. If the parent and child processes are closely related, the parent may prefer to have the child produce results at a much finer granularity: e.g., to have the child compute new values of a few particular word-size variables scattered throughout the parent's address space, and perhaps intermixed with memory areas that the child is not expected to modify — but which the parent itself or other children might modify in the meantime. Providing such a capability is the purpose of PIOS's virtual memory merge facility. Don't be afraid of the fancy-sounding name: it's not much more complicated than plain copy-on-write copying, but requires some explanation since it's likely to be an unfamiliar concept.

Threads Without Heisenbugs

In modern Unix kernels, tightly-coupled parallel activities would typically be implemented by multiple threads sharing the same address space, rather than by multiple processes. The motivation for threads is partly one of efficiency, and partly one of convenience: namely, threads provide exactly the capability described above, of allowing a program to "fire off" a parallel activity that computes certain values or updates certain variables that may be at arbitrary, scattered locations in the shared address space. But conventional threads come with the severe danger of heisenbugs. Because the memory access interleaving that the program sees among different threads depends on the relative execution speed of the different threads, which can vary from one run to another depending on a huge variety of subtle nondeterministic factors, the slightest error in synchronization is likely to lead to wild behavior that can be extremely difficult to reproduce and debug.

Instead of providing the heisenbug-ridden thread and shared memory abstractions of conventional kernels, therefore, PIOS instead simply enhances its support for independent processes so as to provide the convenience of threads without the nondeterminism and resulting susceptibility to heisenbugs. PIOS's virtual memory merge facility provides this capability by allowing a parent process to snapshot a child process's address space before execution, run the child process, then merge only the differences between the reference snapshot and the child's final virtual memory state back into the parent process. This mechanism operates not at a 4MB or 4KB granularity but at a machine word granularity. Thus, if the child modified a particular variable in the merged region since the parent forked it off, the new, modified value will appear back in the parent's address space at the corresponding location after the merge, allowing the parent to compute further with the new value or propagate the new data to future child processes it forks off. But for variables that the child did not change during its execution, the merge leaves the corresponding location unaffected in the parent process — even if the parent process itself, or some other child that also merges its results into the parent before or after the child in question, happens to modify that variable.

Thus, as long as the parent and its children are careful to work together and avoid modifying conflicting locations in the merged address region, each child can compute and return results in variables arbitrarily scattered throughout the parent's address space, just as threads could do in conventional operating systems. Unlike threads in conventional systems, however, these computed results won't appear in the parent's address space at apparently random times dependent on the child thread's execution speed, but will instead appear in the parent only at the instant the parent explicitly requests that those results appear, by issuing the merge operation as part of the GET system call. The point at which the parent synchronizes with the child (namely the SYS_GET system call), and the point at which child synchronizes with the parent (due to a SYS_RET system call or an exception from user mode), are both deterministic results of the respective processes' local, single-threaded executions and independent of relative execution speeds, and the virtual memory merge operation is similarly deterministic in its behavior. For this reason, PIOS's equivalent of "threads" does not suffer from the nondeterminism of conventional threads, ensuring that the result of any computation, correct or not, can always be exactly reproduced. PIOS's deterministic "threads" do come with a potential performance cost due to additional virtual memory and copying operations, of course: most convenient, high-level abstractions that kernels provide come with some performance cost, and the question is then whether the abstraction's convenience justifies its cost. For more information about this somewhat unusual programming model, see this short draft paper on Deterministic Consistency.

Merge API

PIOS's API support for virtual memory merge consists of two operations briefly mentioned earlier: When performing a merge, the kernel need not necessarily compare every word of every page in the merged part of the child's address space with the corresponding word in the child's reference space. Since the child's reference space was merely a copy-on-write snapshot of the child's working space from some time before the merge, pages in the working space that the child never wrote to at all will still contain the same, read-shared page mappings as in the reference space. Thus, on such pages, the kernel merely needs to compare page table entries, and need not delve into actual page contents unless the working page has been modified (and thus un-shared) with respect to the original reference page.

Exercise 11. Implement SYS_SNAP and SYS_MERGE as described above. We have provided the skeleton functions pmap_merge() and pmap_mergepage() in kern/pmap.c to provide an outline for how to implement the merge function. The testvm program contains several parallel computations to exercise and test your merge implementation.

Hint: Keep in mind that SYS_SNAP is essentially just a pmap_copy(), and that SYS_MERGE requires the source and destination memory range to be aligned on 4MB boundaries, which should greatly simplify your page directory and page table traversal logic.

Challenge: The obvious way to merge a page of the child's space back into the parent at a word granularity is to iterate through the page a word at a time, reading and comparing each word in the child's working page with the corresponding word in the reference page, and if different, copying the word into the parent's page. This approach is not the most efficient, however, because (a) it merges only one word at a time, and (b) it involves branches, which limit the performance of modern processors by decreasing potential parallelism and introducing pipeline bubbles.

Intel's Streaming SIMD Extensions (SSE) instructions can be used to merge pages in 128-bit rather than word-sized chunks, and without using any branches other than the one that iterates the loop (and that branch is "mostly harmless" because it's highly predictable). Not only that, but using this approach, we can implement not just word-granularity merges but byte-granularity merges with no performance penalty. For example, if a parent process creates a char[] array, and forks one child process to compute even elements of the array and a second child to compute odd elements of the array, a word-granularity merge will yield incorrect results due to false conflicts between changes to bytes within a word, while a byte-granularity merge will provide the desired behavior.

Implement a byte-granularity, branch-free merge using SSE. To use SSE instructions at all, you will need to modify a control register or two during kernel initialization to enable the SSE extensions: see the IA-32 System Programming Guide for details. To implement the merge itself, some particular SSE instructions you will probably want to look at include PCMPEQB, PAND, PANDN, and POR. Use the processor's timestamp counters (RDTSC instruction) or some other timing mechanism to compare the performance of your SSE-based merge against the performance of a naïve word-by-word merge.

This completes the lab.