We have ourselves a basic kernel now. We can execute code, manipulate the contents of the screen, and really, that's about it --- because we can't control what memory we can access just yet. That will be the subject, and focus, of this set of notes: how can we manage and control access to memory?

Physical page management

Before we can start managing the virtual address space, we need to manage the physical memory available on the computer. Why? Because to use a new range of virtual memory, we need to have some physical memory to "back" it with; and so a prerequisite for virtual memory management is physical page management.

By "physical page management", I mean the ability to request a new previously-unallocated page, and the ability to return a page to the unallocated state. I'll refer to these actions as request and release.

In order to provide a new page upon a request, we need to know what pages are currently available for use. The obvious way to do this is to simply store a flat list of available pages; however, at 8 bytes per page address, this introduces a rather significant overhead: per GB of physical memory on the machine, we need 2MB to store the list of available pages. That's not unreasonable, per se, especially as the size of this list shrinks as more memory is used. However, we can do better.

Suppose that, instead of a flat list, we use a linked list. While this is actually worse in terms of overhead (we need two pointers per page now, so 4MB per GB of RAM!), it turns out we can be a little clever -- we can store the next pointer in the first eight bytes of each unused page itself. Ta-da! No memory overhead, constant-time request and release fulfillment, and not very complicated code. It's also possible, through a small amount of effort, to turn this into a wait-free data structure for when we get around to supporting multiple processors. Very useful!
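As a concrete illustration, here's a minimal sketch of that free-list approach (the names request_page/release_page/free_list_head are mine, and it assumes the free pages are writable at the addresses being stored, e.g. because physical memory is identity-mapped):

/* Each free page stores the address of the next free page in its first 8 bytes. */
static uint64_t free_list_head = 0;

/* release: push a page back onto the free list. */
void release_page(uint64_t page_addr) {
    *(uint64_t *)page_addr = free_list_head;
    free_list_head = page_addr;
}

/* request: pop a page off the free list; returns 0 if nothing is available. */
uint64_t request_page(void) {
    uint64_t page_addr = free_list_head;
    if(page_addr != 0) {
        free_list_head = *(uint64_t *)page_addr;
    }
    return page_addr;
}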

There are some issues with this approach, though; for example, it's difficult to sort the pages into contiguous regions for when we want 2MB page support.[1] Unless you're careful, it also plays merry havoc with caching when we're requesting multiple pages within a short timespan. For the moment, though, it's simple enough that we'll run with it.

Physical memory map

There's a problem that we haven't really addressed yet. As was mentioned previously (and as you'll see in the next set of notes), several pieces of hardware use memory-mapped I/O. That is to say, certain ranges of the physical address space are actually used for communication, not storage. How do we know which address ranges are used for storage, and which for communication? We need a map of the physical address space.

Luckily, part of the Multiboot specification is that the bootloader must provide this information. The provided assembly wrapper passes a subset of it as the first argument to kmain(): a list of pairs of uint64_ts, where the first of each pair is the address and the second is the size. The wrapper filters out the non-available portions of physical memory and passes on only those that are usable. Note that this will include the physical memory that the kernel is loaded to --- we'll need to figure out exactly where that is, so that the physical memory manager doesn't accidentally hand it out for use elsewhere. Kernel code being overwritten is bad. Unless you're hotpatching.

If you use the reference linker script, you can get the starting and ending physical addresses of the kernel from the symbols kernel_pbase and _data_phy_end.
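Putting those two pieces together, here's a sketch of populating the free list from that memory map while skipping the kernel's own pages, using the release_page() sketched earlier. The entry-count argument is an assumption on my part (check how the wrapper actually delimits the list); kernel_pbase and _data_phy_end are the linker-script symbols just mentioned:

extern char kernel_pbase[];      /* first physical byte of the kernel */
extern char _data_phy_end[];     /* first physical byte after the kernel */

void populate_free_list(uint64_t *memory_map, uint64_t entries) {
    uint64_t kernel_start = (uint64_t)kernel_pbase;
    uint64_t kernel_end = (uint64_t)_data_phy_end;

    for(uint64_t i = 0; i < entries; i ++) {
        uint64_t base = memory_map[2*i];
        uint64_t size = memory_map[2*i + 1];
        for(uint64_t page = base; page + 0x1000 <= base + size; page += 0x1000) {
            /* don't hand out the pages the kernel itself occupies */
            if(page >= kernel_start && page < kernel_end) {
                continue;
            }
            release_page(page);
        }
    }
}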

Paging structures

Now, let's revisit the topic of how the paging data structures actually work. To restate the problem that we're trying to solve: we want each program to behave as if it were the only program running on the CPU, with an entirely separate memory address space. The way we do this is by setting up a mapping between the "virtual" address space of the program and the actual physical memory of the computer. Since a byte-by-byte mapping is rather overkill (and expensive to store!), memory is divided into 4KB chunks called pages, and instead a mapping from virtual pages to physical pages is provided, something like this:

[Figure: a mapping from virtual pages to physical pages]

For the sake of conciseness, I'll represent this in future figures like so (i.e. the virtual pages, with the physical page index written inside each):

[Figure: the same mapping, with each virtual page labelled by its physical page index]

However, the number of pages is huge (about 2^36, or roughly 69 billion) and so a linear vector map of virtual page indices to physical page indices is simply not feasible. As such, a fixed-depth segment tree is used instead. That is to say, we turn the previous figure into something like this instead:

[Figure: the same mapping represented as a fixed-depth tree]

While at first this seems even worse -- not only do we have to store all the mapped elements, but we also have to store the rest of the tree structure! -- it has one major advantage: it doesn't have to be a complete tree. So we can remove internal nodes and hence mark entire regions of memory inaccessible at the same time, something like:

[Figure: the tree with unmapped subtrees removed]

When walking the tree, we can see at a much higher level that those pages are not mapped. This is excellent, as it caters to our current situation: the vast majority of the pages are not going to be mapped.

I mentioned above that it was a fixed-depth segment tree.[2] What kind of depth are we talking about, exactly? It depends on how you count "depth", but I call it 4. At the lowest level, we have pages (the square nodes in the trees above); above them we have page tables (circular nodes). Above those are page directories (triangles), then page directory pointer tables (diamonds), and finally there's the PML4 ("Page Map Level 4") table, which isn't shown in the trees above.

You'll also see the entries of these structures referred to as PTEs, PDEs, PDPTEs, and PML4Es; that is, "Page Table Entry" and so forth. I personally prefer this nomenclature, for a reason that will hopefully be clear in a moment.

Now that you hopefully can see how the page tables are organized, let's talk about how they're actually represented. Each level of the paging structure (PT, PD, PDPT, PML4) contains 512 8-byte entries (PTE/PDE/PDPTE/PML4E). Each entry contains both a physical address and a group of flags, packed into a single 64-bit value. This is possible because the smallest unit of memory is a 4KB page (which, conveniently, is also the size of a paging structure: 512 * 8 = 4096 bytes), which means that the low 12 bits of the target address are always zero (2^12 bytes = 4KB). So there is space for 12 flags.
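In other words, an entry is just a 64-bit integer: the page-aligned physical address OR'd together with its flag bits. A tiny sketch of pulling the two apart again (function names mine; this ignores the one bit stolen from the top of the entry, which we'll get to in a moment):

uint64_t entry_address(uint64_t entry) { return entry & ~0xfffull; }   /* the physical address */
uint64_t entry_flags(uint64_t entry)   { return entry &  0xfffull; }   /* the low 12 flag bits */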

However, to make things a little more complicated, there's also the concept of 2MB and 1GB pages. Essentially, since each page table stores 512 references to 4KB pages, the folks at Intel noted that sometimes people would want to, you know, use all 512 of those pages at the same time. Out comes the concept of a 2MB page (4KB * 512 = 2MB): instead of having a PDE point to a page table (to point to 512 different pages), you can have a PDE point to a region of memory to use as a 2MB page. The same thing applies to 1GB pages (2MB * 512 = 1GB) for PDPTEs.

So, I said there was space for 12 flags. Not all of that space is used, and a bit is actually stolen from elsewhere, which I think will lead to issues if we ever move to a full 64-bit virtual address space, but whatever. Here's the format of the various paging structures, straight from the Intel SDM (Figure 4-11); note that M is the number of physical addressing bits, which is 39 on my processor but will vary between processors:

[Intel SDM Figure 4-11: formats of the paging-structure entries]

Let's go through the most important of these in order.

The P bit (bit 0) is the present bit; nothing governed by an entry is mapped unless it is set. The R/W bit (bit 1) controls whether the governed region is writable, and the U/S bit (bit 2) whether it is accessible from userspace. PWT and PCD (bits 3 and 4) are cache-control bits, discussed further down. The A (accessed) bit (bit 5) is set by the processor when the governed memory is used, and the D (dirty) bit (bit 6) when it is written to.[3] The PS bit (bit 7 of a PDE or PDPTE) selects the 2MB and 1GB page sizes described above; in a PTE, bit 7 is instead the PAT bit (see the caching section below). The G bit (bit 8) marks a mapping as global, which matters for TLB flushes (also below). Finally, the XD bit (bit 63), the bit stolen from elsewhere, marks the governed region as non-executable.

The rest of the bits are currently ignored. However, to maintain forwards compatibility with future changes, you should write back any values that you find there.

Finally, there's one point that we haven't touched on yet: how does the processor know where to find the PML4? The answer is that the physical address of the 4KB-aligned PML4 should be placed in the cr3 control register. This can be done with the following inline assembly helper functions:

uint64_t read_cr3(void) {
    uint64_t value;
    /* Intel-syntax inline assembly; with GCC/Clang this needs -masm=intel. */
    __asm__("mov rax, cr3" : "=a"(value));
    /* AT&T syntax: __asm__("mov %%cr3, %%rax" : "=a"(value)); */
    return value;
}

void write_cr3(uint64_t value) {
    /* volatile + "memory": the write must not be optimized away, and it changes
       what the rest of the code sees in memory. */
    __asm__ __volatile__("mov cr3, rax" : : "a"(value) : "memory");
    /* AT&T syntax: __asm__ __volatile__("mov %%rax, %%cr3" : : "a"(value) : "memory"); */
}

Clarifying example

Let's say that you have cr3 = 0x1000. This means that the page that contains the (single) PML4 is located at 0x1000. Suppose further that we want to map the virtual 4KB page 0xabcde000 to the physical address 0xfedcb000.

The PML4 governs a full 256TB of the virtual address space, so each of its 512 entries represents 512GB. Since 0xabcde000 falls into the lowest 512GB of the virtual address space, it follows that the PDPT that controls 0xabcde000 is the one referenced by the first PML4E (index 0). Each PDPTE (and hence each PD) represents a single GB; because 0xabcde000 lies within the third gigabyte, the PDPTE at index 2 is the governor.

However, within that gigabyte, 0xabcde000 is at an offset of 0x2bcde000, or slightly over 700MB in. Since each PDE (and hence each PT) governs 2MB, it is controlled by the PDE at index 350 (0x15e). Within the PT, the offset from the beginning of the governed range is 0xde000, and so -- because each page is 0x1000 bytes long -- it falls under the PTE at index 222 (0xde).

Note that the index into each paging structure is determined by nine bits of the virtual address. So, to access the virtual address 0xabcde000, the paging structures are walked as follows: PML4 entry 0, then PDPT entry 2, then PD entry 350, then PT entry 222, which finally gives the physical page.

This final PTE will now contain something along the lines of 0xfedcb003 --- marking it as backed by the physical address 0xfedcb000, and as both present and writable.

Suppose, for the moment, that this is the only physical page mapped into memory. Suppose further that while the PML4 is located at address 0x1000, we also have the single required PDPT at address 0x2000, the required PD at 0x3000, and the required PT at 0x4000; we then note that all values of the paging structures would be zero except for the following 8-byte values:

[0x1000] = 0x2003       (phy. addr 0x2000, present, writable)
[0x2010] = 0x3003       (phy. addr 0x3000, present, writable)
[0x3af0] = 0x4003       (phy. addr 0x4000, present, writable)
[0x46f0] = 0xfedcb003   (phy. addr 0xfedcb000, present, writable)

This is because when you take the value 0xabcde000 and split it into the nine-bit-wide indices, you get:

(0xabcde000 & 0xff8000000000) >> 39 = 0x000 = 0
(0xabcde000 & 0x007fc0000000) >> 30 = 0x002 = 2
(0xabcde000 & 0x00003fe00000) >> 21 = 0x15e = 350
(0xabcde000 & 0x0000001ff000) >> 12 = 0x0de = 222

Hence the indices into the various paging structures.
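In code, extracting these four indices from a virtual address is just shifting and masking; a small sketch (the function name is mine):

void paging_indices(uint64_t virt_addr, uint64_t out[4]) {
    out[0] = (virt_addr >> 39) & 0x1ff;   /* PML4E index */
    out[1] = (virt_addr >> 30) & 0x1ff;   /* PDPTE index */
    out[2] = (virt_addr >> 21) & 0x1ff;   /* PDE index */
    out[3] = (virt_addr >> 12) & 0x1ff;   /* PTE index */
}

For 0xabcde000 this produces {0, 2, 350, 222}, matching the table above.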

Paging structures provided by Multiboot wrapper

If you're using the Multiboot wrapper provided in the reference code repository, the following paging structures are set up:

Please note that to remove (or change) these mappings, you'll have to have support for 2MB pages in your virtual memory manager.

Page structure caching

As you can imagine, reading all of these paging structures can get expensive, introducing a not-so-unnoticeable overhead for the use of virtual memory. However, most problems in computer science can be solved via the application of a cache, and this happens to be one of them.

The CPU has the "translation-lookaside buffer" (TLB), which stores the results of previous paging-structure resolutions. In particular, it caches PDPTEs, PDEs, and PTEs. Unfortunately, no guarantees are made for consistency between the paging structures and the TLB; the kernel is responsible for that.

There are two methods of controlling the contents of the TLB. The first is reloading the contents of cr3; doing so clears all TLB entries (except those for pages marked as global). The second, more targeted, approach is the use of the invlpg instruction, which invalidates the TLB entries for a single page. This will be important later on, when we get to actually implementing a virtual memory manager.
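For example, in the same style as the cr3 helpers above (Intel syntax; the function names are mine):

/* Invalidate any TLB entries for the page containing virt_addr. */
void invalidate_page(uint64_t virt_addr) {
    __asm__ __volatile__("invlpg [rax]" : : "a"(virt_addr) : "memory");
    /* AT&T syntax: __asm__ __volatile__("invlpg (%%rax)" : : "a"(virt_addr) : "memory"); */
}

/* Flush the TLB wholesale by reloading cr3 (global-page entries excepted). */
void flush_tlb(void) {
    write_cr3(read_cr3());
}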

Memory caching and write-combining behaviour

The three cache-control bits in the paging structures (PCD, PWT, and PAT) can be used in one of two ways: with the PAT, or without it. This is confusing, because the PAT in question is not the PAT bit in the paging data structures; it's the Page Attribute Table, a separate table that those three bits index into. We'll talk about the PAT here.

An important aspect of modern processors is the cache hierarchy; caches are used, as you would expect from the name, to locally mirror the contents of memory for faster access times. x86_64 Intel systems are set up as follows with respect to caching. There are up to three levels of cache (L1, L2, and L3), plus the main memory; L1 caches are typically extremely small (32KB each for data and code), while L2 caches are usually larger (256KB or so). L3 caches, if present, are much larger (6MB or so). To give you an idea of the speed differences between the caches, on my personal laptop the access bandwidths are roughly 155GB/s, 38GB/s, and 28GB/s for the L1, L2, and L3 caches, respectively. Main memory has an access bandwidth of about 16GB/s.

Caches are divided into cache lines, which are essentially the unit of measure for data placed in the caches. 64 bytes is a common size for modern processors. The act of filling a cache line with contents from memory is called a cache line fill (or cache fill).

However, it's not only bandwidth that's important --- we also have to consider latency. This is where caches really shine; my laptop has RAM with CAS timings of 9-9-9-24, which means there is about a 26ns turnaround between realizing that the memory is desired and actually receiving the content. To put things in perspective, that is roughly 62 processor cycles. So consider the time taken to execute this code (the lfence ensures that all loads have finished, and hence no out-of-order execution can follow until the read has completed):

mov     rax, qword [rsp + 8]
lfence
mov     rbx, rax

This would take about 64 cycles to execute if qword [rsp + 8] was in main memory. If it were in the L1 cache instead, it would be more like 3 cycles. That, as a professor of mine put it once, is significantly better.

Since memory access latencies are high -- at least, when compared to processor cycle times -- the concept of a cache is useful. However, there are some issues with cache coherency. In order to maintain the low latency, L1 and L2 caches are typically housed on-die on a per-processor basis. If you have two processors on a computer (and hence two different L1 caches) you can run into this problem (assume memory contains all zeroes to start with, and that both processors touch the same 64-byte cache line):

  1. Processor 0 reads the line (a cache fill), then writes a value to byte 0 of its copy.
  2. Processor 1 reads the same line, then writes a value to byte 8 of its copy.
  3. Processor 1's copy of the line is evicted and written back to memory.
  4. Processor 0's copy of the line is evicted and written back to memory.

Now main memory has just the values that Processor 0 wrote (byte 8 is zero again), because each write-back overwrites all 64 bytes of the line. Obviously, this is problematic.

Cache coherency is solved on x86_64 systems by broadcasting writes across the processor bus. That is to say, as soon as Processor 1 writes into its cache, Processor 0 receives a message saying that its copy has been invalidated, and so it will re-read the memory at that point.[4] As you can guess, this leads to inefficient behaviour when two processors are continuously writing to the same memory region.

There is also another caching-type mechanism present on x86_64 systems: the write combining buffer. This is, essentially, a way to cache writes so that multiple writes can be written out at once. This is, ultimately, only important due to the way SDRAM operates; while writing one address is as high as 26ns on my laptop, writing two consecutive addresses is only about 27ns. So combining adjacent writes has a big advantage; a similar advantage persists even with non-adjacent writes, where two non-adjacent writes on the same bank and row can be done in roughly 39ns.

There are six different caching behaviours that one can select between on x86_64. They are:

UC   (uncacheable): nothing is cached; every read and write goes straight out onto the bus, in program order. This is what you want for most memory-mapped I/O.
UC-  (uncacheable minus): the same as UC, except that the MTRRs can override it to WC.
WC   (write combining): reads are uncached, but writes may be buffered and combined (via the write-combining buffer described above). Useful for things like framebuffers.
WT   (write through): reads are cached; writes update the cache line and are also written through to memory immediately.
WP   (write protected): reads are cached; writes are propagated to the bus and invalidate the corresponding cache lines on other processors.
WB   (write back): normal caching; writes go to the cache and are only written back to memory when the line is evicted or flushed.

There are different ways to choose which caching behaviour is in use. The method I will describe, and the one I suggest you use, is the PAT (Page Attribute Table). With the PAT, the PWT, PCD, and PAT bits (LSB-first, so the index is PAT*4 + PCD*2 + PWT) describe an index into an eight-element table (the PAT). The table itself is stored in a 64-bit MSR (the IA32_PAT MSR) that has the following format (Intel SDM figure 11-9):

[Intel SDM Figure 11-9: layout of the IA32_PAT MSR]

The default values for PA0 through PA7 are

PA0:    WB  (6)
PA1:    WT  (4)
PA2:    UC- (7)
PA3:    UC  (0)
PA4:    WB  (6)
PA5:    WT  (4)
PA6:    UC- (7)
PA7:    UC  (0)

Note that there is, by default, no WP entry in the PAT. Since the two caching modes you're most likely to need (besides the default WB) are UC and WT, you don't have to modify the PAT for the most part; just use indices 3 and 1 respectively. For reference, here is the complete list of PAT entry types: (all values not specified here are reserved)

UC:  0
WC:  1
WT:  4
WP:  5
WB:  6
UC-: 7
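Selecting a caching mode for a given 4KB page, then, just means setting the corresponding PWT/PCD/PAT bits in its PTE. A small sketch (the function name is mine; for 2MB and 1GB pages the PAT bit moves from bit 7 to bit 12, which this ignores):

/* Turn a PAT index (0-7) into the cache-control bits of a 4KB PTE:
   PWT is bit 3, PCD is bit 4, and PAT is bit 7. */
uint64_t pte_cache_bits(int pat_index) {
    uint64_t bits = 0;
    if(pat_index & 1) bits |= 1ull << 3;   /* PWT */
    if(pat_index & 2) bits |= 1ull << 4;   /* PCD */
    if(pat_index & 4) bits |= 1ull << 7;   /* PAT */
    return bits;
}

With the default PAT above, pte_cache_bits(3) selects UC and pte_cache_bits(1) selects WT.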

Let's talk about how to actually write a virtual memory manager, then, shall we? Enough background information.

Writing a virtual memory manager

A virtual memory manager can really be as simple as some code that provides two operations: map(phy_addr, virt_addr, flags) and clear(virt_addr). Oftentimes you'll want a bit more code around this, but that's the core functionality that we'll need for the moment.

To implement the map() functionality, you'll need to build the paging structures appropriately. clear() is significantly easier. The trick, in my opinion, is setting up some helper functions to make these implementations easier.

The helper function I like the most is uint64_t *get_entry_for(address, level). This returns a pointer to the appropriate paging structure's entry for the given address at the given level. It's used like this:

uint64_t *pml4e = get_entry_for(0xc001cafe, 3);
uint64_t *pdpte = get_entry_for(0xc001cafe, 2);
uint64_t *pde   = get_entry_for(0xc001cafe, 1);
uint64_t *pte   = get_entry_for(0xc001cafe, 0);

Note that it is possible to implement get_entry_for() recursively, though I personally prefer the iterative version. A sketch of the implementation:

uint64_t *get_entry_for(uint64_t address, int level) {
    uint64_t index = address / 0x1000;
    uint64_t indices[4] = {
        index % 512,            /* PT index */
        (index >> 9) % 512,     /* PD index */
        (index >> 18) % 512,    /* PDPT index */
        (index >> 27) % 512     /* PML4 index */
    };
    /* cr3_contents holds the current value of cr3 (e.g. from read_cr3()). */
    uint64_t *table = (uint64_t *)(cr3_contents & ~0xfff);   /* the PML4 */
    for(int l = 3; l >= 0; l --) {
        uint64_t *entry = &table[indices[l]];
        if(l == level) {
            return entry;
        }
        // if the present bit (P) is clear and we have more levels to walk, then error.
        if((*entry & 1) == 0) {
            return NULL;
        }
        // TODO: check for the size (PS) bit here, for 2MB/1GB pages
        table = (uint64_t *)(*entry & 0x000ffffffffff000);   /* strip the flag bits */
    }
    return NULL;
}

Please note that this does not work as written: it follows the physical addresses stored in the paging structures, and those structures will not, in general, be mapped into virtual memory at the same addresses. Exactly how to modify it so it works in your kernel will depend on the design you've chosen.
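To tie things together, here's roughly what map() and clear() look like on top of get_entry_for(). Again, this is only a sketch: it assumes the intermediate paging structures for the address already exist (a real map() would allocate missing page tables from the physical page allocator), and it uses the invalidate_page() helper from the TLB section above:

int map(uint64_t phy_addr, uint64_t virt_addr, uint64_t flags) {
    uint64_t *pte = get_entry_for(virt_addr, 0);
    if(pte == NULL) {
        return -1;   /* one of the intermediate paging structures is missing */
    }
    *pte = (phy_addr & ~0xfffull) | flags | 1;   /* 1 = present bit */
    invalidate_page(virt_addr);                  /* keep the TLB consistent */
    return 0;
}

void clear(uint64_t virt_addr) {
    uint64_t *pte = get_entry_for(virt_addr, 0);
    if(pte != NULL) {
        *pte = 0;
        invalidate_page(virt_addr);
    }
}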

There are a couple of decisions that you'll need to make in order to implement your virtual memory manager:

I suggest that, for the moment, you go the simplest possible route and then make things more complicated later. Leave yourself wiggle room in your API so you can do this properly.

Heaps

Well, now that we have the ability to back arbitrary virtual addresses with physical memory, let's do something with that. Shall we write a heap? No, not the data-structures binary-heap kind, the malloc/free kind.

As you probably know, the heap is where you put long-lived objects that should exist outside the lifespan of a function invocation. It's extremely useful, and programming without a heap feels like trying to program with only one hand. Yes, you can do it. No, it's neither fun nor quick.

So, let's start off with the basics. Where do we put the heap, for one? The answer is pretty simple: wherever you like! You're writing the kernel, you have full control over where things go. I suggest picking a high memory address like 0xffffc00000000000 as the kernel heap starting location, but you are of course free to choose whatever you like. Just, please, don't put it at address zero. NULL being a valid pointer makes my head hurt sometimes.

Allocators

There are two types of allocators that I'd suggest you consider. To start with, the "watermark" allocator is probably a good one; it can be summed up as "there is no free." That is to say, to allocate something, you just grab the next chunk of memory after the previous allocation. To free something? Well, you don't.

The watermark allocator is horrible and you should never use it in anything past testing purposes. But, really, it's good enough for the moment, and I'd suggest implementing it and moving on, then revisiting this topic in a couple of weeks when a good allocator turns out to be necessary.
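As a sketch, the whole thing fits in a few lines (names mine; the heap base is the 0xffffc00000000000 suggested above, and it assumes the memory behind the watermark is already mapped, or that you call map() as it grows):

static uint64_t watermark = 0xffffc00000000000ull;

void *kmalloc(uint64_t size) {
    void *result = (void *)watermark;
    watermark += (size + 15) & ~15ull;   /* keep allocations 16-byte aligned */
    return result;
}

void kfree(void *ptr) {
    (void)ptr;   /* "there is no free" */
}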

The other allocator I'll recommend for the moment is the pool allocator. The basic idea is to reuse the same approach that was outlined above for the physical page manager. Create a list of pools, such that each pool can satisfy malloc requests for a particular element size. Then we can simply resize each pool by adding more memory whenever a pool runs out.
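Here's a sketch of the pool idea, with one free list per power-of-two size class. pool_grow() is left as a stub: it would grab a fresh page (physical allocator plus map()) and chop it into blocks of that class's size, pushing each onto the list:

struct free_block { struct free_block *next; };

static struct free_block *pools[8];   /* size classes 16, 32, ..., 2048 bytes */

static void pool_grow(int class);     /* maps a new page and carves it into blocks */

static int class_for(uint64_t size) {
    int class = 0;
    for(uint64_t block = 16; block < size; block <<= 1) {
        class ++;
    }
    return class;
}

void *pool_alloc(uint64_t size) {
    if(size > 2048) {
        return NULL;   /* larger requests need a different path */
    }
    int class = class_for(size);
    if(pools[class] == NULL) {
        pool_grow(class);
    }
    struct free_block *block = pools[class];
    pools[class] = block->next;
    return block;
}

void pool_free(void *ptr, uint64_t size) {
    int class = class_for(size);
    struct free_block *block = ptr;
    block->next = pools[class];
    pools[class] = block;
}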

The GDT

The last memory-related topic to talk about for now is that of the GDT, or Global Descriptor Table. This is a very small part of a much larger topic, that of protection and segmentation; we'll only touch on the bare essentials for now, because we won't care about most of this until we get around to implementing userspace.

Essentially, there are different ways of looking at the contents of memory: it can be considered as a source of code, as a stack, or as data. The distinction these days (especially on x86_64) is not very important, since our registers are wide enough to address all of memory directly. However, in the days of yore (i.e. the 8086), one had a 20-bit address bus but only 16-bit registers. The problem was partially solved with the concept of a segment: essentially, address 0 in one segment is not necessarily the same as address 0 in another segment. Segments also had limits, i.e. enforced maximum addresses. This was a precursor to paging, allowing limited viewpoints onto physical memory.

These days, however, except for two special cases (fs and gs, discussed when we get to SMP) segmentation has no base offsets or limit checking. "What use are segments, then?" you might ask. "Not much" is the answer. However, segments are used for a couple of other things (e.g. the TSS, see next notes) and mostly have to be kept around for backwards compatibility.

Segmentation on real-mode x86 works differently from long-mode x86_64; we'll detail the latter and leave the former up to you to read about if you're curious. The GDT stores a list of segment descriptors (hence the name "descriptor table"); each descriptor has a handful of fields, such as the type (code or data?), the "size" (32-bit or 64-bit?), and the privilege level (0-3, which we'll talk about when we get to userspace). The segment registers (cs, ds, es, fs, gs, and ss) are loaded with byte offsets into the GDT. So the first descriptor would be at offset 0x0, the second at 0x8, and the sixth at 0x28.

The provided Multiboot wrapper sets up a provisional GDT, with three segments: the required NULL segment (the very first element in the GDT has to be an invalid "NULL" segment, to make a segment register loaded with 0x0 invalid), a code segment (at offset 0x8), and a data segment (at offset 0x10, of course). The format of segment descriptor entries has to be one of the weirdest formats ever. Instead of something sensible, we get these, for code and data respectively: (AMD APM Volume 2 Figures 4-20 and 4-21)

[AMD APM Figure 4-20: 64-bit code segment descriptor]

[AMD APM Figure 4-21: 64-bit data segment descriptor]

Greyed-out entries are unused. Yes, you're reading those correctly --- there's only one useful bit on the data segment descriptor, and six bits on the code descriptor, according to the figures. Let's look at these bits in a little more detail.

The P bit is the Present bit. This should be set in order to use the segment. The two DPL bits should both be cleared right now (until we get to talking about protection rings and userspace). The D bit should be clear for 64-bit code. The C bit (conforming bit) is another we'll talk about later, and right now can be either. The L bit determines whether the code in this segment should be interpreted as 32-bit or 64-bit code; set it, as we'd like 64-bit code.

An important thing to note: the AMD manuals specify no value for the W bit, and they suggest that all greyed-out bits should be zero. The Intel manuals, however, specify that the W bit should be set for a writable data segment, or clear for a read-only data segment. Since we'll be using paging to determine writability, you should set the W bit in your data segment descriptor.

Finally, we need to talk about how to tell the processor about the GDT; in particular, the virtual address it's located at and its size in bytes. This is done with the lgdt instruction. However, rather than being sensible, lgdt takes a pointer to a 10-byte region of memory: the last eight bytes are a pointer to the GDT, and the first two bytes are the length of the GDT in bytes, minus one. Here's some code with inline assembly that does that, for clarification:

void load_gdt(uint64_t address, uint64_t addressable) {
    /* addressable is the total size of the GDT, in bytes. */
    uint8_t ptr[10];
    *(uint16_t *)(ptr + 0) = addressable - 1;
    *(uint64_t *)(ptr + 2) = address;
    __asm__ __volatile__("lgdt [rax]" : : "a"(ptr) : "memory");
    /* AT&T syntax: __asm__ __volatile__("lgdt (%%rax)" : : "a"(ptr) : "memory"); */
}
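For instance, a provisional GDT like the wrapper's (null, code, data) could be built and loaded like so; the two non-null values are a common encoding of the bits described above (present plus L set for code, present plus W set for data):

static uint64_t gdt[3] = {
    0x0000000000000000ull,   /* the required null descriptor */
    0x00209a0000000000ull,   /* 64-bit code segment: selector 0x08 */
    0x0000920000000000ull    /* data segment: selector 0x10 */
};

load_gdt((uint64_t)gdt, sizeof(gdt));

Note that loading a new GDT doesn't by itself reload the segment registers; they keep using their previously-loaded descriptors until you reload them.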

Debugging a virtual memory manager

One thing that's worth noting at this stage is that debugging a virtual memory manager can be extremely difficult. I recommend that you familiarize yourself with your VM of choice's debugging facilities, and you may also want to make use of the serial port for debugging, as its output can often be redirected to a file or some other location that remains readable after a reset. As an example, see dserial.c in the refcode repo.

Suggested work

At this point, you should know (roughly speaking) the important parts of how x86_64 virtual memory operates.

What I'd suggest working on before the next set of notes is to implement a simple physical page allocator, and a virtual memory mapper (i.e. the map() and clear() functionality from earlier), in addition to a simple GDT manager that lets you add new segments to the GDT as you want.

I would also recommend implementing a simple heap using your virtual memory manager. A watermark allocator is fine for now, but you'll want a pool allocator or something similar in the long run.


  1. I suggest the use of a modified skiplist and representative page system for solving this particular problem. It's a little complicated (and is O(log n) instead of constant time) and so I'll discuss it later if we have time. 

  2. It's kind of like a segment tree, but not quite, because the leaf nodes are actually indices instead of ranges. However, to me it feels like a segment tree because each internal node represents the root of a subtree containing elements within a particular range, as opposed to a binary search tree where an internal node actually stores data. 

  3. Actually, the dirty bit is set upon the first write to the memory contents, even if the already-stored value is written back to the same location. 

  4. Aha, you might say, but what if two processors execute a write simultaneously? Yes, yes, fine. The actual mechanism is more complicated than this. But this is a good enough approximation; the actual method used is speculative consideration of cache lines filled on other processors and write intentions. See Chapter 11 of the Intel SDM for details.