A brief note before beginning: I use Linux for all my OSDev work. While it is entirely possible to develop on Windows, I will not provide any Windows support. If you are on Windows and get things set up, let me know and I will add the appropriate information to these notes.

These notes cover the basic required background information: some details about the Intel 64 architecture (referred to as x86_64 or IA-32e), how to write Intel-syntax assembly, the use of emulators to run your own kernel, and various other useful bits.

Intel CPU model

First, some notes on the general structure of the Intel CPU model.

General-purpose registers

x86_64-compatible CPUs have 16 64-bit general-purpose registers, in addition to various floating-point and SIMD registers. The general-purpose registers are rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, r8, r9, r10, r11, r12, r13, r14, and r15. Some registers also have particular roles that they are conventionally used for; examples:

- rsp is the stack pointer, implicitly used by push, pop, call, and ret;
- rbp is conventionally the frame (or base) pointer;
- rax receives return values under the usual calling conventions;
- rcx acts as the counter for the loop and rep-prefixed instructions; and
- rsi and rdi are the source and destination for the string instructions.

Most registers can be accessed, in a limited fashion, as 8-, 16-, and 32-bit registers as well. For example, rax can be accessed as:

- eax, its low 32 bits;
- ax, its low 16 bits;
- al, its low 8 bits; and
- ah, bits 8-15.

The same methods of access are available for rbx, rcx, and rdx. The other registers have 32-bit accesses in the form of esi etc. and r8d etc.; 16-bit accesses as si and r8w; and finally, 8-bit accesses as sil and r8b.

There is no way to, for example, access bits 32-63 of rax as a 32-bit register directly. In practice, this does not cause many issues, though it would be useful for optimization purposes. One asymmetry worth knowing about: writing to a 32-bit register zero-extends the result into the full 64-bit register, while 8- and 16-bit writes leave the upper bits untouched.
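The way the narrower registers alias the low bytes of rax can be pictured with a C union; the union and its field names below are mine, purely for illustration of the little-endian layout, and the hardware exposes nothing of the sort:

```c
#include <stdint.h>

/* Mimics how eax, ax, al, and ah alias the low bytes of rax on a
 * little-endian machine. Illustration only, not a hardware interface. */
union gpr {
    uint64_t rax;
    uint32_t eax;
    uint16_t ax;
    struct {
        uint8_t al;
        uint8_t ah;
    };
};
```

Storing 0x1122334455667788 in rax, for example, leaves 0x55667788 visible in eax, 0x7788 in ax, 0x88 in al, and 0x77 in ah.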

Special registers

A vitally important register that is not directly accessible is the rflags register. This contains much information about the current processor status, in addition to storing comparison condition codes (or status flags). The condition codes/status flags available on x86_64 are:

- CF, the carry flag;
- PF, the parity flag;
- AF, the auxiliary carry flag;
- ZF, the zero flag;
- SF, the sign flag; and
- OF, the overflow flag.

The use of these flags is fairly self-explanatory, but you should consult the processor documentation for exact details as to how they are interpreted by different instructions. Another useful flag is DF, the direction flag, which controls the direction in which the string instructions operate.
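To make the flag semantics concrete, here is a C sketch of the condition codes an 8-bit add would produce. The names add8_flags and struct flags are mine; the CPU does all of this in hardware:

```c
#include <stdbool.h>
#include <stdint.h>

struct flags { bool cf, zf, sf, of; };

/* Condition codes for an 8-bit add: CF on unsigned overflow, ZF on a
 * zero result, SF from the sign bit, OF on signed overflow. */
static struct flags add8_flags(uint8_t a, uint8_t b, uint8_t *result)
{
    uint8_t r = (uint8_t)(a + b);
    struct flags f;
    f.cf = r < a;                            /* wrapped around? */
    f.zf = (r == 0);
    f.sf = (r & 0x80) != 0;
    f.of = (~(a ^ b) & (a ^ r) & 0x80) != 0; /* same-sign inputs, different-sign result */
    if (result)
        *result = r;
    return f;
}
```

For example, adding 0xff and 0x01 wraps around to zero, setting CF and ZF; adding 0x7f and 0x01 produces 0x80, setting OF and SF.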

Also available is the rip register, which contains the current instruction pointer. It cannot be written to directly (only indirectly, through control-transfer instructions such as jmp, call, and ret), but it can be used as a base register in rip-relative addressing.

Control registers

x86_64 Intel processors also have six control registers, cr0 through cr4 and cr8. The contents of these registers are mostly mode bits that change how the processor acts and which CPU features are enabled at any point in time. We'll touch on these as we talk more about the CPU features themselves.

Model-specific registers (MSRs)

Intel processors also have registers known as Model-Specific Registers, which are accessed by means of the wrmsr and rdmsr instructions; the MSR index is passed in ecx, and the 64-bit value travels in the edx:eax register pair. These typically contain configuration information about different processor features, à la the control registers. However, while there are a limited number of control registers due to instruction-encoding constraints, MSRs are accessed by means of a 32-bit index.

A reference of all known Intel MSRs can be found in the Intel Software Developer's Manual.

Intel memory model

The 32-bit x86 architecture has a slightly complicated memory model, as (unlike many other architectures) it is segmented. However, much of the complexity has been removed for x86_64 (in particular, segment limits are no more). Privilege checking is still in place, and we'll visit this topic when talking about hardware protection mechanisms in a little while.

For now, though, the important details: the x86_64 architecture uses a flat, byte-addressable 48-bit virtual address space. Why 48-bit? Well, addresses are actually 64 bits wide (values range from 0 to 2^64 - 1), but because 64 bits of addressing is something called "overkill", the 16 most significant bits must all be copies of bit 47; addresses of this form are called canonical.

This is a little odd, because it means the canonical address after 0x00007fffffffffff is actually 0xffff800000000000, but it does leave room for further expansion to a true 64-bit virtual address space in the future.
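The canonicality rule is easy to state in code; here's a small sketch (is_canonical is my name for it -- the CPU enforces the rule itself, this is just the arithmetic):

```c
#include <stdbool.h>
#include <stdint.h>

/* An address is canonical when bits 48-63 are all copies of bit 47,
 * i.e. when the value sign-extends cleanly from 48 bits. */
static bool is_canonical(uint64_t addr)
{
    /* Arithmetic right shift replicates bit 47 into the top 16 bits. */
    int64_t sext = (int64_t)(addr << 16) >> 16;
    return (uint64_t)sext == addr;
}
```

Under this rule, 0x00007fffffffffff and 0xffff800000000000 are both canonical, but nothing in between is.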

Memory is primarily accessed in one of four different widths: a single byte, a "word" (two bytes), a "dword" (four bytes), or a "qword" (eight bytes).[1] Alignment is not required; one can access a qword starting from address 0x31. However, only aligned memory accesses are guaranteed to be atomic. Other properties of an access, such as its caching and ordering behaviour, can vary based on the address, according to tables set up by the operating system.
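Natural alignment for these power-of-two widths is a simple mask check; a one-line sketch (the function name is mine):

```c
#include <stdbool.h>
#include <stdint.h>

/* True when a width-byte access at addr is naturally aligned;
 * width must be a power of two (1, 2, 4, or 8 here). */
static bool naturally_aligned(uint64_t addr, uint64_t width)
{
    return (addr & (width - 1)) == 0;
}
```

So the qword access at 0x31 mentioned above is legal but unaligned, and hence not guaranteed to be atomic.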

There is also the concept of a "segment" in x86_64. This is, essentially, an address offset. The two interesting segments in x86_64 are gs and fs; accessing gs:[0x0] will not necessarily access address zero. We'll revisit this topic when we talk about per-CPU kernel data structures in the SMP section.

Intel-syntax assembly

There are two main syntaxes that one can write x86_64 assembly in. There is the syntax used by GNU as, known as AT&T syntax; there is also the syntax used by nasm (and various other assemblers), usually called Intel syntax. Intel syntax is a little more tied to, well, the Intel x86 platform, whereas AT&T syntax is also used on other platforms. I personally greatly prefer Intel syntax, so it is what I shall be using in these notes. Feel free to translate if you are already comfortable with AT&T syntax.

Registers are accessed as rax, dl, r9, and so forth. Constants are simply numbers with no prefix; decimal and hexadecimal are both supported. (Some assemblers also support binary and octal input.) x86 does not require the use of special memory load/store instructions to manipulate memory; instead, memory accesses can be used in the same way as constants or registers. The notation for such is [address]; the assembler will attempt to determine the size of dereference to make based on the rest of the instruction, but you may need to specify the size explicitly. In those cases, the syntax is size [address], where size is one of byte, word, dword, or qword.

All the usual instructions are present: mov, add, sub, and, or, not, and so forth. Condition codes are set by most arithmetic and logical instructions; and Intel uses (for the most part) two-operand instructions. So add rax, rbx will set rax = rax + rbx. For those used to three-operand architectures like SPARC, this may take some getting used to.

Some example code to maybe clear up the formatting a little:

    push    rbp
    sub     rax, rdi
    mov     rbp, rsp
    sar     rax, 0x3
    mov     rdx, rax
    shr     rdx, 0x3f
    add     rax, qword [rbp + 0x10]
    sar     rax, 1
    jc      .label

A more complete example is the following code, which is a stupidly-inefficient recursive Fibonacci number calculator:

fib:
    ; if parameter <= 1, then return 1
    cmp     rdi, 1
    jg      .rec

    ; return values go in rax
    mov     rax, 1
    ret
.rec:
    ; subtract one from parameter
    dec     rdi
    ; save parameter for later, will be overwritten by recursive call
    push    rdi

    ; recursively-calculate fib(n-1)
    call    fib

    ; grab (n-1) and put it back in rdi
    pop     rdi
    ; calculate (n-2)
    dec     rdi
    ; save the return value of fib(n-1) on the stack
    push    rax

    ; recursively calculate fib(n-2)
    call    fib

    ; put fib(n-1) into rbx
    pop     rbx
    ; add rbx and rax to calculate fib(n) = fib(n-1) + fib(n-2)
    add     rax, rbx

    ret
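For comparison, here is the same stupidly-inefficient algorithm in C. Note that, matching the assembly, it returns 1 for any parameter less than or equal to 1:

```c
#include <stdint.h>

/* Recursive Fibonacci, mirroring the assembly above: the parameter
 * arrives in rdi, and the result is returned in rax. */
static uint64_t fib(uint64_t n)
{
    if (n <= 1)
        return 1;
    return fib(n - 1) + fib(n - 2);
}
```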

Intel reference manuals

To quote Michael Abrash,

Most assembly language programmers don’t bother to read Intel’s manuals (which are extremely informative and well done, but only slightly more fun to read than the phone book), and go right on programming . . .

In all seriousness, the Intel reference manuals are quite well done. Electronic copies are freely available from Intel's website (the URL changes now and then; use your favourite search engine to look for "intel software developer manuals"), and you can get a hardcopy published from Lulu for a reasonable price. I, personally, use both -- the hardcopy for initial reading and understanding, and softcopy for reference. Unless I specify otherwise, when I make a reference to the Intel SDM, I'm referring to Volume 3.

Emulators

Since improperly interfacing with hardware can, in rare circumstances, cause physical damage, we will be using a system emulator. (A handy side-effect is that the development cycle is much quicker.) There are two good choices available for operating system development: Bochs and qemu. qemu tends to have a much higher execution speed than Bochs, but Bochs has better built-in debugging facilities. Until you get your own kernel debugging infrastructure up and running, it may be worth sticking with Bochs.

Ultimately, I recommend using both on a regular basis -- it is oft the case that you reach a corner-case that is handled differently in each emulator, and fixing it early can be a great help.

Bochs

I highly suggest a custom-compiled version of Bochs, as the ones that are present in various Linux distributions sometimes have issues. In the past I have used Bochs 2.6, but any more recent version is probably perfectly fine.

My own configuration line is:

 ./configure --enable-x86-64 --enable-smp --enable-all-optimizations \
    --enable-pci

You will want to create an appropriate .bochsrc file with the configuration for your OS; however, the file format is self-explanatory and it should not be difficult to get set up.
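As a starting point, a minimal .bochsrc might look something like the following; the memory size and image path are illustrative, so adjust them for your own setup:

```
megs: 512
ata0-master: type=cdrom, path="sydi.iso", status=inserted
boot: cdrom
log: bochs.log
```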

qemu

I tend to prefer the use of qemu over Bochs for the simple reason that messing about with creating a CD image with GRUB2 on it is not required, as qemu can boot multiboot-compatible kernels directly, by running qemu-system-x86_64 -kernel kernel.bin.

Compiling qemu from source does not tend to be required, though you will want a version that provides dynamic translation for 64-bit Intel architectures. In particular, you probably want the qemu-system-x86_64 binary.

Generating a bootable CD image

One convenient way to get your operating system into an emulator is to put it onto a CD image, and then get the emulator to boot off of that image. Unfortunately, since we will be producing multiboot kernels rather than self-booting disk images, we need a bootloader stage first.

One common bootloader to use is GRUB, or the GRand Unified Bootloader. Its main advantages, as far as we are concerned, are portability and modularity. An added advantage is that GRUB is already used by the overwhelming majority of Linux installations on IA-32e, allowing easy co-existence.

To create the CD image, you will need an ISO creation program. Personally, I use mkisofs, part of the cdrkit package on Arch Linux. I have also used xorriso successfully in the past.

There are three parts to the ISO image: the bootloader, the kernel image, and whatever other support files you would like to include. The mkisofs arguments I use are:

mkisofs -quiet -R \
    -no-emul-boot -boot-load-size 4 -boot-info-table \
    -A "sydi" \
    -b boot/eltorito.img \
    -o sydi.iso \
    -graft-points kernel.bin=kernel/kernel.bin fs

This generates an image called sydi.iso, with the file boot/eltorito.img used as the El Torito boot image, the file kernel/kernel.bin added as kernel.bin in the root of the CD image, and all other files from the fs/ directory included in the root.

If you are using another ISO creation program, I suggest you find similar arguments.

OS design theory

Almost all operating systems have the concept of a userspace/kernelspace divide. That is to say, there are particular operations that are privileged and happen in the kernel ("kernelspace" operations), and other operations that are initiated by the programs running on the OS ("userspace").

Usually, this is done as a protection mechanism. This prevents userspace programs from interacting with each other (and the hardware!) except in carefully-controlled manners. x86_64 provides hardware support for this kind of protection. We'll revisit this in more detail later.

There are, of course, different types of kernels. Most "classic" kernels (Linux, BSD, early Windows versions, debatably later Windows versions) follow the "monolithic" design principle; here, the general idea is that everything in kernelspace lives inside the same virtual address space. That is to say, everything in kernelspace can read and/or write the same memory. This provides greater efficiency (communicating within the kernel is cheap!) and simplicity (common data structure code and heap memory); the downside is that it is potentially more susceptible to crashes. After all, a rogue bit of code can completely wipe out the rest of the kernel's state.

Quite a few more modern kernels (QNX, Mach, HURD, to name a few) follow a different design principle: that of the "microkernel". The idea of a microkernel is to separate out the different parts of kernelspace into different virtual address spaces, to prevent parts of the kernel from interfering with each other in exactly the same way that the kernel prevents user programs from interfering with each other. (Typically this is implemented using exactly the same mechanisms.) The advantage is stability: for a properly-implemented microkernel, it is impossible for one rogue module to destroy the state of any other parts of the kernel. The disadvantages are, of course, complexity and speed; microkernels typically have extremely high IPC costs.

You'll sometimes see the terms "minikernel" and "nanokernel" thrown around; they're intended to describe different points on the scale of separation (minikernels are between micro and monolithic, and nanokernels take the separation extremely seriously.) These are really just semantic sugar, though, and in my opinion the distinction isn't very meaningful.

Further reading

Closing thoughts

These notes should give you a broad idea of how the Intel architecture is set up, in addition to a couple of points on how we're going to compile and run our kernels. The next set of notes, which will be up soon, will focus more on actually getting a simple kernel up and running.

Until then, happy hacking!

- ethereal


  [1] There are also 16-byte accesses for the XMM registers, 32-byte accesses for the AVX YMM registers, and debatably a 10-byte access for the lgdt and lidt instructions; but we're not interested in these at this very moment.