This is the third article in a multi-article series on Linux program loading. I originally thought it would be a three-article series, but it now seems like I may have enough material for as many as five. To keep things nice and ambiguous, I will call it a 'multi-part' article.

The first article was primarily background information. The second article talked about statically-linked executables, and how to load them. Now, we'll talk about the initial program environment.

What do we mean, initial program environment?

Well, it's a good question. If we're describing something, it helps to outline what we want to describe in the first place . . . So: by 'initial program environment' we mean the following:

Throughout this article, there will be two main references:

If a statement has no associated reference, assume that it is from one of these two references. If it isn't, let me know and I'll back it up properly.

Initial contents of registers

So, let's start with the registers. This one is easy: it's specified in section 3.4 of the ABI spec.

That was simple, right?

Initial contents of memory

Now let's talk about the contents of memory. We need to consider two things: the contents of the initial stack, and the contents of global memory that the kernel is kind enough to set up for us.

Let's start with the stack, because it has simpler, and more applicable, contents.

Stack diagram

The stack contains several useful pieces of information. In order of high addresses to low addresses, these are:

  1. Auxiliary vector.
  2. Environment variables.
  3. Program arguments.

We'll discuss these in reverse order.

Program arguments

The program arguments are passed in on the initial stack, as you might expect. In particular, the very first stack address ([rsp + 0x0]) holds argc, which is eight bytes wide on 64-bit platforms. The argv pointers follow at [rsp + 0x8] through [rsp + 0x8*argc], and [rsp + 0x8*argc + 0x8] holds the NULL pointer that terminates argv.

For example, suppose we invoke a program ./program arg1 2. Then the stack will look like:

rsp + 0x20: NULL (terminates argv)
rsp + 0x18: pointer to string "2"
rsp + 0x10: pointer to string "arg1"
rsp + 0x08: pointer to string "./program"
rsp + 0x00: 3

The strings themselves are also on the stack, but at higher addresses than the auxiliary vector.
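As a rough illustration of that layout, the sketch below reads argc and the argument strings out of a block of memory laid out like the initial stack. The helper names are made up for this example, and it assumes 8-byte stack words, i.e. a 64-bit process.

```c
#include <stdint.h>

/* Hypothetical helpers. `sp` is assumed to point at the word holding argc,
 * which is where rsp points when a 64-bit process starts. */

/* argc lives in the very first word. */
static uint64_t initial_stack_argc(const uint64_t *sp) {
    return sp[0];
}

/* argv[i] is stored i + 1 words above argc. Asking for i == argc returns
 * the NULL pointer that terminates argv. */
static const char *initial_stack_arg(const uint64_t *sp, uint64_t i) {
    return (const char *)(uintptr_t)sp[1 + i];
}
```

Only the program's entry point ever sees the real initial rsp, so the easiest way to play with these helpers is to hand them an array filled in by hand with the five words from the diagram above.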

Environment variables

Immediately after the NULL terminator of argv comes the next region of interest: the environment variable pointers. The variables themselves are what you'd expect: the values you find in the output of env, or set with a NAME=value command in a shell.

These are stored as strings of the form NAME=value, and pointers to these are on the stack; as with argv, these are terminated by a NULL pointer. Let's say we have two environment variables, A=abc and B=bcd. Then the stack would look something like:

rsp + envp_offset + 0x10: NULL (terminates envp)
rsp + envp_offset + 0x08: pointer to string "A=abc"
rsp + envp_offset + 0x00: pointer to string "B=bcd"

Here, envp_offset is the size of argc plus the size of argv, including argv's NULL terminator: 0x8*(argc + 2) bytes in total. As with the program arguments, the strings themselves are also on the stack, but at higher addresses than the auxiliary vector.
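To make the offset arithmetic and the NAME=value format concrete, here is a small sketch. Both function names are invented for this example; lookup_env is only a getenv-style illustration, not how any particular libc does it, and the offset calculation assumes 8-byte stack words.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte offset of envp[0] from the initial rsp: one word for argc, argc
 * words for argv, and one word for argv's NULL terminator. */
static size_t envp_offset(uint64_t argc) {
    return 8 * (argc + 2);
}

/* Hypothetical lookup in the style of getenv(): scan the NULL-terminated
 * list of "NAME=value" pointers and return the value part, or NULL if no
 * entry has that name. */
static const char *lookup_env(char *const *envp, const char *name) {
    size_t len = strlen(name);
    for (; *envp != NULL; envp++) {
        if (strncmp(*envp, name, len) == 0 && (*envp)[len] == '=')
            return *envp + len + 1;
    }
    return NULL;
}
```

For the ./program arg1 2 example, argc is 3, so the first environment pointer sits 0x28 bytes above the initial rsp.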

Environment variables are useful, but not terribly interesting from the point of view of this article. Interesting things start when we consider the auxiliary vector.

Auxiliary vector

The auxiliary vector stores, well, auxiliary information related to a process. It's information that doesn't strictly belong in the environment variables, but is still required. Unlike the program arguments and environment variables -- both of which are passed, however indirectly, from userspace -- the auxiliary vector holds values generated strictly by the kernel.

How is this useful? Well, it's not the sort of thing that most userspace programs/programmers ever have to worry about. But it does play a fairly crucial role in low-level userspace programming, which is where we're living at the moment.

Entries in the auxiliary vector are pairs (type, value) with value being an unsigned 64-bit integer. Strictly speaking, they're actually instances of the structure:

typedef struct {
    long a_type;
    union {
        long a_val;
        void *a_ptr;
        void (*a_fnc)();
    } a_un;
} auxv_t;

Just think of the values as unsigned integers, if it helps. It might make things a little bit simpler.
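Looking up an entry is then just a linear scan until the terminator. Here is a minimal sketch of that; auxv_find is a name invented for this example, and the structure is repeated only so the block compiles on its own.

```c
#include <stddef.h>

/* Same shape as the structure above, with a_fnc's prototype spelled out. */
typedef struct {
    long a_type;
    union {
        long a_val;
        void *a_ptr;
        void (*a_fnc)(void);
    } a_un;
} auxv_t;

#define AT_NULL 0

/* Hypothetical helper: return the first entry of the given type, or NULL
 * if the AT_NULL terminator is reached first. */
static const auxv_t *auxv_find(const auxv_t *auxv, long type) {
    for (; auxv->a_type != AT_NULL; auxv++) {
        if (auxv->a_type == type)
            return auxv;
    }
    return NULL;
}
```

The scan stops at the terminator rather than at a count because nothing on the stack records how many entries there are.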

So what sorts of things can we expect to see in the auxiliary vector? This is where things start to get interesting . . . in short, there are a few different specifications for the values, and they don't all agree. Since what we're really interested in is kernel behaviour, we'll go with what the kernel claims the values should be.

The types are as follows:

To recap: the initial stack has three main usable regions. First at low addresses, the program arguments, followed by the environment variables. Finally at the higher addresses, the auxiliary vector, which is made up of pairs of words (64 bits on 64-bit Linuxes, 32 bits on 32-bit Linuxes). The contents of the auxiliary vector might be a little clearer with an example:

rsp + auxv_offset + 0x28: NULL terminating value (0)
rsp + auxv_offset + 0x20: NULL terminating type (0)
rsp + auxv_offset + 0x18: AT_CLKTCK value (100)
rsp + auxv_offset + 0x10: AT_CLKTCK type (17)
rsp + auxv_offset + 0x08: AT_PHENT value (56)
rsp + auxv_offset + 0x00: AT_PHENT type (4)

So, how about some example values? Here's a quick program to dump the contents of the various auxiliary vector entries:

#include <stdio.h>
#include <stdint.h>

/// stolen from include/uapi/linux/auxvec.h in kernel sources
#define AT_NULL 0
#define AT_IGNORE 1
#define AT_EXECFD 2
#define AT_PHDR 3
#define AT_PHENT 4
#define AT_PHNUM 5
#define AT_PAGESZ 6
#define AT_BASE 7
#define AT_FLAGS 8
#define AT_ENTRY 9
#define AT_NOTELF 10
#define AT_UID 11
#define AT_EUID 12
#define AT_GID 13
#define AT_EGID 14
#define AT_PLATFORM 15
#define AT_HWCAP 16
#define AT_CLKTCK 17
/* AT_* values 18 through 22 are reserved */
#define AT_SECURE 23
#define AT_BASE_PLATFORM 24
#define AT_RANDOM 25
#define AT_EXECFN 31
/// stolen from arch/x86/include/asm/auxvec.h
#define AT_SYSINFO  32
#define AT_SYSINFO_EHDR 33

uint64_t *find_auxv(void *argv) {
    uint64_t *ptr = (uint64_t *)argv;
    printf("argv starts at %p\n", ptr);
    // skip argv
    while(*ptr != 0) ptr ++;
    // skip argv terminator
    ptr ++;
    printf("envp starts at %p\n", ptr);
    // skip envp
    while(*ptr != 0) ptr ++;
    // skip envp terminator
    ptr ++;
    // in auxv!
    printf("auxv starts at %p\n", ptr);
    return ptr;
}

int main(int __attribute__((unused)) argc, char *argv[]) {
    uint64_t *auxv = find_auxv(argv);

    if(auxv == NULL) {
        fprintf(stderr, "Could not find auxv.\n");
        return 1;
    }

    int count = 0;
    while(auxv[count*2]) count ++;

    printf("There are %i entries in the auxiliary vector.\n", count+1);

    for(int i = 0; i <= count; i ++) {
        printf("\tEntry %i: ", i);

        uint64_t type = auxv[i*2];
        uint64_t value = auxv[i*2+1];

        switch(type) {
        case AT_NULL:
            printf("NULL terminator (%lu)\n", type);
            break;
        case AT_IGNORE:
            printf("Ignored (%lu)\n", type);
            break;
        case AT_EXECFD:
            printf("Executable FD (%lu)\n", type);
            printf("\t\twhich is %lu\n", value);
            break;
        case AT_PHDR:
            printf("Program headers (%lu)\n", type);
            printf("\t\twhich are at 0x%lx\n", value);
            break;
        case AT_PHENT:
            printf("Program header entry size (%lu)\n", type);
            printf("\t\tand they are %lu bytes each\n", value);
            break;
        case AT_PHNUM:
            printf("Program header count (%lu)\n", type);
            printf("\t\tthere are %lu\n", value);
            break;
        case AT_PAGESZ:
            printf("Page size (%lu)\n", type);
            printf("\t\tand they are %lu bytes each\n", value);
            break;
        case AT_BASE:
            printf("Interpreter base address (%lu)\n", type);
            printf("\t\twhich is 0x%lx\n", value);
            break;
        case AT_FLAGS:
            printf("CPU flags (%lu)\n", type);
            printf("\t\tand they are 0x%lx\n", value);
            break;
        case AT_ENTRY:
            printf("Entry point (%lu)\n", type);
            printf("\t\twhich is 0x%lx\n", value);
            break;
        case AT_NOTELF:
            printf("Not ELF executable (%lu)\n", type);
            break;
        case AT_UID:
            printf("UID (%lu)\n", type);
            printf("\t\twhich is %lu\n", value);
            break;
        case AT_EUID:
            printf("Effective UID (%lu)\n", type);
            printf("\t\twhich is %lu\n", value);
            break;
        case AT_GID:
            printf("GID (%lu)\n", type);
            printf("\t\twhich is %lu\n", value);
            break;
        case AT_EGID:
            printf("Effective GID (%lu)\n", type);
            printf("\t\twhich is %lu\n", value);
            break;
        case AT_PLATFORM:
            printf("Platform ID (%lu)\n", type);
            printf("\t\twhich is \"%s\"\n", (char *)value);
            break;
        case AT_HWCAP:
            printf("Hardware capabilities (%lu)\n", type);
            printf("\t\tand they are 0x%lx\n", value);
            break;
        case AT_CLKTCK:
            printf("Clock ticks per second (%lu)\n", type);
            printf("\t\tof which there are %lu\n", value);
            break;
        case AT_SECURE:
            printf("Secure flag (%lu)\n", type);
            printf("\t\twhich is 0x%lx\n", value);
            break;
        case AT_BASE_PLATFORM:
            printf("Base platform ID (%lu)\n", type);
            printf("\t\twhich is \"%s\"\n", (char *)value);
            break;
        case AT_RANDOM:
            printf("Address of random bytes (%lu)\n", type);
            printf("\t\twhich is 0x%lx\n", value);
            printf("\t\tand they are ");
            // %02x keeps leading zeros, so all 16 bytes print as two digits each
            for(int j = 0; j < 16; j ++) printf("%02x", ((uint8_t *)value)[j]);
            printf("\n");
            break;
        case AT_EXECFN:
            printf("Executable filename address (%lu)\n", type);
            printf("\t\twhich is 0x%lx\n", value);
            printf("\t\tand is \"%s\"\n", (char *)value);
            break;
        case AT_SYSINFO:
            printf("VDSO call address (%lu)\n", type);
            printf("\t\twhich is 0x%lx\n", value);
            break;
        case AT_SYSINFO_EHDR:
            printf("VDSO ELF header address (%lu)\n", type);
            printf("\t\twhich is 0x%lx\n", value);
            break;
        default:
            printf("Unknown (%lu)\n", type);
            break;
        }
    }

    return 0;
}

And here's the output from an example run:

argv starts at 0x7fff8f391198
envp starts at 0x7fff8f3911a8
auxv starts at 0x7fff8f3912b8
There are 19 entries in the auxiliary vector.
    Entry 0: VDSO ELF header address (33)
        which is 0x7fff8f3fe000
    Entry 1: Hardware capabilities (16)
        and they are 0xbfebfbff
    Entry 2: Page size (6)
        and they are 4096 bytes each
    Entry 3: Clock ticks per second (17)
        of which there are 100
    Entry 4: Program headers (3)
        which are at 0x400040
    Entry 5: Program header entry size (4)
        and they are 56 bytes each
    Entry 6: Program header count (5)
        there are 8
    Entry 7: Interpreter base address (7)
        which is 0x7fab9ff9a000
    Entry 8: CPU flags (8)
        and they are 0x0
    Entry 9: Entry point (9)
        which is 0x4004e0
    Entry 10: UID (11)
        which is 1024
    Entry 11: Effective UID (12)
        which is 1024
    Entry 12: GID (13)
        which is 1024
    Entry 13: Effective GID (14)
        which is 1024
    Entry 14: Secure flag (23)
        which is 0x0
    Entry 15: Address of random bytes (25)
        which is 0x7fff8f3913e9
        and they are aab2a23a5ae5b1d20e3f1a94c22f46
    Entry 16: Executable filename address (31)
        which is 0x7fff8f393fec
        and is "./dump_auxv"
    Entry 17: Platform ID (15)
        and they are 0x7fff8f3913f9
    Entry 18: NULL terminator (0)

On Linux, the auxiliary vector is generated in the file fs/binfmt_elf.c. The following is the relevant part, starting at line 227 in the Linux 3.7 kernel source:

#define NEW_AUX_ENT(id, val) \
    do { \
        elf_info[ei_index++] = id; \
        elf_info[ei_index++] = val; \
    } while (0)

#ifdef ARCH_DLINFO
    /*
     * ARCH_DLINFO must come first so PPC can do its special alignment of
     * AUXV.
     * update AT_VECTOR_SIZE_ARCH if the number of NEW_AUX_ENT() in
     * ARCH_DLINFO changes
     */
    ARCH_DLINFO;
#endif
    NEW_AUX_ENT(AT_HWCAP, ELF_HWCAP);
    NEW_AUX_ENT(AT_PAGESZ, ELF_EXEC_PAGESIZE);
    NEW_AUX_ENT(AT_CLKTCK, CLOCKS_PER_SEC);
    NEW_AUX_ENT(AT_PHDR, load_addr + exec->e_phoff);
    NEW_AUX_ENT(AT_PHENT, sizeof(struct elf_phdr));
    NEW_AUX_ENT(AT_PHNUM, exec->e_phnum);
    NEW_AUX_ENT(AT_BASE, interp_load_addr);
    NEW_AUX_ENT(AT_FLAGS, 0);
    NEW_AUX_ENT(AT_ENTRY, exec->e_entry);
    NEW_AUX_ENT(AT_UID, from_kuid_munged(cred->user_ns, cred->uid));
    NEW_AUX_ENT(AT_EUID, from_kuid_munged(cred->user_ns, cred->euid));
    NEW_AUX_ENT(AT_GID, from_kgid_munged(cred->user_ns, cred->gid));
    NEW_AUX_ENT(AT_EGID, from_kgid_munged(cred->user_ns, cred->egid));
    NEW_AUX_ENT(AT_SECURE, security_bprm_secureexec(bprm));
    NEW_AUX_ENT(AT_RANDOM, (elf_addr_t)(unsigned long)u_rand_bytes);
    NEW_AUX_ENT(AT_EXECFN, bprm->exec);
    if (k_platform) {
        NEW_AUX_ENT(AT_PLATFORM,
                (elf_addr_t)(unsigned long)u_platform);
    }
    if (k_base_platform) {
        NEW_AUX_ENT(AT_BASE_PLATFORM,
                (elf_addr_t)(unsigned long)u_base_platform);
    }
    if (bprm->interp_flags & BINPRM_FLAGS_EXECFD) {
        NEW_AUX_ENT(AT_EXECFD, bprm->interp_data);
    }
#undef NEW_AUX_ENT

For reference, on the 64-bit x86 arch (i.e. 64-bit Intel architecture), ARCH_DLINFO expands out to the following (from arch/x86/include/asm/elf.h):

#define ARCH_DLINFO \
do { \
    if (vdso_enabled) \
        NEW_AUX_ENT(AT_SYSINFO_EHDR, \
                (unsigned long)current->mm->context.vdso); \
} while (0)

While, on 32-bit, it expands to this instead:

#define ARCH_DLINFO_IA32(vdso_enabled) \
do { \
    if (vdso_enabled) { \
        NEW_AUX_ENT(AT_SYSINFO, VDSO_ENTRY); \
        NEW_AUX_ENT(AT_SYSINFO_EHDR, VDSO_CURRENT_BASE); \
    } \
} while (0)
} while (0)
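The NEW_AUX_ENT pattern is easy to model in userspace, which makes it clearer why the vector ends up as a flat run of (type, value) words. The following is a simplified model and not kernel code: build_sample_auxv is an invented name, the two sample values are taken from the example run above, and the explicit AT_NULL pair is just this sketch's way of terminating the list.

```c
#include <stddef.h>
#include <stdint.h>

#define AT_NULL 0
#define AT_PAGESZ 6
#define AT_CLKTCK 17

/* Append an (id, val) pair to a flat array of words, in the same way
 * NEW_AUX_ENT fills elf_info. */
#define NEW_AUX_ENT(id, val) \
    do { \
        elf_info[ei_index++] = (id); \
        elf_info[ei_index++] = (val); \
    } while (0)

/* Fill `elf_info` with two sample entries plus a terminator and return the
 * number of words written. The caller must supply room for six words. */
static size_t build_sample_auxv(uint64_t *elf_info) {
    size_t ei_index = 0;
    NEW_AUX_ENT(AT_PAGESZ, 4096);
    NEW_AUX_ENT(AT_CLKTCK, 100);
    NEW_AUX_ENT(AT_NULL, 0);    /* terminating pair */
    return ei_index;
}
#undef NEW_AUX_ENT
```

Because each call writes the type word first and the value word second, the order of the NEW_AUX_ENT calls is exactly the order in which a userspace walker sees the entries.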

TODO: what happens when values are incorrect?

VDSO

Now let us turn to global memory. The kernel will provide a memory region called the 'VDSO', which two of the auxiliary vector entries mentioned earlier (AT_SYSINFO and AT_SYSINFO_EHDR) point into.

Remember back in the first article, when discussing how a userspace program makes a system call, that there were multiple ways to invoke one? The VDSO is the kernel's way of abstracting over which hardware method is actually used. Instead of using a syscall/sysenter instruction directly, we call a function in the VDSO that contains this instruction.

The reason for this is that now, the kernel can swap out VDSOs and userspace will be none the wiser. Should AMD or Intel introduce some amazing new instruction that will allow system calls to run 100% faster, all that you'd have to do is get the kernel to implement this new instruction in a new VDSO, and well-behaved programs will all magically work.

One might object to the overhead of the extra function call, but I expect that cost to be dwarfed by that of the privilege-level (CPL, IOPL, etc.) change on entry to the kernel. I've never benchmarked this, so I don't actually know how much of an impact there is. I imagine the cost of executing the syscall itself would also dominate.
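To see that the VDSO really is an ELF image mapped into the process, we can follow the AT_SYSINFO_EHDR entry and look at the first four bytes. This is a minimal sketch assuming a libc that provides getauxval() in <sys/auxv.h>; vdso_looks_like_elf is a name invented for this example.

```c
#include <string.h>
#include <sys/auxv.h>   /* getauxval(), AT_SYSINFO_EHDR */

/* Returns 1 if AT_SYSINFO_EHDR points at something starting with the ELF
 * magic bytes, 0 if it points at something else, and -1 if there is no
 * such entry (for example, when the VDSO is disabled). */
static int vdso_looks_like_elf(void) {
    const unsigned char *base =
        (const unsigned char *)getauxval(AT_SYSINFO_EHDR);
    if (base == NULL)
        return -1;
    return memcmp(base, "\177ELF", 4) == 0 ? 1 : 0;
}
```

The -1 case matters because the kernel only emits the entry when the VDSO is enabled, as the vdso_enabled checks in ARCH_DLINFO show.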

vsyscall

vsyscall is essentially a fixed-location version of the VDSO, which it predates. It's placed at address 0xffffffffff600000 (actually the same as 0xffffff600000 due to the aforementioned quirk in the x86_64 architecture) and contains much the same information. It's only provided for backwards compatibility with programs written for versions of Linux that didn't have VDSO support yet.

More technically, the address 0xffffffffff600000 is the address of the first vsyscall. The address for the nth such vsyscall is determined by the following macros (from arch/x86/include/asm/vsyscall.h):

#define VSYSCALL_START (-10UL << 20)
#define VSYSCALL_SIZE 1024
#define VSYSCALL_END (-2UL << 20)
#define VSYSCALL_MAPPED_PAGES 1
#define VSYSCALL_ADDR(vsyscall_nr) (VSYSCALL_START+VSYSCALL_SIZE*(vsyscall_nr))

If you evaluate this for vsyscall_nr = 0 you get 0xffffffffff600000, and it's pretty transparent that the next vsyscall would be placed at 0xffffffffff600400 instead.
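If the arithmetic isn't obvious: -10UL << 20 is 10 MiB below the top of the 64-bit address space, and each vsyscall slot is 1024 (0x400) bytes. The sketch below restates the two relevant macros with explicit 64-bit types so the result doesn't depend on the width of unsigned long. The SKETCH_ names and vsyscall_addr are invented for this example.

```c
#include <stdint.h>

/* Restatement of VSYSCALL_START and VSYSCALL_SIZE using 64-bit constants.
 * Negating an unsigned value wraps around, so this is 2^64 - (10 << 20). */
#define SKETCH_VSYSCALL_START (-UINT64_C(10) << 20)
#define SKETCH_VSYSCALL_SIZE  UINT64_C(1024)

/* Address of the nth vsyscall slot, mirroring VSYSCALL_ADDR. */
static uint64_t vsyscall_addr(uint64_t vsyscall_nr) {
    return SKETCH_VSYSCALL_START + SKETCH_VSYSCALL_SIZE * vsyscall_nr;
}
```

Calling vsyscall_addr(0) gives 0xffffffffff600000 and vsyscall_addr(1) gives 0xffffffffff600400, matching the values above.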

More details about the VDSO and vsyscall

TODO: disassemblies

TODO: VDSO entry point

Here's some code to find the VDSO call target on a 32-bit system:

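#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// Set by find_call_target(); only meaningful on 32-bit x86.
static void (*vdso_call_target)(void);

void find_call_target(void *argv) {
    // Goal: Read VDSO entry from auxv.
    // Stack words are 32 bits wide here, so walk in uint32_t steps.
    uint32_t *cursor = (uint32_t *)argv;
    // skip over argv.
    while(*cursor != 0) cursor ++;
    cursor ++; // terminating NULL
    // skip over envp.
    while(*cursor != 0) cursor ++;
    cursor ++; // terminating NULL
    // now in auxv. Want entry with type 0x20 (AT_SYSINFO).
    while(*cursor != 0x20 && *cursor != 0x0) cursor += 2;

    if(*cursor == 0) {
        fprintf(stderr, "Couldn't find VDSO from auxv.\n");
        exit(1);
    }
    else {
        // the value word follows the type word
        vdso_call_target = (void (*)(void))(uintptr_t) *(cursor + 1);
    }
}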

Finding the VDSO target on x86_64 is slightly more complicated, as you have to parse the ELF header to find the entry address, instead of having it provided nicely in an auxiliary vector entry.
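As a starting point for that parsing, here is a sketch that reads the ELF header the kernel points us at. It assumes a libc providing getauxval() in <sys/auxv.h> and the ElfW() macro in <link.h>; the helper names are invented for this example. It only exposes the header and makes no attempt to turn any field into a callable address, since that depends on how the VDSO image is laid out on a given kernel.

```c
#include <elf.h>        /* ET_DYN */
#include <link.h>       /* ElfW() picks the native Elf32_/Elf64_ types */
#include <stddef.h>
#include <sys/auxv.h>   /* getauxval(), AT_SYSINFO_EHDR */

/* Hypothetical helper: return the VDSO's ELF header, or NULL if the kernel
 * did not supply an AT_SYSINFO_EHDR entry. */
static const ElfW(Ehdr) *vdso_ehdr(void) {
    return (const ElfW(Ehdr) *)getauxval(AT_SYSINFO_EHDR);
}

/* Returns 1 if the VDSO's header says it is a shared object (ET_DYN),
 * 0 if it says something else, and -1 if there is no VDSO. */
static int vdso_is_shared_object(void) {
    const ElfW(Ehdr) *ehdr = vdso_ehdr();
    if (ehdr == NULL)
        return -1;
    return ehdr->e_type == ET_DYN ? 1 : 0;
}
```

From the header, fields such as e_entry, e_phoff and e_phnum are available for whatever further parsing is needed.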

Summary

Hopefully, this article has given you a better understanding of the initial contents of memory when a program on x86_64 Linux begins. I apologize if there were a lot of code examples in this article, but it just seems like the sort of thing you communicate via code.

I'm debating if I should write the next article in this series, on the topic of dynamic linking. If I do so, it will likely be split across three or four posts, as the topic is very large. It will also take a very long time to complete, so don't expect it anytime soon.

Happy hacking,

- ethereal


  1. Sadly, the people who originally added this entry to the auxiliary vector didn't foresee the possibility of having multiple lowest-level page sizes . . . 

  2. If you've never run into cpuid before, I suggest spending an afternoon reading through the relevant section in the Intel manuals. It's interesting stuff. Even better, compare and contrast to the AMD manuals.