Thursday, June 21, 2012

Writing a primitive dynamic loader for Linux

The process of dynamic loading libraries is a complex one.  Infact too complex to elaborate and explain on a blog site.  Thousands of man hours and research has been wasted on dynamic loading.  Things like PIC (Position independant code) take time to perfect and implement and are often subjected to bugs and performance issues. Of course you're not here to listen to me ramble about it.

What I'm going to show you is the bread and butter of writing your own user-space dynamic loader by getting down to the bits and bytes of it.  Before we continue I should suggest you have a working toolchain which includes a compiler (gcc, clang, pathscale will work) and binutils (objdump and objcopy).

Lets start with a simple library function:

int print_hello_world() {
    return printf("Hello World!\n");

This is the function we're going to want to _dynamically_ load from our dynamic loader and execute.  Lets begin by compiling this into an object with gcc.

# gcc hello.c -c hello.o

I assume you know assembly (because you're going to need it) if not here's a small crash course.  Assembly language is hardware-level language.  Every thing done in code is directly translated to a byte-code encoding of various instructions that move arguments in and off the stack, move values and memory around and perform basic arithmetic.  It's very low-level and working with it is nearly impossible.  Which is why there exists assemblers that take an input assembly file (human readable version of the bytes) and assembles it to a (non human readable but computer readable) format.

The first thing we want to do is look at the dissasembly of  our hello.o object file.  To do this we use objdump

# objdump -d hello.o

0000000000000000 <print_hello_world>:
   0:    55                  push  %rbp
   1:    48 89 e5            mov    %rsp,%rbp
   4:    bf 00 00 00 00      mov    $0x0,%edi
   9:    e8 00 00 00 00      callq   e <print_hello_world+0xe>
   e:    5d                  pop    %rbp
   f:    c3                  retq

Note: the output may vary depending on your architecture.
Lets get a few things figured out first.  As you can see we have some basic stack manipulation in this function in accordance to the SysV ABI.  But the thing worth nothing is the `callq`.  The printf function doesn't exist at this point, during link the linker will patch the call with a call to library printf.  We don't want that, why?  because printf itself is a library function which exists in a dynamically loaded library.  We can't use it.  So how do we perform a printf without a printf?

Welcome to the world of system calls.  System calls are the process of invoking the kernel.  Every major OS has system calls of some sort.  But for the sake of linux there is a very specific way at performing a systemcall.  Linux expects arguments for a systemcall to exist in registers in the order of EAX,EBX,ECX,EDX etc etc.  Once the registers contain the values required simply calling int 128 (int 0x80) will perform the systemcall.

So what systemcall does printf?  None.  Instead there is write, which writes to a file descriptor.  The write systemcall expects the fd to write to, the data, and the size of the data.  So lets rewrite our test.c file to accomodate that.

int print_hello_world() {
    const char hello[] = "Hello World!\n";
    const int  size       = sizeof(hello);
    int        ret;
    __asm__ __volatile__ (
        "movl $4,  %%eax\n\t"
        "movl $1,  %%ebx\n\t"
        "movl %1, %%ecx\n\t"
        "movl %2, %%edx\n\t"
        "int $0x80" :
            "=a"(ret) : "g"(hello), "g"(size) :
            "%ebx", "%ecx", "%edx", "%esi", "%edi"
    return ret;

Looks fancy doesn't it?  Lets work out this assembly first
the movl for 4 is the syscall number for write
the movl for 1 is the file descriptor for stdout

This code will output "Hello World!\n" to stdout.  Pretty awesome eh?
Of course now we need to take a look at the assembly of this.

0000000000000000 <print_hello_world>:
   0:    55                        push   %rbp
   1:    48 89 e5                  mov    %rsp,%rbp
   4:    41 54                     push   %r12
   6:    53                        push   %rbx
   7:    c7 45 d0 48 65 6c 6c      movl   $0x6c6c6548,-0x30(%rbp)
   e:    c7 45 d4 6f 20 57 6f      movl   $0x6f57206f,-0x2c(%rbp)
  15:    c7 45 d8 72 6c 64 21      movl   $0x21646c72,-0x28(%rbp)
  1c:    66 c7 45 dc 0a 00         movw   $0xa,-0x24(%rbp)
  22:    c7 45 ec 0e 00 00 00      movl   $0xe,-0x14(%rbp)
  29:    48 8d 45 d0               lea    -0x30(%rbp),%rax
  2d:    48 89 45 c8               mov    %rax,-0x38(%rbp)
  31:    b8 04 00 00 00            mov    $0x4,%eax
  36:    bb 01 00 00 00            mov    $0x1,%ebx
  3b:    8b 4d c8                  mov    -0x38(%rbp),%ecx
  3e:    8b 55 ec                  mov    -0x14(%rbp),%edx
  41:    cd 80                     int    $0x80
  43:    41 89 c4                  mov    %eax,%r12d
  46:    44 89 65 e8               mov    %r12d,-0x18(%rbp)
  4a:    8b 45 e8                  mov    -0x18(%rbp),%eax
  4d:    5b                        pop    %rbx
  4e:    41 5c                     pop    %r12
  50:    5d                        pop    %rbp
  51:    c3                        retq 

Pay special attention to the encoding not the instructions, this is the bytecode the CPU understands and executes, it all encodes to a special meaning, this meaning is on the right.

Normally there would be a problem with this code.  Currently it's PIC, but it might not be PIC (for example if you rely on data outside the function).  Note the three successive movl's at the top, these are encoding for hello world and the size (but they could be addresses if you're not paying special attention).  Simply loading code that uses addresses to read from memory and executing it would cause a segmentation fault (since we don't have access to that memory this).  Instead we need a surfire way to make sure the code is position independant (even if it already is).   Doing it requires a lot of work, you need relocation tables, and patching of all those addresses.  Instead we can pass this information by registers from the dynamic loader (this isn't very usefull but this tutorial is only to explain and implement a primitive dynamic loader).

So lets make this code PIC by rewriting it, to something like this:

int print_hello_world(const char *hello, int size, int *ret) {
    __asm__ __volatile__ (
        "movl $4, %%eax\n\t"
        "movl $1, %%ebx\n\t"
        "movl %1, %%ecx\n\t"
        "movl %2, %%edx\n\t"
        "int $0x80" :
            "=a"(*ret) : "g"(hello), "g"(size) :
            "%ebx", "%ecx", "%edx", "%esi", "%edi"
    return *ret;

as you can see no data on the heap.  Everything fits in registers and is easily PIC.  Of course you cannot make this assumption, lets check the dissasembly.

#objdump -d hello.o

   0:    55                       push   %rbp
   1:    48 89 e5                 mov    %rsp,%rbp
   4:    53                       push   %rbx
   5:    48 89 7d f0              mov    %rdi,-0x10(%rbp)
   9:    89 75 ec                 mov    %esi,-0x14(%rbp)
   c:    48 89 55 e0              mov    %rdx,-0x20(%rbp)
  10:    b8 04 00 00 00           mov    $0x4,%eax
  15:    bb 01 00 00 00           mov    $0x1,%ebx
  1a:    8b 4d f0                 mov    -0x10(%rbp),%ecx
  1d:    8b 55 ec                 mov    -0x14(%rbp),%edx
  20:    cd 80                    int    $0x80
  22:    41 89 c0                 mov    %eax,%r8d
  25:    48 8b 45 e0              mov    -0x20(%rbp),%rax
  29:    44 89 00                 mov    %r8d,(%rax)
  2c:    48 8b 45 e0              mov    -0x20(%rbp),%rax
  30:    8b 00                    mov    (%rax),%eax
  32:    5b                       pop    %rbx
  33:    5d                       pop    %rbp
  34:    c3                       retq 

Hey Look! Not bad, everything is happening between the registers! This is PIC!

Now we need to break this down further! Currently these object files we're working with are in ELF format and not really easy to load.  What if we just rewrote those bytes on the left side of that dissasembly in a file manually we'd have a flat binary right?  Wishfull thinking, that would take too long, this is what objcopy is for.

#objcopy --only-section=.text --output-target binary hello.o hello.bin

We specify we only want the .text section (other sections are useless) .text = code.  We can thus validate the instructions to the objdump using hexdump

# hexdump -C hello.bin

00000000  55 48 89 e5 53 48 89 7d  f0 89 75 ec 48 89 55 e0  |UH..SH.}..u.H.U.|
00000010  b8 04 00 00 00 bb 01 00  00 00 8b 4d f0 8b 55 ec  |...........M..U.|
00000020  cd 80 41 89 c0 48 8b 45  e0 44 89 00 48 8b 45 e0  |..A..H.E.D..H.E.|
00000030  8b 00 5b 5d c3                                    |..[].|

As we can validate the encode is the same as the objdump.  We now have loadable code that we can EXECUTE!  Lets write our dynamic executor! 

#include <stdio.h>
#include <stdlib.h>

unsigned char *mem;
int            i;
long int       size;
int            ret;
int main() {
    FILE *fp = fopen("hello.bin", "rb");
    if  (!fp) {
        printf("[ld] Failed to load hello.bin\n");
        return -1;
    fseek(fp, 0, SEEK_END);
    size = ftell(fp);
    fseek(fp, 0, SEEK_SET);
    printf("[ld] Loading %d byte program\n", size);
    if (!(mem = malloc(size))) {
        printf("[ld] Failed to allocate memory for code\n");
        return -1;
    } else {
        printf("[ld] Reserved %d bytes for program\n", size);
    fread(mem, 1, size, fp);
    printf("[ld] Loaded program\n");
    printf("    ");
    for(i = 1; i < size+1; i++)
        printf((i%8==0)?"0x%02X\n    ":"0x%02X, ", mem[i-1]);
    printf("\n[ld] Executing ...\n");
    int (*call)(const char *, int, int*) = mem;
    (int)(*call)("Hello World\n", sizeof("Hello World\n"), &ret);
    printf("[ld] Called function; returned %d strlen(\"Hello World\")\n", ret);
    return ret;

Now here's the problem if we compile and execute this code:

# gcc run.c -o run
# ./run

[ld] Loading 53 byte program
[ld] Reserved 53 bytes for program
[ld] Loaded program
    0x55, 0x48, 0x89, 0xE5, 0x53, 0x48, 0x89, 0x7D
    0xF0, 0x89, 0x75, 0xEC, 0x48, 0x89, 0x55, 0xE0
    0xB8, 0x04, 0x00, 0x00, 0x00, 0xBB, 0x01, 0x00
    0x00, 0x00, 0x8B, 0x4D, 0xF0, 0x8B, 0x55, 0xEC
    0xCD, 0x80, 0x41, 0x89, 0xC0, 0x48, 0x8B, 0x45
    0xE0, 0x44, 0x89, 0x00, 0x48, 0x8B, 0x45, 0xE0
    0x8B, 0x00, 0x5B, 0x5D, 0xC3,
[ld] Executing ...
Segmentation fault

Wait what! Segmentation fault? There is some issues preventing us from executing code from the stack.  This is a security feature, and simple to fix.  First we need to tell GCC to not build with stack protection and pass the execstack flag to ld so we can execute code from the stack.  Lets try again

# gcc -fno-stack-protector -z execstack run.c -o run
# ./run
[ld] Loading 53 byte program
[ld] Reserved 53 bytes for program
[ld] Loaded program
    0x55, 0x48, 0x89, 0xE5, 0x53, 0x48, 0x89, 0x7D
    0xF0, 0x89, 0x75, 0xEC, 0x48, 0x89, 0x55, 0xE0
    0xB8, 0x04, 0x00, 0x00, 0x00, 0xBB, 0x01, 0x00
    0x00, 0x00, 0x8B, 0x4D, 0xF0, 0x8B, 0x55, 0xEC
    0xCD, 0x80, 0x41, 0x89, 0xC0, 0x48, 0x8B, 0x45
    0xE0, 0x44, 0x89, 0x00, 0x48, 0x8B, 0x45, 0xE0
    0x8B, 0x00, 0x5B, 0x5D, 0xC3,
[ld] Executing ...
Hello World
[ld] Called function; returned 13 strlen("Hello World")

Look at that we printed HELLO WORLD! I should note this isn't the same as dynamic loading but very close in princible.  I hope you learn how all this works and can adapt it for future logic if you ever implement a dynamic loader or evn linker!.  I should disclaimer here that dynamic linkers are a lot more complex, they handle relocation tables to relocate data, and do PIC via binary patching and other complex methods.  But this is a good start!

Some other subtle notes that aren't covered here.

1 The data cache on the CPU should be cleared as well as the instruction cache since there is no gurantee the data and instruction cache will be in sync.

2 The memory for the code should be allocated with mmap and set unprotected (instead of using GCC trickery like I did)



    i made a tool that does this automatically under mingw by moving the code and data together and doing the relocatiosn at runtime by parsing the object file.

  2. Very nice. I should note this is really a beginners insight to this stuff. Disclaimer this code is bad :P It's incorrect but still fun. I'm going to do a nicer post some day which explains more of it and even explains how to make your own quality linker.