Intro to x86 Hacking: Function Prologues

Hacking is like doing a magic trick: it impresses the uninitiated precisely because they don't understand what they are seeing. The hacker is a systems thinker, an exceptional one who knows the rules of the system so well that it seems like they are cheating. Typically we think of hacking in the context of a computer-savvy nerd crafting program exploits. While this is true, many people have a hacker mindset without even knowing it. Anyone who can find novel ways of tweaking and bending the rules to their advantage is a hacker.

Take for example former reality TV star, former U.S. President and insurrectionist Donald J. Trump who, despite taxation being theft, is often denounced for (allegedly) legally avoiding taxes. Yet the man remains outside of a federal prison cell. How? Well, his team of attorneys are hackers in the truest sense; they understood the tax code so well that they were able to follow the rules in such a way as to allegedly avoid taxation.

We embarked on the intellectual odyssey of becoming a hacker precisely to avoid the idiotic discussions of "political" people with too much time on their hands. Needless to say, it's time to get our hands dirty and do some real work. Whether you're playing Pokemon on a Game Boy, texting your friends or mindlessly wasting your day with meaningless, irrelevant and addictive content on one of Big Brother's social media platforms, the rules of the computer are always the same. Every computer follows a series of instructions. These instructions are stored in a form unreadable to humans, 1s and 0s known as machine code. We'll never deal with machine code directly; instead we settle for the next best thing: Assembly Language.

Assembly Language is an architecture-specific set of mnemonic instructions, each of which maps almost one-to-one onto a machine code instruction executed by the CPU. It is the lowest and most precise form of communication between man and machine that is still readable, and understanding assembly language will allow us to perform the sleight of hand that is computer hacking.

Higher level programming languages like Python, C, etc. are either compiled into machine code or run by an interpreter that is itself machine code. Once again, assembly language is the human-readable form of machine code, and to exploit a program we need to be able to read it in order to see what the computer is actually doing behind the scenes.

Below we write a program in C that prints our name 10 times; how dreadfully unoriginal. I'll assume you know the basics of C. I'm a novice at the language and don't really feel justified in teaching it to anyone. Additionally, I find it boring to go over basic programming concepts like variable declaration, libraries, etc. The internet is large; if I write something you don't understand, look it up.

#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]){
    int i = 0;

    if(argc < 2){
        printf("Usage: %s <name>\n", argv[0]);
        exit(-1);
    }

    for(i = 0; i<10; i++){
        printf("i = %d\n", i);
        printf("Hello, World ! My name is %s\n", argv[1]);
    }

    return 0;
}

As you can see, this is a relatively simple program. We'll compile it into an executable, i.e. translate it to machine language, with gcc and run it.

$ gcc -g -o basics basics.c
$ ./basics Corey
i = 0
Hello, World ! My name is Corey
i = 1
Hello, World ! My name is Corey
i = 2
Hello, World ! My name is Corey
i = 3
Hello, World ! My name is Corey
i = 4
Hello, World ! My name is Corey
i = 5
Hello, World ! My name is Corey
i = 6
Hello, World ! My name is Corey
i = 7
Hello, World ! My name is Corey
i = 8
Hello, World ! My name is Corey
i = 9
Hello, World ! My name is Corey

No surprises there: the program runs smoothly and exits without any complaints. Now for the fun part: we're going to fire up a debugger and see our program in its compiled form, disassembled from machine code back into assembly. The debugger is to the hacker what the microscope is to a biologist; it lets us look at the nitty gritty. Notice how the short, sweet C program becomes larger and more complex when we look at it as assembly.

$ gdb -q basics
Reading symbols from basics...
(gdb) set disassembly-flavor intel
(gdb) disass main
0x0000000000001169 <+0>: endbr64    
0x000000000000116d <+4>: push   rbp   
0x000000000000116e <+5>: mov    rbp,rsp   
0x0000000000001171 <+8>: sub    rsp,0x20   
# the rest of the program was truncated for brevity/sanity

We start by loading our executable into gdb and setting the disassembly-flavor, or format of the language, to intel. There are two dialects of x86 assembly syntax: AT&T 🤢 and Intel. Because I grew up reading Intel, I refuse to even look at AT&T; whenever I do, I forget that I'm reading it and revert to reading the instructions as if they were in Intel syntax. To avoid that whole mess I only read the Intel syntax.

We disassemble our main function, i.e. dump its assembly code, with the disass command, and we're greeted by a whole lot of things that we may not understand, which is completely O.K. Instructions live somewhere in memory, and memory has addresses. Those funny looking 0x... values are the hexadecimal addresses of each instruction, and the <+N> next to each one is that instruction's offset in bytes from the start of main. The memory of a program is broken up into a series of sections; our instructions reside in the text section, which in the traditional layout sits at lower addresses than the program's data, heap and stack.

Hexadecimal is a base-16 numbering system, as opposed to the base-10 system that you (hopefully) learned in school. Why do computers use hexadecimal numbers? Well, the fundamental unit is the byte, which is 8 bits, and a single hexadecimal digit covers exactly 4 bits, so every byte is written as exactly two hexadecimal digits. As in base-10, the values 0-9 are represented by the digits 0-9; the values 10-15 are denoted by the letters A-F. Don't worry about knowing exactly what a hexadecimal number is in base-10; in terms of hacking you mostly need to be able to tell which numbers are larger or smaller than others.

To the right of each instruction's address is the actual instruction. Most instructions follow the basic syntax of <operation> <destination>, <source>, at least in Intel syntax; some, like push rbp, take a single operand, and some, like endbr64, take none. The operation is the instruction being executed by the computer; these can be mathematical operations, moving data to and from memory, calling functions and more. The destination/source portion can be a register, a memory address or, for the source, a constant value such as 0x20. In these first few lines RSP and RBP are registers.

A register is a place inside the CPU for the computer to store and operate on data. Essentially registers are the variables of assembly language. Unlike the variables of higher level programming languages, registers don't have to be declared; there are a fixed number of them and each can store a fixed amount of data. You may have heard of 8 bit, 32 bit and 64 bit computers; this refers to the size, or width, of the registers.

There are two types of registers: general and special purpose. A general purpose register is a lot like the variables you're used to seeing in algebra or other programming languages: a place to store and work with data. Special purpose registers, as their name implies, are special in that they have the job of coordinating the program as it runs in memory. The ones we care about here store memory addresses, and because of this they are referred to as pointers, as in pointing to another place in memory.

Refer back to the first few lines of the program:

0x000000000000116d <+4>: push   rbp   
0x000000000000116e <+5>: mov    rbp,rsp   
0x0000000000001171 <+8>: sub    rsp,0x20

What you see above is known as the function prologue. Every time a function is called, some variation of these three lines of code is executed to set up that function's region of a data structure known as the stack. The stack is just a place in memory where data can be stored for the program; remember, there are a limited number of registers, and more often than not we need another place to store our data.

Picture the stack as a column: it has a top and a bottom. RBP is the base pointer register; the memory address it stores marks the bottom of the current function's part of the column. RSP is the stack pointer, and the memory address it stores is the one at the top of the stack. Confusingly enough, the top of the stack, RSP, will have the lowest memory address, which means that the bottom, RBP, will have a higher memory address. After the function prologue executes, the stack pointer will always be at a lower address than the base pointer. Remember, the stack grows towards lower memory addresses.

The function prologue always follows the same basic sequence, and this sets up the stack frame, or context, for the current function. 1) Push the value currently in the RBP register, the caller's base pointer, onto the stack so it can be restored later. 2) Copy the address from RSP to RBP. 3) Grow the stack by subtracting from the address stored in RSP.

In the first instruction, push saves the caller's base pointer by storing the current value of RBP on top of the stack, which also lowers RSP by 8 bytes. This marks the end of the old stack frame, if any exists. In the second instruction the address held in RSP is copied into RBP; at this point RBP and RSP are equal, and RBP marks the beginning of the new stack frame. Lastly, in the third instruction the computer makes room for the new stack frame by subtracting from RSP. Why subtraction? Because the stack grows towards lower memory addresses. In the example above, the main function's stack frame gets 0x20 bytes of memory, or 32 bytes.

Now, this all might seem like a lot of work to do every time a function is called, and you're right, it is. But the stack provides an enormous advantage to computers because it offers a way to store and use data that doesn't fit in registers. We'll uncover just how the computer references data on the stack in the next tutorial.