Hey fellow pwners!

This is the first post of the Reverse Engineering and Binary Exploitation Tutorial Series. We will discuss how a C/C++ program is converted into an executable that can be run on the system.

Prerequisites: You should have a Linux machine running. If you are using Windows, you can follow this YouTube video (Link) and easily install Ubuntu, a student-friendly Linux flavor.

Let us start!

Before starting with today’s topic, fire up the terminal (Ctrl+Alt+T). Let us create a directory named rev_eng_series in the home directory using the mkdir (make directory) command and enter it using the cd (change directory) command.

$ mkdir rev_eng_series
$ cd rev_eng_series
~/rev_eng_series$ 

This is our first post. So, let us create a directory named post_1 inside rev_eng_series directory and traverse into it.

~/rev_eng_series$ mkdir post_1
~/rev_eng_series$ cd post_1
~/rev_eng_series/post_1$ 

Let us store all the code or anything we do in this post in post_1 directory. This way, we can keep track of our work.

Let us first understand what Compilation means. The dictionary meaning of Compilation is "the action or process of producing something". Here, we will take a simple C program and see the process of producing an executable from it. This is the source code of the C program used.

~/rev_eng_series/post_1$ cat code1.c 

#include<stdio.h>
#include<stdio.h>    

#define NUMBER 100

int a = 10;              //Global variable 'a' initialized to 10.
int b;                   //Global variable uninitialized.

int main() {
    int c = 123,number = NUMBER;
    char d = 'x';

    printf("Hello world!!\n");

    return 0;
}

Overview

Let us look at the overview of whole conversion process. Later, we will get into details.

  • gcc (the C compiler driver of the GNU Compiler Collection) takes the C source file as an argument and generates an executable file. Save the above C source as code1.c and compile it in the following manner.

    ~/rev_eng_series/post_1$ gcc code1.c -o code1
    ~/rev_eng_series/post_1$ ls -l code1
    -rwxrwxr-x 1 adwi adwi 8656 Jun 18 22:20 code1
    
  • code1 is the executable generated. Open it in a text editor (e.g., vi) and see what it looks like. It looks weird because it is mostly raw machine code (binary data), along with some metadata and a few readable strings.

  • Conversion of a C/C++ source file (.c / .cpp files) to an executable file is not a single step process, though it might feel like one. Look at the following diagram:

                                                Preprocessing
                                            --------------------- 
      C source code (hello.c)   ----------> |   Preprocessor    | ---------->   hello.i (Intermediate C sourcefile)
                                            ---------------------
    
                                                  Compiling
                                            ---------------------            
                     hello.i    ----------> |     Compiler      | ---------->   hello.s (Assembly code)
                                            ---------------------
    
                                                  Assembling
                                            ---------------------
                     hello.s    ----------> |     Assembler     | ---------->   hello.o (Object code)
                                            ---------------------
                        
                                                    Linking
                                            ---------------------
         hello.o + Libraries    ----------> |       Linker      | ---------->   hello / a.out (Executable)     
                                            ---------------------                
    
  • The conversion process consists of 4 sub-processes: Preprocessing, Compiling, Assembling and Linking. The objective of this post is to understand each of these sub-processes in detail.

  • Let us compile code1.c in the following manner. This helps our analysis.

    ~/rev_eng_series/post_1$ gcc code1.c -o code1 -save-temps
    
  • Generally, the output files generated by the Preprocessor, Compiler and Assembler are stored temporarily in the /tmp directory and deleted as soon as the executable is generated. With the -save-temps option, gcc keeps those temporary files in the current directory, which helps our analysis. There are 4 sub-processes, so 4 files are generated: code1.i, code1.s, code1.o and code1. code1 is the final executable.

    ~/rev_eng_series/post_1$ ls -l 
    total 44
    -rwxrwxr-x 1 adwi adwi  8656 Jun 18 22:55 code1
    -rw-rw-r-- 1 adwi adwi   220 Jun 18 22:53 code1.c
    -rw-rw-r-- 1 adwi adwi 17156 Jun 18 22:54 code1.i
    -rw-rw-r-- 1 adwi adwi  1576 Jun 18 22:54 code1.o
    -rw-rw-r-- 1 adwi adwi   594 Jun 18 22:54 code1.s
    ~/rev_eng_series/post_1$ 
    

Go through the C program thoroughly. Let us start our analysis.

1. Preprocessing :

  • The C Preprocessor (CPP) does the preprocessing. (Do not confuse this CPP with the .cpp file extension given to C++ source files.)

  • It generates code1.i file. i stands for intermediate.

  • CPP takes code1.c (original C sourcefile) as input and outputs code1.i.

  • CPP does a lot of work before the source file goes to the next sub-process:

a. Expand all the #include directives (here, #include <stdio.h>) present in the C source file : Expanding simply means copying the contents of the header file into the C source file. A header file generally consists of

1. Different function declarations related to the header file(Eg: stdio.h will have function declarations of standard input and output functions).
2. Different macros defined.
3. #include of other related header files.
4. A bunch of typedef s of different datatypes.

b. Replace macros (here, #define NUMBER 100) with their actual values : Wherever the macro NUMBER is used in the C source file, it is replaced by its value.

Have a look at the last few lines of code1.i file

extern int pclose (FILE *__stream);
extern char *ctermid (char *__s) __attribute__ ((__nothrow__ , __leaf__));
# 912 "/usr/include/stdio.h" 3 4
extern void flockfile (FILE *__stream) __attribute__ ((__nothrow__ , __leaf__));
extern int ftrylockfile (FILE *__stream) __attribute__ ((__nothrow__ , __leaf__)) ;
extern void funlockfile (FILE *__stream) __attribute__ ((__nothrow__ , __leaf__));
# 942 "/usr/include/stdio.h" 3 4

# 2 "code1.c" 2

# 5 "code1.c"
int a = 10;
int b;

int main() {
 int c = 123, number = 100;
 char d = 'x';

 printf("Hello world!!\n");

 return 0; 
}

Observe that the lines above our own code are all function declarations copied in from stdio.h. If you open code1.i using vi and explore, you will see many more of them. Also observe that number = NUMBER has changed to number = 100. The macro was replaced by its value.
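Macro replacement can also be observed from inside a program. Here is a minimal sketch; the helper macros STR_ and STR and the functions number_as_text and number_doubled are hypothetical names made up for this illustration and are not part of code1.c. It relies on the preprocessor's # operator, which turns a macro argument into a string literal, so we can capture the text that CPP substitutes for NUMBER.

```c
#define NUMBER 100

/* Hypothetical helpers: STR(NUMBER) first expands NUMBER to 100,
   then STR_ turns that text into the string literal "100". */
#define STR_(x) #x
#define STR(x)  STR_(x)

/* After preprocessing, this body reads: return "100"; */
const char *number_as_text(void)
{
    return STR(NUMBER);
}

/* After preprocessing, this body reads: return 100 * 2; */
int number_doubled(void)
{
    return NUMBER * 2;
}
```

If you compile a file containing these functions with -save-temps and open its .i file, you should find no trace of NUMBER, STR or STR_, only their expansions.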

c. Conditional compilation : In our example, stdio.h is included twice. When the same variable or function is defined twice, we generally get a redefinition error. Here we are including stdio.h twice, which means the same set of declarations and other statements would be pasted in twice. Yet we never got an error while compiling. Let us understand why. Look at the first few lines of stdio.h (located at /usr/include/stdio.h).

#ifndef _STDIO_H

#if !defined __need_FILE && !defined __need___FILE
# define _STDIO_H   1

And last line of stdio.h is

#endif /* !_STDIO_H */

Here, conditional compilation is used to make sure the contents of stdio.h are processed only once. If _STDIO_H is not defined (#ifndef _STDIO_H), the header defines it as 1 (# define _STDIO_H 1) and its contents are processed. When the second #include <stdio.h> is reached, the preprocessor again checks whether _STDIO_H is defined. This time it is already defined as 1, so everything between the #ifndef and the matching #endif at the end of the header is skipped. This way, only the first #include <stdio.h> has any effect. This pattern is commonly called an include guard. If you check out stdio.h (or any other header file for that matter) in detail, you will find a lot of conditional compilation used.
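The same guard pattern can be tried out in a single file. The sketch below pastes the contents of a hypothetical header (the guard macro TWICE_H and the functions twice and four_times are invented for this example) two times, mimicking what two #include lines would do.

```c
/* First "#include": TWICE_H is not defined yet, so this copy is kept. */
#ifndef TWICE_H
#define TWICE_H 1
static inline int twice(int x) { return 2 * x; }
#endif /* !TWICE_H */

/* Second "#include": TWICE_H is already defined, so CPP skips this copy.
   Without the guard, the compiler would report a redefinition of twice. */
#ifndef TWICE_H
#define TWICE_H 1
static inline int twice(int x) { return 2 * x; }
#endif /* !TWICE_H */

/* Uses the single surviving definition of twice. */
int four_times(int x)
{
    return twice(twice(x));
}
```

Removing the two #ifndef / #endif pairs should make the compilation fail, which is the error the guard in stdio.h protects us from.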

  • System header files are typically located in the /usr/include directory.

  • At the end of preprocessing, we have code1.i, which is also C source code, with the above mentioned changes.

2. Compiling :

  • The first thing to understand at this point is that Compiling is 1 of the 4 sub-processes, yet we generally call the whole conversion process Compiling / Compilation as well. It is important to note this.

  • The compiler takes code1.i as input, compiles it and gives code1.s as output. The s in code1.s stands for source. Back when programmers wrote their programs directly in assembly, they named the assembly source files with the .s extension. The same convention is still followed.

  • code1.s is the assembly equivalent of code1.c. The machine I am using is a 64-bit Intel machine. Assembly language which can be understood by this machine is x86_64 assembly language. So, the compiler generates assembly code in x86_64 language. Take a look at code1.s .

    ~/rev_eng_series/post_1$ cat code1.s
        .file   "code1.c"
        .globl  a
        .data
        .align 4
        .type   a, @object
        .size   a, 4
    a:
        .long   10
        .comm   b,4,4
        .section    .rodata
    .LC0:
        .string "Hello world!!"
        .text
        .globl  main
        .type   main, @function
    main:
    .LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        subq    $16, %rsp
        movl    $123, -8(%rbp)
        movl    $100, -4(%rbp)
        movb    $120, -9(%rbp)
        movl    $.LC0, %edi
        call    puts
        movl    $0, %eax
        leave
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
    .LFE0:
        .size   main, .-main
        .ident  "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609"
        .section    .note.GNU-stack,"",@progbits
    
  • Unlike C, which is a single programming language, every processor family has its own architecture and is designed to understand a specific instruction set, which comes with its own assembly language. That way, there are many assembly languages, each meant for a specific processor family. A processor understands only its own instruction set, although the same instructions can be written in different syntaxes: gcc emits AT&T syntax (as in code1.s above), while the objdump -Mintel option used later prints Intel syntax. A few examples of processors and their assembly languages are

a. Intel’s core i7 - x86_64 assembly lang

b. Intel’s Pentium - x86 assembly lang

c. IBM’s PowerPC - ppc, ppc64 assembly lang

d. Oracle’s Sparc - SPARC assembly lang

  • If we have a ppc .s file, the program assembled from it can run only on a PowerPC machine; it cannot run on Intel processors.

  • The compiler does a lot of processing to generate the .s file from the .c/.cpp file. We will not get into compiler design, but let us understand what a compiler does.

    a. Conversion of C/C++ programs to assembly language.

    b. It performs optimizations when asked to (for example, with the -O2 option). By default, gcc uses -O0, which does almost no optimization. That is why the unused local variables still show up in code1.s.

    c. It eliminates dead code. Sometimes, there are pieces of code which will never get executed. When optimization is enabled, such pieces are identified and removed.

    d. One of the most important ones: it strips the names and datatypes of local variables. If you refer to code1.c, there were 3 local variables: int c = 123, number = NUMBER and char d = 'x'. If you search code1.s for the names c, number, d or their datatypes, you will not find them, because the compiler has removed them (unless you ask for debug information with -g).

  • After the compilation sub-process, we have code1.s, the assembly language equivalent of code1.c.

3. Assembling :

  • A system program called Assembler assembles code1.s to give code1.o file.

  • The .o in code1.o stands for object.

  • An object file has machine code along with some metadata useful for the next sub-process.

  • If you open code1.o with a text editor, it is really hard to understand what it contains. So, let us use a command-line tool called objdump to analyze its contents.

  • objdump stands for object dump, which means “give the dump of the object file specified”. Let us see what that dump contains. This is how you use objdump. The file code1.o.objdump will have the dump. (Do not worry about the file extensions. They are given just to make our job easy. Linux does not give much importance to file extensions.)

    ~/rev_eng_series/post_1$ objdump -Mintel -D code1.o > code1.o.objdump
    ~/rev_eng_series/post_1$
    
  • It typically has 5 sections: .text, .data, .rodata, .comment and .eh_frame. We have got the disassembly of each section present in the code1.o object file. Disassembly means undoing the assembling of the object file and getting its assembly code back (with certain changes). Let us focus on the first 3 sections: .text, .data and .rodata.

a. .text section : This section consists of the machine code of all the functions we have written in the C source file. In our example, main is the only function. Take a look at this.

    code1.o:     file format elf64-x86-64


    Disassembly of section .text:

    0000000000000000 <main>:
       0:   55                      push   rbp
       1:   48 89 e5                mov    rbp,rsp
       4:   48 83 ec 10             sub    rsp,0x10
       8:   c7 45 f8 7b 00 00 00    mov    DWORD PTR [rbp-0x8],0x7b
       f:   c7 45 fc 64 00 00 00    mov    DWORD PTR [rbp-0x4],0x64
      16:   c6 45 f7 78             mov    BYTE PTR [rbp-0x9],0x78
      1a:   bf 00 00 00 00          mov    edi,0x0
      1f:   e8 00 00 00 00          call   24 <main+0x24>
      24:   b8 00 00 00 00          mov    eax,0x0
      29:   c9                      leave  
      2a:   c3                      ret    

  • Note on objdump output: The rightmost column (push rbp, mov rbp,rsp etc.) holds the assembly instructions. The middle column holds the machine code bytes of those instructions, written in hexadecimal. The leftmost column is the offset of each instruction from the start of the section; you can think of it as a serial number for now.

  • We observed that names and datatypes of local variables are removed during compilation. Instead, the compiler reserves space on the stack: 4 bytes for each int and 1 byte for each char. E.g.: 0x7b = 123 in decimal, and it is stored at address rbp-0x8 (do not worry about what rbp is; it will be explained in detail in the next post). So, whenever we refer to variable c in our C program (code1.c), at assembly level it is referred to as rbp-0x8. This is a rough example. The next post will give clear details about this.
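The last four bytes of c7 45 f8 7b 00 00 00 are the value 123 laid out as a 4-byte integer. A small sketch, assuming a little-endian machine with 4-byte int such as x86_64 Linux (the function name int_to_bytes is made up for this example), reproduces the same byte order from C:

```c
#include <string.h>

/* Copy the in-memory representation of an int into out[].
   Assumption: little-endian machine with 4-byte int (as on x86_64 Linux),
   so the least significant byte comes first. */
void int_to_bytes(int value, unsigned char out[sizeof(int)])
{
    memcpy(out, &value, sizeof value);
}
```

For 123 this gives 7b 00 00 00 and for 100 it gives 64 00 00 00, matching the trailing bytes of the two mov DWORD PTR instructions in the listing above.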

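b. .data section : This section consists of initialized global and static variables. Ideally, only the .text section should be disassembled, because it is the only section containing machine code. But we passed the -D (disassemble all) option to objdump, so it decodes the bytes of every section as if they were instructions, including .data. The instructions printed for .data are meaningless and you do not have to worry about them. (The -d option disassembles only the sections that are expected to contain code.)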

    Disassembly of section .data:

    0000000000000000 <a>:
       0:   0a 00                   or     al,BYTE PTR [rax]

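c. .rodata section : This section consists of all read-only (ro) data. In our example, the string Hello world!! is the only read-only item in the file. (The trailing \n is missing because the compiler replaced printf with puts, as explained later.)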

    Disassembly of section .rodata:

    0000000000000000 <.rodata>:
       0:   48                      rex.W
       1:   65 6c                   gs ins BYTE PTR es:[rdi],dx
       3:   6c                      ins    BYTE PTR es:[rdi],dx
       4:   6f                      outs   dx,DWORD PTR ds:[rsi]
       5:   20 77 6f                and    BYTE PTR [rdi+0x6f],dh
       8:   72 6c                   jb     76 <main+0x76>
       a:   64 21 21                and    DWORD PTR fs:[rcx],esp

  • If you look closely, 0x48 is the ASCII code for H, 0x65 for e, 0x6c for l and so on. You can use the ascii command-line tool for reference. If it is not installed, you can install it in this way.

    ~/rev_eng_series/post_1$ sudo apt-get install ascii
        
    ~/rev_eng_series/post_1$ ascii
        
    Dec Hex    Dec Hex    Dec Hex  Dec Hex  Dec Hex  Dec Hex   Dec Hex   Dec Hex  
      0 00 NUL  16 10 DLE  32 20    48 30 0  64 40 @  80 50 P   96 60 `  112 70 p
      1 01 SOH  17 11 DC1  33 21 !  49 31 1  65 41 A  81 51 Q   97 61 a  113 71 q
      2 02 STX  18 12 DC2  34 22 "  50 32 2  66 42 B  82 52 R   98 62 b  114 72 r
      3 03 ETX  19 13 DC3  35 23 #  51 33 3  67 43 C  83 53 S   99 63 c  115 73 s
      4 04 EOT  20 14 DC4  36 24 $  52 34 4  68 44 D  84 54 T  100 64 d  116 74 t
      5 05 ENQ  21 15 NAK  37 25 %  53 35 5  69 45 E  85 55 U  101 65 e  117 75 u
      6 06 ACK  22 16 SYN  38 26 &  54 36 6  70 46 F  86 56 V  102 66 f  118 76 v
      7 07 BEL  23 17 ETB  39 27 '  55 37 7  71 47 G  87 57 W  103 67 g  119 77 w
      8 08 BS   24 18 CAN  40 28 (  56 38 8  72 48 H  88 58 X  104 68 h  120 78 x
      9 09 HT   25 19 EM   41 29 )  57 39 9  73 49 I  89 59 Y  105 69 i  121 79 y
     10 0A LF   26 1A SUB  42 2A *  58 3A :  74 4A J  90 5A Z  106 6A j  122 7A z
     11 0B VT   27 1B ESC  43 2B +  59 3B ;  75 4B K  91 5B [  107 6B k  123 7B {
     12 0C FF   28 1C FS   44 2C ,  60 3C <  76 4C L  92 5C \  108 6C l  124 7C |
     13 0D CR   29 1D GS   45 2D -  61 3D =  77 4D M  93 5D ]  109 6D m  125 7D }
     14 0E SO   30 1E RS   46 2E .  62 3E >  78 4E N  94 5E ^  110 6E n  126 7E ~
     15 0F SI   31 1F US   47 2F /  63 3F ?  79 4F O  95 5F _  111 6F o  127 7F DEL
    
  • NOTE:

    • Every instruction and section should have an address, right? But here all sections start at address zero. How can 2 sections be at the same address?

    • Observe the .data section. There is no mention of int b, the uninitialized global variable. But if we have declared it, it should be somewhere, right?

    • The data present in the .rodata section cannot be executed by the processor. It is read-only, non-executable, non-writable data. objdump simply decoded the data in the .rodata section as if it were instructions, but the result makes no sense because the whole section is non-executable.

    • Observe the .text section. There is no mention of printf, which we used in code1.c. But note that there is a call instruction (the 1f: line).

    • There are more, but these are the important ones.
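Coming back to the .rodata dump for a moment: the bytes that objdump tried to read as instructions are just text. A minimal sketch (the names rodata_bytes, rodata_as_text and rodata_text_length are made up for this example) puts the same bytes in an array and views them as a C string. The final 0x00 is an assumption added here, since every C string ends with a terminating zero byte and the listing above stops at the last '!'.

```c
#include <string.h>

/* Bytes copied from the objdump listing of .rodata, plus the terminating
   0x00 that a C string carries (assumed; not shown in the listing). */
static const unsigned char rodata_bytes[] = {
    0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77,
    0x6f, 0x72, 0x6c, 0x64, 0x21, 0x21, 0x00
};

/* View the raw bytes as text. */
const char *rodata_as_text(void)
{
    return (const char *)rodata_bytes;
}

/* Number of characters before the terminator. */
size_t rodata_text_length(void)
{
    return strlen(rodata_as_text());
}
```

Passing rodata_as_text() to puts prints Hello world!!, the 13 characters stored in the section.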

To resolve a few of the issues mentioned above, let us use another tool called readelf to analyze code1.o .

  • ELF stands for Executable and Linkable Format. For now, it is enough to know that any native program we want to execute on a Linux machine must be in this format. A binary in any other format cannot be run directly, even if it has machine code in it. Similar to ELF, Windows has its own executable format. It is known as the PE (Portable Executable) file format.

a. Object file(here code1.o) contains a table known as Symbol Table. Take a look at this symbol table.

    ~/rev_eng_series/post_1$ readelf -s code1.o

    Symbol table '.symtab' contains 13 entries:
       Num:    Value          Size Type    Bind   Vis      Ndx Name
         0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
         1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS code1.c
         2: 0000000000000000     0 SECTION LOCAL  DEFAULT    1 
         3: 0000000000000000     0 SECTION LOCAL  DEFAULT    3 
         4: 0000000000000000     0 SECTION LOCAL  DEFAULT    4 
         5: 0000000000000000     0 SECTION LOCAL  DEFAULT    5 
         6: 0000000000000000     0 SECTION LOCAL  DEFAULT    7 
         7: 0000000000000000     0 SECTION LOCAL  DEFAULT    8 
         8: 0000000000000000     0 SECTION LOCAL  DEFAULT    6 
         9: 0000000000000000     4 OBJECT  GLOBAL DEFAULT    3 a
        10: 0000000000000004     4 OBJECT  GLOBAL DEFAULT  COM b
        11: 0000000000000000    43 FUNC    GLOBAL DEFAULT    1 main
        12: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND puts
    ~/rev_eng_series/post_1$ 
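
  • Focus on symbol numbers 9, 10, 11 and 12. Their names are a, b, main and puts respectively.

    • a is a global object of size 4 bytes.

    • b is a global object of size 4 bytes. COM stands for COMMON symbol: an uninitialized global whose storage is allocated later by the linker. For a COMMON symbol, the Value column (4 here) holds the required alignment, not an address.

    • main is a global function of size 43 bytes.

    • puts is a global symbol, but its type is not known (NOTYPE) and it is not defined in this file (UND). So, at this stage, the assembler does not know what puts is (though we know it). NOTE: When printf() is called with a plain string that ends in a newline and contains no format specifiers, gcc replaces the call with puts(), which adds the newline itself. That is why there is a puts() here instead of printf().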
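The symbol table also answers the earlier question about b: it has no bytes in .data because there is no initial value to store, only a size to reserve. C still guarantees that such a global starts out as zero. A minimal sketch (read_a and read_b are names invented for this example):

```c
int a = 10;   /* initialized: the value 10 has to be stored in the file (.data) */
int b;        /* uninitialized: only its size is recorded; it starts out as 0 */

/* Return the current values of the two globals. */
int read_a(void) { return a; }
int read_b(void) { return b; }
```

With the gcc 5.4 used in this post, b shows up as a COM symbol. Newer gcc releases default to -fno-common and would place it in the .bss section instead, but the zero initial value is the same either way.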

b. An object file also has a section called the Relocation Section. Have a look at this:

    ~/rev_eng_series/post_1$ readelf -r code1.o

    Relocation section '.rela.text' at offset 0x240 contains 2 entries:
      Offset          Info           Type           Sym. Value    Sym. Name + Addend
    00000000001b  00050000000a R_X86_64_32       0000000000000000 .rodata + 0
    000000000020  000c00000002 R_X86_64_PC32     0000000000000000 puts - 4

    Relocation section '.rela.eh_frame' at offset 0x270 contains 1 entries:
      Offset          Info           Type           Sym. Value    Sym. Name + Addend
    000000000020  000200000002 R_X86_64_PC32     0000000000000000 .text + 0
    ~/rev_eng_series/post_1$ 

  • We will come to the meaning of Relocation in the next sub-process.

    • For now, just observe that the .rela.text section refers to 2 symbols: .rodata and puts.

  • This means code1.o has information about puts() in its Symbol Table and Relocation Section.

We resolved a few of the issues mentioned in the NOTE, but not all. We still have to see what relocation is and what happens to puts.

4. Linking :

  • The last part of the conversion is linking, done by a system program called the linker. It takes one or more object files and libraries (such as the shared library libc) as input. If linking is successful, it generates the executable file. Otherwise, it gives a linking error.

  • An object file has no absolute addresses. Every section starts at address 0, and everything else in a section is numbered relative to that starting address. This cannot work in an actual executable file, where every section needs a definite / absolute address. The linker relocates (or shifts) each section so that every section has a unique starting address, and patches the places listed in the relocation sections accordingly. This is the meaning of Relocation.

  • The linker links the symbols referred to in the Relocation Table to their definitions. This is known as Symbol Resolution. E.g.:

    • The symbol main is linked to .text + 0x00 because that is where the body of the main function is defined. Then how and what will it link puts to? We just have its symbol in the Relocation Table, but we never explicitly defined it anywhere in our C program.

    • The linker finds the definition of puts in libc, the Standard C Library, and links puts to that.

  • The linker then gives an absolute address to every section in the object file and adds a few more sections, thus making it a complete executable file.

  • Let us check out code1 executable file using readelf.

    ~/rev_eng_series/post_1$ readelf -r code1
        
    Relocation section '.rela.dyn' at offset 0x380 contains 1 entries:
      Offset          Info           Type           Sym. Value    Sym. Name + Addend
    000000600ff8  000300000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0
        
    Relocation section '.rela.plt' at offset 0x398 contains 2 entries:
      Offset          Info           Type           Sym. Value    Sym. Name + Addend
    000000601018  000100000007 R_X86_64_JUMP_SLO 0000000000000000 puts@GLIBC_2.2.5 + 0
    000000601020  000200000007 R_X86_64_JUMP_SLO 0000000000000000 __libc_start_main@GLIBC_2.2.5 + 0
    ~/rev_eng_series/post_1$ 
    
    • Observe the first entry of the .rela.plt section. There is puts. The linker has identified that it is a C library function. Normally, libc functions are dynamically linked. That is, their definitions are present in the libc.so shared object file, and their actual addresses are filled in at run time by the dynamic linker (by default, the first time each function is called). That is why it is known as Dynamic Linking.

    • Finally, we will just look at disassembly of main function in the executable file and wrap up the post.

      ~/rev_eng_series/post_1$ objdump -Mintel -D code1 > code1.objdump
      
    • This is the main function:

      0000000000400526 <main>:
        400526:       55                      push   rbp
        400527:       48 89 e5                mov    rbp,rsp
        40052a:       48 83 ec 10             sub    rsp,0x10
        40052e:       c7 45 f8 7b 00 00 00    mov    DWORD PTR [rbp-0x8],0x7b
        400535:       c7 45 fc 64 00 00 00    mov    DWORD PTR [rbp-0x4],0x64
        40053c:       c6 45 f7 78             mov    BYTE PTR [rbp-0x9],0x78
        400540:       bf e4 05 40 00          mov    edi,0x4005e4
        400545:       e8 b6 fe ff ff          call   400400 <puts@plt>
        40054a:       b8 00 00 00 00          mov    eax,0x0
        40054f:       c9                      leave
        400550:       c3                      ret
        400551:       66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
        400558:       00 00 00
        40055b:       0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
      
    • You can notice that every instruction has an absolute address. There is a call 400400 <puts@plt>, which is a call to the puts function. You can go through the whole code1.objdump. Every byte has a unique address. All the sections (.data, .rodata etc.) have absolute addresses. There are new sections added by the linker to make it a working executable.

    • That was about Linking. If you do not get any linking errors, you will have an executable in the directory, which can be run in the following manner.

      ~/rev_eng_series/post_1$ ./code1 
      Hello world!! 
      ~/rev_eng_series/post_1$
      

This is how an executable is generated from a C/C++ program.
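The bytes b6 fe ff ff that follow the e8 opcode in the call instruction of main can be reproduced by hand. For an R_X86_64_PC32 relocation, the x86_64 ABI defines the stored value as S + A - P, where S is the address of the symbol, A is the addend and P is the address of the field being patched. Below is a small sketch of that arithmetic. The function name pc32_value is made up for this illustration, and the numbers come from the readelf and objdump listings shown earlier.

```c
#include <stdint.h>

/* Value the linker stores for an R_X86_64_PC32 relocation: S + A - P.
   symbol = S (address the symbol resolved to),
   addend = A (from the relocation entry),
   place  = P (address of the 4-byte field being patched). */
int32_t pc32_value(uint64_t symbol, int64_t addend, uint64_t place)
{
    int64_t value = (int64_t)symbol + addend - (int64_t)place;
    return (int32_t)value;
}
```

With S = 0x400400 (puts@plt), A = -4 and P = 0x400526 + 0x20 = 0x400546 (start of main plus the relocation offset), the result is -330. As a 32-bit value that is 0xfffffeb6, which is stored as b6 fe ff ff in little-endian order.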

A few other interesting things :

  1. If there is only 1 C sourcefile like code1.c, an executable is generated in the above explained manner. But huge projects will have multiple sourcefiles. How is this situation handled?

    • Let us take an example and understand what happens. Consider 2 C source files, source1.c and source2.c.

      ~/rev_eng_series/post_1$ cat source1.c 
      #include<stdio.h>
      void func();
              
      int main() {
              
          printf("In main()\n");
          func();
      }
      ~/rev_eng_series/post_1$ cat source2.c
      void func() {
              
          printf("In func()\n");
          main();
      }
      ~/rev_eng_series/post_1$ 
      
    • Compile these 2 files in the following manner.

      ~/rev_eng_series/post_1$ gcc source1.c source2.c -o target -save-temps
      
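    • Go through the warnings you get. They appear because source2.c calls printf and main without declaring them (it has no #include <stdio.h>). For this experiment it is alright to ignore them, although recent gcc releases may report them as errors. Also note that main and func call each other endlessly, so we only build target here and do not run it. You will have these files after compilation.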

      ~/rev_eng_series/post_1$ ls -l source*
      -rw-rw-r-- 1 adwi adwi    82 Jun 21 21:12 source1.c
      -rw-rw-r-- 1 adwi adwi 17121 Jun 21 21:19 source1.i
      -rw-rw-r-- 1 adwi adwi  1568 Jun 21 21:19 source1.o
      -rw-rw-r-- 1 adwi adwi   481 Jun 21 21:19 source1.s
      -rw-rw-r-- 1 adwi adwi    50 Jun 21 21:12 source2.c
      -rw-rw-r-- 1 adwi adwi   182 Jun 21 21:19 source2.i
      -rw-rw-r-- 1 adwi adwi  1568 Jun 21 21:19 source2.o
      -rw-rw-r-- 1 adwi adwi   471 Jun 21 21:19 source2.s
      ~/rev_eng_series/post_1$ ls -l target
      -rwxrwxr-x 1 adwi adwi 8672 Jun 21 21:19 target
      
    • You can notice that there is a .i file, a .s file and an object file (.o) for each of the C source files. From this you can conclude that each C source file is preprocessed, compiled and assembled independently. Then all the object files, along with the shared libraries, are taken by the linker and stitched together to give 1 executable, target here.

    • This is how executable is generated when there are multiple C sourcefiles.

  2. Determining the Entry Point of execution:

    • When an executable is run, there is always a starting point( or a first instruction) from where execution starts. We have written the main function. It is normal to think that main is the entry point of execution. But let us go a little deep into this. Using readelf tool, let us analyze the executable’s(code1) header.

         ~/rev_eng_series/post_1$ readelf -h code1
         ELF Header:
         Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
         Class:                             ELF64
         Data:                              2's complement, little endian
         Version:                           1 (current)
         OS/ABI:                            UNIX - System V
         ABI Version:                       0
         Type:                              EXEC (Executable file)
         Machine:                           Advanced Micro Devices X86-64
         Version:                           0x1
         Entry point address:               0x400430
         Start of program headers:          64 (bytes into file)
         Start of section headers:          6672 (bytes into file)
         Flags:                             0x0
         Size of this header:               64 (bytes)
         Size of program headers:           56 (bytes)
         Number of program headers:         9
         Size of section headers:           64 (bytes)
         Number of section headers:         31
         Section header string table index: 28
         ~/rev_eng_series/post_1$ 
      
    • Look at the 10th entry. It says Entry point address = 0x400430. Take a look at starting address of main function. It is 0x400526. This means main function is not the entry point. Open code1.objdump in a text editor and check for address 0x400430.

    • A label called _start is, by default, the entry point of an executable. This requires some explanation. In the object file (code1.o), we have the machine code of all the functions and other important sections. For the system to run these functions (in our example, only main), the system should be prepared to execute the functions we have written. All the preparation and initialization should take place before the main function. So, the linker uses a label named _start as the entry point. From _start until main is called, all the preparation happens. After that, main is called.

    • _start is the label the linker looks for by default. In future posts, when we try out direct assembly programming, we will not use main. We will start the program with _start because the linker searches for it while linking.

    • So, main is a function like any other we write but with a few special properties from a programmer’s point of view.

    • We can say that main is the entry point from programmer’s perspective and _start is the entry point from system’s perspective.

  3. A small note on usage of #define:

    • As we discussed in the post, all the macros are replaced by their values by CPP. So, these macros (NUMBER) never get processed by the compiler. Consider the following scenario.

      • There is a huge code base that you have not written and there is a macro #define NUMBER 100 in one of the sourcefiles which you are not aware of. When you compile it, suppose you are getting an error related to this number 100. The compiler gives where exactly the error is. But you would never know that the 100 was a macro first, which got processed by CPP. We would not even know if it is a macro. Even if we somehow come to know, it is really hard to find this macro in the huge code base.

      • There is a solution for this. Instead of #define NUMBER 100, you can use const int NUMBER = 100. So, when there is a compilation error, we will know that error is with NUMBER because it is a variable and not a macro. From system’s perspective, we are using 4 extra bytes. But the ease of maintenance and ease with which we can catch the errors is enhanced.

      • Sometimes, we use #define to define small functions like finding square of a number. For the same reason stated above, you can use inline functions instead of #define.
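A short sketch of both suggestions follows. The names SQUARE_MACRO, square and number_squared are invented for this illustration. Besides being easier to trace in error messages, the inline function also avoids a classic macro pitfall that is not covered above: because a macro is pasted in as text, SQUARE_MACRO(1 + 2) becomes 1 + 2 * 1 + 2.

```c
/* Macro version: plain text substitution done by CPP.
   Deliberately written without parentheses to show the pitfall. */
#define SQUARE_MACRO(x) x * x

/* Alternatives: a typed constant and an inline function,
   both of which the compiler itself gets to see. */
static const int NUMBER = 100;
static inline int square(int x) { return x * x; }

/* Uses the constant and the inline function together. */
int number_squared(void)
{
    return square(NUMBER);
}
```

Here SQUARE_MACRO(1 + 2) evaluates to 5, while square(1 + 2) evaluates to 9, and number_squared() returns 10000.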

Conclusion :

I hope this post has given an idea of how a C/C++ program is converted into an executable. I suggest reading it slowly and understanding it rather than rushing through the article.

There were many things related to the Symbol Table and the Relocation Table which we did not get into in detail. We will take them up in the next post. The next post will be on the Internal Structure of an executable file. We will discuss what ELF is in detail, what dynamic linking is, how exactly a function like puts() is linked at runtime and more.

That is it for this post. I thoroughly enjoyed writing it. Hope you enjoyed reading it :)


Go to next post: What does an Executable contain? - Internal Structure of an ELF Executable - Part1