Rust offers a bunch of primitive datatypes - integer types (signed and unsigned, in 8/16/32/64/128-bit widths, plus the pointer-sized isize/usize), char, bool, string slices (str, the type of string literals), arrays, tuples and slices.

In this post, we will be exploring a few of those types - the integer types, arrays and tuples.
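
Before looking at assembly, it can help to see how wide each of these types is. The following is a minimal sketch, not code from this post's repository; the helper name fixed_sizes is made up for illustration.

```rust
use std::mem::size_of;

/// Sizes in bytes of a few fixed-width primitives (hypothetical helper).
fn fixed_sizes() -> [usize; 7] {
    [
        size_of::<u8>(),
        size_of::<i16>(),
        size_of::<u32>(),
        size_of::<i64>(),
        size_of::<u128>(),
        size_of::<char>(),
        size_of::<bool>(),
    ]
}

fn main() {
    // The number in an integer type's name is its width in bits.
    println!("u8, i16, u32, i64, u128, char, bool = {:?}", fixed_sizes());

    // usize/isize are pointer-sized, so this depends on the target.
    println!("usize     = {} bytes", size_of::<usize>());

    // An array is N elements back to back.
    println!("[u32; 4]  = {} bytes", size_of::<[u32; 4]>());

    // A tuple may contain padding, so its size can exceed the sum of its fields.
    println!("(u8, u32) = {} bytes", size_of::<(u8, u32)>());
}
```

The signed and unsigned variants of a given width occupy the same number of bytes; only the interpretation of the bits differs.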

If you don’t have the Rust toolchain installed on your system, you can go through this to create a virtual environment and install the Rust toolchain there.

All code used/written in this post is present here.

0. Code optimization

Before we explore the datatypes, let us look at the different optimization levels present in rustc as well as in gcc.

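The Rust compiler offers numbered optimization levels much like gcc does: opt-level=0 to 3 in rustc and -O0 to -O3 in gcc, and both also offer size-oriented levels. In this post we will look at levels 0, 1 and 2. Let us take an example and go through them.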

Consider the following C program.

rust/Rust-C-experiments/primitive-types > cat opt_example.c 
#include <stdio.h>
#include <stdint.h>

int main()
{
    uint32_t    x = 10;
    return 0;
}

I am using gcc-4.8.5. Let us compile it with no optimization (or optimization level 0) and get its disassembly.

rust/Rust-C-experiments/primitive-types > gcc opt_example.c -o opt_example_c0 -O0
rust/Rust-C-experiments/primitive-types > objdump -Mintel -D opt_example_c0 > opt_example_c0.obj

The following is main’s assembly code.

00000000004004ed <main>:                                                        
  4004ed:   55                      push   rbp                                  
  4004ee:   48 89 e5                mov    rbp,rsp                              
  4004f1:   c7 45 fc 0a 00 00 00    mov    DWORD PTR [rbp-0x4],0xa              
  4004f8:   b8 00 00 00 00          mov    eax,0x0                              
  4004fd:   5d                      pop    rbp                                  
  4004fe:   c3                      ret                                         
  4004ff:   90                      nop

The code generated goes well with the general understanding that a local variable is stored in the stack. The local variable x is placed in the stack at the memory location rbp-0x4. Then 10(or 0xa) is loaded into it. Then the function returns. This assembly code is an exact equivalent of our C code.

Now, let us compile the code with -O1 optimization level and check.

00000000004004ed <main>:                                                        
  4004ed:   b8 00 00 00 00          mov    eax,0x0                              
  4004f2:   c3                      ret                                         
  4004f3:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]          
  4004fa:   00 00 00                                                            
  4004fd:   0f 1f 00                nop    DWORD PTR [rax]

This code is almost nothing. It’ll load 0x0 into eax - which is the assembly equivalent of return 0. The local variable x is not seen here. Is this valid code?

Anyway x is not being used anywhere. Why should it be given memory and why should it be initialized? That is what the compiler is thinking. This actually looks like the most optimized code. What else can be done?

Take a look at the assembly code generated by -O2 optimization level.

0000000000400400 <main>:                                                        
  400400:   31 c0                   xor    eax,eax                              
  400402:   c3                      ret

Is there an optimization here? The mov is replaced with an xor instruction. How is that an optimization? The end goal is to zero the eax register; how that is done is left to the compiler. xor eax,eax is the standard x86 idiom for zeroing a register, and modern processors handle it at least as cheaply as the mov. The other aspect is instruction size: the mov is 5 bytes long whereas the xor is 2 bytes long. Fewer bytes are fetched to do the same work.

gcc 4.8.5 accepts even -O10 without returning an error. As far as I can tell, levels above 3 are simply treated as -O3, and for this program the code was the same as with -O2.

Now that we have set up a frame of reference to measure the code emitted by the Rust compiler, let us take the following Rust program.

rust/Rust-C-experiments/primitive-types > cat opt_example.rs
fn main()
{
    let x: u32 = 10;
}

Let us compile it with optimization level 0 and get its disassembly.

rust/Rust-C-experiments/primitive-types > rustc opt_example.rs -o opt_example_rs0 -C opt-level=0
warning: unused variable: `x`
 --> opt_example.rs:3:9
  |
3 |     let x: u32 = 10;
  |         ^ help: if this is intentional, prefix it with an underscore: `_x`
  |
  = note: `#[warn(unused_variables)]` on by default

warning: 1 warning emitted

Get its disassembly.

rust/Rust-C-experiments/primitive-types > objdump -Mintel -D opt_example_rs0 > opt_example_rs0.obj

The following is rust main’s disassembly.

0000000000007cc0 <_ZN11opt_example4main17h268d7c689877f0faE>:                   
    7cc0:   c3                      ret                                         
    7cc1:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]          
    7cc8:   00 00 00                                                            
    7ccb:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]

This is the code emitted with opt-level=0. This has optimized out the variable x. There is a simple return and nothing else. For the main function, there is nothing else to optimize. So, opt-level=1,2 emit the same code.
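
If you would rather keep such a value visible in the disassembly, one option is std::hint::black_box. This is only a sketch under assumptions: black_box was stabilized in Rust 1.66, so it needs a newer toolchain than the one used in this post, it works on a best-effort basis, and the helper name keep_alive is made up for illustration.

```rust
use std::hint::black_box;

/// Hypothetical helper: hands the value back after hiding it from the optimizer.
fn keep_alive(x: u32) -> u32 {
    // black_box is an identity function that the optimizer is asked to treat
    // as opaque, so the value passed in counts as "used".
    black_box(x)
}

fn main() {
    let x: u32 = 10;
    let y = keep_alive(x);
    assert_eq!(y, 10);
}
```

Compile it with the different opt-levels and check whether the constant 10 now shows up in the emitted code.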

I conveniently ignored all the nop (No OPeration) instructions present after the ret. A function is done once it executes its ret instruction, so why are there nops after ret? Aligning functions and loops to a byte boundary such as 16 or 32 bytes is also an optimization, and the nops are the padding that makes the next function start on such a boundary. This stackoverflow answer gives a good explanation. This of course increases the size of the binary a little bit.

What we saw is a very naive, simple example of compiler optimization. Present day compilers are extremely intelligent and have various types of optimizations. As we move forward, we will write larger programs, with different language constructs where we get to see various different optimizations.

Optimizations can be so aggressive that the emitted code may not resemble the code you have written. So, get ready for optimizations and surprises. When you see an optimized version, you should compare it with the code you have written and see if it preserves your intention - even though the code is different.

With that, let us start with our first class of primitive types, the integer types.

1. Integer datatypes

Because we are analyzing code emitted by the Rust compiler, what do we compare it with? We need a frame of reference and that would be the code emitted by gcc - for equivalent C programs. We will first write the C program, look at the generated code, thereby setting up our frame of reference. Then jump into rust code.

Let us consider the following C program.

rust/Rust-C-experiments/primitive-types > cat uint.c
#include <stdio.h>
#include <stdint.h>

void dummy (uint32_t i)
{
    printf("%u\n", i);
}

int main ()
{
    uint32_t i = 10;
    dummy(i);
}

I want you to check what main’s code looks like when compiled with each of the 3 gcc optimization levels.

The following is with no optimization or optimization level 0.

000000000040054e <main>:                                                        
  40054e:   55                      push   rbp                                  
  40054f:   48 89 e5                mov    rbp,rsp                              
  400552:   48 83 ec 10             sub    rsp,0x10                             
  400556:   c7 45 fc 0a 00 00 00    mov    DWORD PTR [rbp-0x4],0xa              
  40055d:   8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]              
  400560:   89 c7                   mov    edi,eax                              
  400562:   e8 c6 ff ff ff          call   40052d <dummy>                       
  400567:   c9                      leave                                       
  400568:   c3                      ret                     
  400569:   0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]

This is the expected code. First, some memory is allocated on the stack - sub rsp,0x10. The value 10 is stored in x at rbp-0x4, loaded from there into eax, and then moved into edi - which is the register used to pass the first argument. Then dummy is called. Pretty straightforward.

Think about it for a moment, can this be optimized? Sure it can be. The variable doesn’t have to be stored in the stack - it can be directly loaded to edi register. With that, the program will still run as intended. That brings us to the code emitted by -O1.

0000000000400547 <main>:                                                        
  400547:   48 83 ec 08             sub    rsp,0x8                              
  40054b:   bf 0a 00 00 00          mov    edi,0xa                              
  400550:   e8 d8 ff ff ff          call   40052d <dummy>                       
  400555:   48 83 c4 08             add    rsp,0x8                              
  400559:   c3                      ret                                         
  40055a:   66 0f 1f 44 00 00       nop    WORD PTR [rax+rax*1+0x0]

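So it’s being done in -O1. Try to see what can be optimized before you look at the optimized code. One thing that looks weird in the above code is that it allocates 8 bytes on the stack - sub rsp,0x8 - even though nothing is stored there. This is not waste: the x86-64 System V ABI requires rsp to be 16-byte aligned at a call instruction, and the return address pushed by main’s caller has left it off by 8, so the sub restores the alignment.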

What else can be optimized? Take a look at code emitted with -O2 flag.

0000000000400440 <main>:                                                        
  400440:   be 0a 00 00 00          mov    esi,0xa                              
  400445:   bf e0 05 40 00          mov    edi,0x4005e0                         
  40044a:   31 c0                   xor    eax,eax                              
  40044c:   e9 bf ff ff ff          jmp    400410 <printf@plt>

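The whole dummy function is optimized out! Its body is inlined into main, and since the printf is the last thing main does, gcc does not even call it: it jumps straight to printf (a tail call), so printf returns directly to main’s caller. This is not how we wrote the program, but the program still works as intended. So, no worries.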

We have set up the frame of reference. Let us jump into Rust code.

rust/Rust-C-experiments/primitive-types > cat uint.rs
fn main()
{
    let mut x: u32 = 10;
    dummy(x);
}

fn dummy(mut x: u32)
{
    println!("{}", x);
}

Note that the idea is to write Rust programs that are as close as possible to the C program. In C, everything is mutable by default, hence the mut in the Rust version. We could equally have used const in C and dropped mut in Rust. Since x is never modified, compiling this will obviously give warnings.

rust/Rust-C-experiments/primitive-types > rustc uint.rs -o uint_rs0 -C opt-level=0
warning: variable does not need to be mutable
 --> uint.rs:3:9
  |
3 |     let mut x: u32 = 10;
  |         ----^
  |         |
  |         help: remove this `mut`
  |
  = note: `#[warn(unused_mut)]` on by default

warning: variable does not need to be mutable
 --> uint.rs:7:10
  |
7 | fn dummy(mut x: u32)
  |          ----^
  |          |
  |          help: remove this `mut`

warning: 2 warnings emitted

Run the program and make sure it prints x. Disassemble it and look at main’s code.

0000000000008280 <_ZN4uint4main17h8ca9df4d1f781c43E>:                           
    8280:   50                      push   rax                                  
    8281:   bf 0a 00 00 00          mov    edi,0xa                              
    8286:   e8 05 00 00 00          call   8290 <_ZN4uint5dummy17hc21d4f4c11081246E>
    828b:   58                      pop    rax                                  
    828c:   c3                      ret                                         
    828d:   0f 1f 00                nop    DWORD PTR [rax]

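This looks like the code emitted by gcc with -O1. The push rax / pop rax pair plays the same role as gcc’s sub rsp,0x8 / add rsp,0x8: it keeps the stack 16-byte aligned across the call.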

Take a look at the dummy function.

0000000000008290 <_ZN4uint5dummy17hc21d4f4c11081246E>:                          
    8290:   48 83 ec 78             sub    rsp,0x78                             
    8294:   48 8d 35 f5 f4 02 00    lea    rsi,[rip+0x2f4f5]        # 37790 <_ZN4core3fmt3num3imp52_$LT$impl$u20$core..fmt..Display$u20$for$u20$u32$GT$3fmt17h2e6d06c5e3120a4fE>
    829b:   89 7c 24 2c             mov    DWORD PTR [rsp+0x2c],edi             
    829f:   48 8b 05 d2 d2 23 00    mov    rax,QWORD PTR [rip+0x23d2d2]        # 245578 <__dso_handle+0x58>
    82a6:   48 8d 4c 24 2c          lea    rcx,[rsp+0x2c]                       
    82ab:   48 89 4c 24 70          mov    QWORD PTR [rsp+0x70],rcx             
    82b0:   48 8b 7c 24 70          mov    rdi,QWORD PTR [rsp+0x70]             
    82b5:   48 89 44 24 20          mov    QWORD PTR [rsp+0x20],rax             
    82ba:   e8 41 fe ff ff          call   8100 <_ZN4core3fmt10ArgumentV13new17h10447e08ba50c839E>
    82bf:   48 89 44 24 18          mov    QWORD PTR [rsp+0x18],rax             
    82c4:   48 89 54 24 10          mov    QWORD PTR [rsp+0x10],rdx             
    82c9:   48 8b 44 24 18          mov    rax,QWORD PTR [rsp+0x18]             
    82ce:   48 89 44 24 60          mov    QWORD PTR [rsp+0x60],rax             
    82d3:   48 8b 4c 24 10          mov    rcx,QWORD PTR [rsp+0x10]             
    82d8:   48 89 4c 24 68          mov    QWORD PTR [rsp+0x68],rcx             
    82dd:   48 8d 54 24 60          lea    rdx,[rsp+0x60]                       
    82e2:   48 8d 7c 24 30          lea    rdi,[rsp+0x30]                       
    82e7:   48 8b 74 24 20          mov    rsi,QWORD PTR [rsp+0x20]             
    82ec:   41 b8 02 00 00 00       mov    r8d,0x2                              
    82f2:   48 89 54 24 08          mov    QWORD PTR [rsp+0x8],rdx              
    82f7:   4c 89 c2                mov    rdx,r8                               
    82fa:   48 8b 4c 24 08          mov    rcx,QWORD PTR [rsp+0x8]              
    82ff:   41 b8 01 00 00 00       mov    r8d,0x1                              
    8305:   e8 46 fe ff ff          call   8150 <_ZN4core3fmt9Arguments6new_v117hd291308d219e9dddE>
    830a:   48 8d 7c 24 30          lea    rdi,[rsp+0x30]                       
    830f:   ff 15 7b fb 23 00       call   QWORD PTR [rip+0x23fb7b]        # 247e90 <_GLOBAL_OFFSET_TABLE_+0x538>
    8315:   48 83 c4 78             add    rsp,0x78                             
    8319:   c3                      ret                                         
    831a:   66 0f 1f 44 00 00       nop    WORD PTR [rax+rax*1+0x0]

That is a lot of code. Remember that println! is a macro and it resolves to actual Rust code - which is not similar to a call to printf. Having this code in main pollutes it - we need to know exactly how variables are being stored and used. That is why, it was pushed into a different dummy function.

What else can be optimized here? We know that the dummy function can be optimized out. Let’s take a look at the code emitted with opt-level=1.

0000000000008200 <_ZN4uint4main17h8ca9df4d1f781c43E>:                           
    8200:   50                      push   rax                                  
    8201:   e8 0a 00 00 00          call   8210 <_ZN4uint5dummy17hc21d4f4c11081246E>
    8206:   58                      pop    rax                                  
    8207:   c3                      ret                                         
    8208:   0f 1f 84 00 00 00 00    nop    DWORD PTR [rax+rax*1+0x0]            
    820f:   00                                                                  
                                                                                
0000000000008210 <_ZN4uint5dummy17hc21d4f4c11081246E>:                          
    8210:   53                      push   rbx                                  
    8211:   48 83 ec 50             sub    rsp,0x50                             
    8215:   c7 44 24 0c 0a 00 00    mov    DWORD PTR [rsp+0xc],0xa              
    821c:   00                                                                  
    821d:   48 8d 35 1c f4 02 00    lea    rsi,[rip+0x2f41c]        # 37640 <_ZN4core3fmt3num3imp52_$LT$impl$u20$core..fmt..Display$u20$for$u20$u32$GT$3fmt17h2e6d06c5e3120a4fE>
    8224:   48 8d 7c 24 0c          lea    rdi,[rsp+0xc]                        
    8229:   e8 c2 fe ff ff          call   80f0 <_ZN4core3fmt10ArgumentV13new17h10447e08ba50c839E>
    822e:   48 89 44 24 10          mov    QWORD PTR [rsp+0x10],rax             
    8233:   48 89 54 24 18          mov    QWORD PTR [rsp+0x18],rdx             
    8238:   48 8d 5c 24 20          lea    rbx,[rsp+0x20]                       
    823d:   48 8d 74 24 10          lea    rsi,[rsp+0x10]                       
    8242:   48 89 df                mov    rdi,rbx                              
    8245:   e8 86 ff ff ff          call   81d0 <_ZN4core3fmt9Arguments6new_v117hdd5eea781ba10264E>
    824a:   48 89 df                mov    rdi,rbx                              
    824d:   ff 15 55 fc 23 00       call   QWORD PTR [rip+0x23fc55]        # 247ea8 <_GLOBAL_OFFSET_TABLE_+0x538>
    8253:   48 83 c4 50             add    rsp,0x50                             
    8257:   5b                      pop    rbx                                  
    8258:   c3                      ret                                         
    8259:   0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]

Observe both functions. You will notice that the local variable we defined in main has been pushed into the dummy function: main no longer passes anything in edi, and dummy stores the constant 10 itself in its stack frame at rsp+0xc. The dummy function also looks shorter. The next obvious optimization is the removal of the dummy function, pushing all of its code back into main itself. There are three functions being called in dummy. Let us explore println! and macros in general in one of the future posts.

The following is the code generated with opt-level=2.

0000000000008380 <_ZN4uint4main17h8ca9df4d1f781c43E>:                           
    8380:   48 83 ec 48             sub    rsp,0x48                             
    8384:   c7 44 24 04 0a 00 00    mov    DWORD PTR [rsp+0x4],0xa              
    838b:   00                                                                  
    838c:   48 8d 44 24 04          lea    rax,[rsp+0x4]                        
    8391:   48 89 44 24 08          mov    QWORD PTR [rsp+0x8],rax              
    8396:   48 8d 05 43 f4 02 00    lea    rax,[rip+0x2f443]        # 377e0 <_ZN4core3fmt3num3imp52_$LT$impl$u20$core..fmt..Display$u20$for$u20$u32$GT$3fmt17h2e6d06c5e3120a4fE>
    839d:   48 89 44 24 10          mov    QWORD PTR [rsp+0x10],rax             
    83a2:   48 8d 05 cf d1 23 00    lea    rax,[rip+0x23d1cf]        # 245578 <anon.97aceed8034fd8a7f2dff7cb65913bba.0.llvm.245903583275124266+0x30>
    83a9:   48 89 44 24 18          mov    QWORD PTR [rsp+0x18],rax             
    83ae:   48 c7 44 24 20 02 00    mov    QWORD PTR [rsp+0x20],0x2             
    83b5:   00 00                                                               
    83b7:   48 c7 44 24 28 00 00    mov    QWORD PTR [rsp+0x28],0x0             
    83be:   00 00                                                               
    83c0:   48 8d 44 24 08          lea    rax,[rsp+0x8]                        
    83c5:   48 89 44 24 38          mov    QWORD PTR [rsp+0x38],rax             
    83ca:   48 c7 44 24 40 01 00    mov    QWORD PTR [rsp+0x40],0x1             
    83d1:   00 00                                                               
    83d3:   48 8d 7c 24 18          lea    rdi,[rsp+0x18]                       
    83d8:   ff 15 ca fa 23 00       call   QWORD PTR [rip+0x23faca]        # 247ea8 <_GLOBAL_OFFSET_TABLE_+0x538>
    83de:   48 83 c4 48             add    rsp,0x48                             
    83e2:   c3                      ret                                         
    83e3:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]          
    83ea:   00 00 00                                                            
    83ed:   0f 1f 00                nop    DWORD PTR [rax]

There is just one call, dummy is optimized out. x is stored at rsp+0x4. Most of the other code should be part of println!.
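
If you want dummy to survive as a separate function at higher optimization levels, so that main stays easy to read in the disassembly, the #[inline(never)] attribute can be used. This is a sketch and not the code used above: the attribute is a request to the compiler rather than a hard guarantee, and the return value is added here only so that the example has something to check.

```rust
/// Same job as dummy in uint.rs, but the compiler is asked not to inline it.
#[inline(never)]
fn dummy(x: u32) -> u32 {
    println!("{}", x);
    x // returned only so that the example can be checked
}

fn main() {
    let x: u32 = 10;
    dummy(x);
}
```

Note that this only asks for the call to be kept. The compiler may still specialize dummy for the constant argument, as it did with opt-level=1 above.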

This was a rather trivial example - of an integer local variable. Most programming languages try to keep as many variables as they can on the stack, because allocation and deallocation there are really fast. Whether that is possible depends on the lifetime of the variable - for how long should it live? Rust also stored the local variable on the stack or in a register - no different from C. Note that mut in Rust and const in C are both compile-time constructs. Using const in C and dropping mut in Rust would have generated the same, or very similar, code. Make the changes and check out the emitted code.
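
To make the point about mut concrete, here is a small sketch (the function names with_mut and without_mut are made up for illustration). Mutability is checked by the compiler; at runtime both variables are plain u32 values.

```rust
fn with_mut() -> u32 {
    let mut x: u32 = 10; // mut allows x to be reassigned
    x += 1;
    x
}

fn without_mut() -> u32 {
    let x: u32 = 10;
    // x += 1; // rejected at compile time: x is not declared mut
    x + 1
}

fn main() {
    println!("{} {}", with_mut(), without_mut());
}
```

Disassemble both functions at the same opt-level and compare them; I would expect them to be identical or very close, but verify it on your toolchain.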

2. u32 Implementations

What is interesting is the set of functions to play around with the u32 datatype. This page lists all the functions that are present to play around - swapping bytes, reversing bits, little-endian to big-endian conversion, getting an integer out of an array of bytes and many more.

In C, we have access to raw pointers, which makes many of those operations really easy to implement. In safe Rust, we can’t dereference raw pointers. These operations can be implemented without pointers, but a naive version may look very inefficient. Let us take a couple of examples.

Let us start with a simple function: from_ne_bytes, which takes an array of 4 bytes, treats it as being in native endianness (the endianness of your processor) and converts it into an integer. Let us write a C program that does this.
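
To see what native endianness means in practice, the sketch below decodes the same four bytes with from_le_bytes, from_be_bytes and from_ne_bytes, all of which are part of the standard library. The helper name decode is made up for illustration.

```rust
/// Decodes the same bytes as little-endian, big-endian and native-endian.
fn decode(arr: [u8; 4]) -> (u32, u32, u32) {
    (
        u32::from_le_bytes(arr), // first byte is the least significant
        u32::from_be_bytes(arr), // first byte is the most significant
        u32::from_ne_bytes(arr), // whichever of the two the target uses
    )
}

fn main() {
    let (le, be, ne) = decode([0x12, 0x34, 0x56, 0x78]);
    println!("le = {:#x}, be = {:#x}, ne = {:#x}", le, be, ne);

    // On a little-endian target such as x86-64, native matches little-endian.
    if cfg!(target_endian = "little") {
        assert_eq!(ne, le);
    } else {
        assert_eq!(ne, be);
    }
}
```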

rust/Rust-C-experiments/primitive-types > cat uint_impl.c
#include <stdio.h>
#include <stdint.h>

void dummy(uint32_t x);
uint32_t from_ne_bytes(uint8_t *arr);

int main()
{
    uint8_t arr[] = {0x12, 0x34, 0x56, 0x78};
    uint32_t val = from_ne_bytes(arr);
    dummy(val);
}

uint32_t from_ne_bytes (uint8_t *arr)
{
    return *(uint32_t *)(arr);
}

void dummy (uint32_t x)
{
    printf("%x\n", x);
}

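Getting an integer from an array of bytes is that easy with pointers! (Strictly speaking, reading a uint8_t array through a uint32_t pointer breaks C’s strict aliasing rule and can be misaligned on some platforms; a memcpy into a uint32_t is the portable way to write it.)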

Let us take a look at the code emitted by gcc with -O0 level.

0000000000400560 <from_ne_bytes>:                                               
  400560:   55                      push   rbp                                  
  400561:   48 89 e5                mov    rbp,rsp                              
  400564:   48 89 7d f8             mov    QWORD PTR [rbp-0x8],rdi              
  400568:   48 8b 45 f8             mov    rax,QWORD PTR [rbp-0x8]              
  40056c:   8b 00                   mov    eax,DWORD PTR [rax]                  
  40056e:   5d                      pop    rbp                                  
  40056f:   c3                      ret

rdi has the pointer to the byte array. First it is stored on the stack (which can be optimized out). If you have seen assembly code before, you will recognize that the instruction mov eax,DWORD PTR [rax] is doing the job! Both the typecast and the dereference we wrote in C happen in this single instruction. DWORD PTR says that the address is to be treated as pointing to a double word (1 word = 2 bytes, dword = 4 bytes, which is the size of uint32_t) - which is the typecasting bit. The square brackets around rax are the dereferencing bit.
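
The same cast-and-dereference can be written in Rust, but only inside an unsafe block. The following is a sketch of my own, not how the standard library implements from_ne_bytes; the function name from_ne_bytes_ptr is made up, and read_unaligned is used because a byte array is not guaranteed to be 4-byte aligned.

```rust
use std::ptr;

/// Hypothetical Rust counterpart of the C from_ne_bytes above.
fn from_ne_bytes_ptr(arr: &[u8; 4]) -> u32 {
    // Cast the byte pointer to a u32 pointer (the typecast) and read
    // through it (the dereference). Dereferencing a raw pointer is what
    // requires unsafe.
    unsafe { ptr::read_unaligned(arr.as_ptr() as *const u32) }
}

fn main() {
    let arr = [0x12u8, 0x34, 0x56, 0x78];
    let val = from_ne_bytes_ptr(&arr);

    // It agrees with the safe standard library function.
    assert_eq!(val, u32::from_ne_bytes(arr));
    println!("{:x}", val);
}
```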

One obvious optimization is that the byte array pointer passed as argument doesn’t have to be stored in the stack. It can be used directly like this: mov eax, DWORD PTR [rdi]. Let us take a look at code emitted with -O1 flag.

000000000040052d <from_ne_bytes>:                                               
  40052d:   8b 07                   mov    eax,DWORD PTR [rdi]                  
  40052f:   c3                      ret

Let us continue and take a look at code generated with -O2 flag.

0000000000400440 <main>:                                                        
  400440:   be 12 34 56 78          mov    esi,0x78563412                       
  400445:   bf f0 05 40 00          mov    edi,0x4005f0                         
  40044a:   31 c0                   xor    eax,eax                              
  40044c:   e9 bf ff ff ff          jmp    400410 <printf@plt>

This is some insane optimization. It has inlined both function calls, pre-calculated the integer at compile time and is simply printing it.

That’s all it is. This is our frame of reference, pretty competitive!

Let us write the rust equivalent now.

rust/Rust-C-experiments/primitive-types > cat u32_impl.rs 
fn main()
{
    let mut arr: [u8; 4] = [0x12, 0x34, 0x56, 0x78];
    let mut val = u32::from_ne_bytes(arr);
    dummy(val);
}

fn dummy(mut x: u32)
{
    println!("{:X}", x);
}

Let’s compile it with opt-level=0. First, let us take a look at main, which calls from_ne_bytes().

00000000000082b0 <_ZN8u32_impl4main17ha82c952930c65191E>:                       
    82b0:   48 83 ec 18             sub    rsp,0x18                             
    82b4:   c6 44 24 10 12          mov    BYTE PTR [rsp+0x10],0x12             
    82b9:   c6 44 24 11 34          mov    BYTE PTR [rsp+0x11],0x34             
    82be:   c6 44 24 12 56          mov    BYTE PTR [rsp+0x12],0x56             
    82c3:   c6 44 24 13 78          mov    BYTE PTR [rsp+0x13],0x78             
    82c8:   8b 44 24 10             mov    eax,DWORD PTR [rsp+0x10]             
    82cc:   89 44 24 14             mov    DWORD PTR [rsp+0x14],eax             
    82d0:   8b 7c 24 14             mov    edi,DWORD PTR [rsp+0x14]             
    82d4:   e8 27 fe ff ff          call   8100 <_ZN4core3num21_$LT$impl$u20$u32$GT$13from_ne_bytes17haeb35db9bea809e2E>
    82d9:   89 44 24 0c             mov    DWORD PTR [rsp+0xc],eax              
    82dd:   8b 7c 24 0c             mov    edi,DWORD PTR [rsp+0xc]              
    82e1:   e8 0a 00 00 00          call   82f0 <_ZN8u32_impl5dummy17h41288951bda61844E>
    82e6:   48 83 c4 18             add    rsp,0x18                             
    82ea:   c3                      ret                                         
    82eb:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]

The code is fairly straightforward.

The array is stored on the stack by the following instructions.

    82b4:   c6 44 24 10 12          mov    BYTE PTR [rsp+0x10],0x12             
    82b9:   c6 44 24 11 34          mov    BYTE PTR [rsp+0x11],0x34             
    82be:   c6 44 24 12 56          mov    BYTE PTR [rsp+0x12],0x56             
    82c3:   c6 44 24 13 78          mov    BYTE PTR [rsp+0x13],0x78

Next comes the call to from_ne_bytes.

    82c8:   8b 44 24 10             mov    eax,DWORD PTR [rsp+0x10]             
    82cc:   89 44 24 14             mov    DWORD PTR [rsp+0x14],eax             
    82d0:   8b 7c 24 14             mov    edi,DWORD PTR [rsp+0x14]             
    82d4:   e8 27 fe ff ff          call   8100 <_ZN4core3num21_$LT$impl$u20$u32$GT$13from_ne_bytes17haeb35db9bea809e2E>

The first instruction in the above snippet actually does the job: it treats the address of the byte array as a double-word pointer and copies the four bytes into the eax register. That integer is then stored back on the stack, loaded into edi and passed to from_ne_bytes. So the first point to note is that the function we called takes an array of 4 bytes as its argument, and because such an array fits in a 32-bit register, passing it has boiled down to the above four instructions.

If the first instruction already gets the integer we want, then what is the call to from_ne_bytes doing? Now, let us check out its code.

0000000000008100 <_ZN4core3num21_$LT$impl$u20$u32$GT$13from_ne_bytes17haeb35db9bea809e2E>:
    8100:   48 83 ec 14             sub    rsp,0x14                             
    8104:   89 7c 24 08             mov    DWORD PTR [rsp+0x8],edi              
    8108:   8b 44 24 08             mov    eax,DWORD PTR [rsp+0x8]              
    810c:   89 44 24 04             mov    DWORD PTR [rsp+0x4],eax              
    8110:   8b 44 24 04             mov    eax,DWORD PTR [rsp+0x4]              
    8114:   89 44 24 0c             mov    DWORD PTR [rsp+0xc],eax              
    8118:   8b 44 24 0c             mov    eax,DWORD PTR [rsp+0xc]              
    811c:   89 44 24 10             mov    DWORD PTR [rsp+0x10],eax             
    8120:   8b 44 24 10             mov    eax,DWORD PTR [rsp+0x10]             
    8124:   89 04 24                mov    DWORD PTR [rsp],eax                  
    8127:   8b 04 24                mov    eax,DWORD PTR [rsp]                  
    812a:   48 83 c4 14             add    rsp,0x14                             
    812e:   c3                      ret                                         
    812f:   90                      nop

If you go through the above function slowly, nothing significant is happening. The integer passed to it (which is present in the edi register) is stored and loaded several times. Then it is finally loaded into eax - which is the register used to send back the return value - and the function returns. I am not sure what these redundant loads and stores mean or what they are doing.

Let us take a look at code emitted with opt-level=1.

0000000000008200 <_ZN8u32_impl4main17ha82c952930c65191E>:                       
    8200:   50                      push   rax                                  
    8201:   e8 0a 00 00 00          call   8210 <_ZN8u32_impl5dummy17h41288951bda61844E>
    8206:   58                      pop    rax                                  
    8207:   c3                      ret                                         
    8208:   0f 1f 84 00 00 00 00    nop    DWORD PTR [rax+rax*1+0x0]            
    820f:   00                                                                  
                                                                                
0000000000008210 <_ZN8u32_impl5dummy17h41288951bda61844E>:                      
    8210:   53                      push   rbx                                  
    8211:   48 83 ec 50             sub    rsp,0x50                             
    8215:   c7 44 24 0c 12 34 56    mov    DWORD PTR [rsp+0xc],0x78563412       
    821c:   78                                                                  
    821d:   48 8d 35 7c f0 02 00    lea    rsi,[rip+0x2f07c]        # 372a0 <_ZN4core3fmt3num53_$LT$impl$u20$core..fmt..UpperHex$u20$for$u20$i32$GT$3fmt17h3127d8b68860fbfbE>
    8224:   48 8d 7c 24 0c          lea    rdi,[rsp+0xc]                        
    8229:   e8 c2 fe ff ff          call   80f0 <_ZN4core3fmt10ArgumentV13new17h3144ad7d58b84069E>
    822e:   48 89 44 24 10          mov    QWORD PTR [rsp+0x10],rax             
    8233:   48 89 54 24 18          mov    QWORD PTR [rsp+0x18],rdx             
    8238:   48 8d 5c 24 20          lea    rbx,[rsp+0x20]                       
    823d:   48 8d 74 24 10          lea    rsi,[rsp+0x10]                       
    8242:   48 89 df                mov    rdi,rbx                              
    8245:   e8 86 ff ff ff          call   81d0 <_ZN4core3fmt9Arguments6new_v117hdd5eea781ba10264E>
    824a:   48 89 df                mov    rdi,rbx                              
    824d:   ff 15 55 fc 23 00       call   QWORD PTR [rip+0x23fc55]        # 247ea8 <_GLOBAL_OFFSET_TABLE_+0x538>
    8253:   48 83 c4 50             add    rsp,0x50                             
    8257:   5b                      pop    rbx                                  
    8258:   c3                      ret                                         
    8259:   0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]

We have seen this type of optimization before. Everything is pushed into the dummy function. The array is optimized out, the integer is pre-calculated and then stored on the stack. This is definitely better than the non-optimized one. With opt-level=2, everything is pushed into a single function.

The extra loads and stores present in the unoptimized code need to be investigated.

I want you to explore the swap_bytes function. Given the number 0x12345678, calling swap_bytes on it returns 0x78563412 - the same bytes in reverse order. What do you think is the best way to do it? Rust 1.47 uses the x86 bswap instruction to get the job done - which I believe is the fastest way to do it. In C, we can define a function and write inline assembly in it to do the same thing.
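
As a starting point for that exploration, here is a hand-written byte swap using shifts and masks next to the standard library’s swap_bytes. The function name swap_bytes_manual is made up for illustration; compare the code emitted for both versions at different opt-levels.

```rust
/// Hand-written byte swap: moves each byte to its mirrored position.
fn swap_bytes_manual(x: u32) -> u32 {
    ((x & 0x0000_00ff) << 24)     // lowest byte becomes the highest
        | ((x & 0x0000_ff00) << 8)
        | ((x & 0x00ff_0000) >> 8)
        | ((x & 0xff00_0000) >> 24) // highest byte becomes the lowest
}

fn main() {
    let x: u32 = 0x12345678;
    println!("{:#x} {:#x}", x.swap_bytes(), swap_bytes_manual(x));
    assert_eq!(x.swap_bytes(), swap_bytes_manual(x));
}
```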

There are many functions like this. I think we need to see how each one is implemented, evaluate its cost and then use it. These functions not only provide an abstraction over pointers etc., but also provide safety. There are functions whi

This short analysis can be extrapolated to i(16, 32, 64) and u(16, 64). I am curious as to how the 128-bit integer types are implemented because they are not native.
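I have not looked at what rustc emits for the 128-bit types, so treat the following as a guess at the idea and not as a description of the real code. A common way to build a 128-bit addition on a 64-bit machine is to add the low halves first and carry the overflow into the high halves. The helper names add_u128_halves, split and join are made up for this sketch.

```rust
/// Add two 128-bit values, each given as (high, low) 64-bit halves.
/// Returns the (high, low) halves of the wrapping sum.
fn add_u128_halves(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    // Add the low halves and remember whether they overflowed.
    let (lo, carry) = a.1.overflowing_add(b.1);
    // Add the high halves plus the carry out of the low halves.
    let hi = a.0.wrapping_add(b.0).wrapping_add(carry as u64);
    (hi, lo)
}

/// Split a u128 into (high, low) halves.
fn split(x: u128) -> (u64, u64) {
    ((x >> 64) as u64, x as u64)
}

/// Join (high, low) halves back into a u128.
fn join(p: (u64, u64)) -> u128 {
    ((p.0 as u128) << 64) | (p.1 as u128)
}

fn main() {
    let a: u128 = u64::MAX as u128;
    let b: u128 = 1;
    let sum = join(add_u128_halves(split(a), split(b)));
    // The native u128 addition gives the same answer.
    println!("{:#x} {:#x}", sum, a.wrapping_add(b));
}
```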

3. Arrays

This is going to be a really big subtopic - a lot to explore. We need to see how they are stored in memory, how they are accessed, iterating through them, functions around it.

3.1 What does it look like in memory?

Consider the following C program.

rust/Rust-C-experiments/primitive-types > cat array.c
#include <stdio.h>
#include <stdint.h>

#define ARR_SIZE 100

void dummy(int64_t *arr);

int main ()
{
    int64_t arr[ARR_SIZE] = {0};
    dummy(arr);

    return 0;
}

void dummy (int64_t *arr)
{   
    int i = 0;
    for(i = 0; i < ARR_SIZE; i++)
    {
        printf("%ld ", arr[i]);
    }
}

Compile it with lowest optimization and get its disassembly.

rust/Rust-C-experiments/primitive-types > gcc array.c -o array_c0 -O0
rust/Rust-C-experiments/primitive-types > objdump -Mintel -D array_c0 > array_c0.obj

The following is the main function.

000000000040052d <main>:                                                        
  40052d:   55                      push   rbp                                  
  40052e:   48 89 e5                mov    rbp,rsp                              
  400531:   48 81 ec 20 03 00 00    sub    rsp,0x320                            
  400538:   48 8d b5 e0 fc ff ff    lea    rsi,[rbp-0x320]                      
  40053f:   b8 00 00 00 00          mov    eax,0x0                              
  400544:   ba 64 00 00 00          mov    edx,0x64                             
  400549:   48 89 f7                mov    rdi,rsi                              
  40054c:   48 89 d1                mov    rcx,rdx                              
  40054f:   f3 48 ab                rep stos QWORD PTR es:[rdi],rax             
  400552:   48 8d 85 e0 fc ff ff    lea    rax,[rbp-0x320]                      
  400559:   48 89 c7                mov    rdi,rax                              
  40055c:   e8 07 00 00 00          call   400568 <dummy>                       
  400561:   b8 00 00 00 00          mov    eax,0x0                              
  400566:   c9                      leave                                       
  400567:   c3                      ret

Go through it and try to understand what each line does.

The lea rsi,[rbp-0x320] loads the array’s starting address into the rsi register. The interesting part is the zeroizing of that array. In the C program, we have initialized the whole array to 0 - int64_t arr[100] = {0}. In C, we simply write one zero inside the curly braces and are done with it. But how does it actually happen at runtime? It needs to zeroize 8 * 100 = 800 bytes of stack memory. Take a look at the following instructions.

  40053f:   b8 00 00 00 00          mov    eax,0x0
  400544:   ba 64 00 00 00          mov    edx,0x64                             
  400549:   48 89 f7                mov    rdi,rsi                              
  40054c:   48 89 d1                mov    rcx,rdx                              
  40054f:   f3 48 ab                rep stos QWORD PTR es:[rdi],rax 

These 5 instructions get the job done. rcx is initialized to 100. rax is loaded with 0. rdi is loaded with the array’s starting address. Then rep stos QWORD PTR es:[rdi],rax is executed. Let us check out what this instruction does. The stos or Store String instruction stores the 8 bytes in rax into the 8 bytes pointed to by the address in rdi. rdi is then incremented by 8. rep stands for repeat and is a prefix to the stos instruction. The rep prefix brings in the loop functionality: after every store, rcx is decremented by 1, and the loop ends when rcx hits 0. The following is the C-style code of the above 5 instructions.

int64_t rcx = 100;
int64_t *rdi = arr;
int64_t rax = 0;

while(rcx > 0)
{
    *rdi = rax;
    rdi += 1;   /* Note that rdi is incremented by 8 and not 1 */
    rcx -= 1;
}

Really cool, isn’t it! Two things about this amaze me. One is that such a complex instruction is present in x64 and the second is that gcc is using it. Using the underlying instructions to get the job done is probably the fastest way to do it. Even with the -O2 flag, the exact same instructions are used to zeroize it.

Now we have our frame of reference, again very competitive.

Let’s look at the Rust program.

rust/Rust-C-experiments/primitive-types > cat array.rs
fn main ()
{
    let mut arr: [i64; 100] = [0; 100];
    dummy(arr);
}

fn dummy (mut arr: [i64; 100])
{   
    let mut i = 0;
    while i < 100
    {
        println!("{}", arr[i]);
        i += 1;
    }
}
rust/Rust-C-experiments/primitive-types > rustc array.rs -o array_rs0 -C opt-level=0

In this subsection, let us focus on how the array is stored and how it is zeroized. Let’s take a look at main’s disassembly.

0000000000008360 <_ZN5array4main17h6cec91fda9225fadE>:                          
    8360:   48 81 ec 58 06 00 00    sub    rsp,0x658                            
    8367:   48 8d 44 24 18          lea    rax,[rsp+0x18]                       
    836c:   31 f6                   xor    esi,esi                              
    836e:   48 89 c1                mov    rcx,rax                              
    8371:   48 89 cf                mov    rdi,rcx                              
    8374:   b9 20 03 00 00          mov    ecx,0x320                            
    8379:   48 89 ca                mov    rdx,rcx                              
    837c:   48 89 44 24 10          mov    QWORD PTR [rsp+0x10],rax             
    8381:   48 89 4c 24 08          mov    QWORD PTR [rsp+0x8],rcx              
    8386:   e8 25 fc ff ff          call   7fb0 <memset@plt>

Damn! It is using memset to zeroize the array. How fast/slow is memset compared to the assembly implementation we saw before? Let us check out memset’s code.

gdb-peda$ disass memset
Dump of assembler code for function memset:
=> 0x00007ffff7df5a40 <+0>:     mov    rcx,rdx
   0x00007ffff7df5a43 <+3>:     movzx  eax,sil
   0x00007ffff7df5a47 <+7>:     mov    rdx,rdi
   0x00007ffff7df5a4a <+10>:    rep stos BYTE PTR es:[rdi],al
   0x00007ffff7df5a4c <+12>:    mov    rax,rdx
   0x00007ffff7df5a4f <+15>:    ret    
End of assembler dump.

I used gdb on a program and disassembled memset - which is part of libc. And look at its implementation! It also uses the same kind of assembly implementation, but there are subtle differences between this and the code we saw in the C program, which may matter when it comes to performance.

Notice the stos instruction: rep stos BYTE PTR es:[rdi],al. Here, zeroization is done byte by byte - al (which is one byte) is copied into the byte pointed to by rdi. rdi is incremented by 1, rcx is decremented by 1. But in the above C-emitted code, zeroization happened in batches of 8 bytes. Because the C compiler knows that sizeof(int64_t) is 8 bytes, it generated the instructions accordingly. In C, the entire zeroization happens in 100 iterations, but here it takes 800 iterations.
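The iteration counts can be checked with a small model of the two instructions. This is only a sketch of their semantics in plain Rust, assuming the direction flag is clear; the function names rep_stosb and rep_stosq are made up for this post and nothing here executes the real stos instruction.

```rust
use std::mem::size_of;

/// Model of `rep stos BYTE PTR es:[rdi],al`: one byte per iteration.
/// Returns the number of iterations taken.
fn rep_stosb(buf: &mut [u8], al: u8) -> usize {
    let mut iterations = 0;
    for byte in buf.iter_mut() {
        *byte = al;
        iterations += 1;
    }
    iterations
}

/// Model of `rep stos QWORD PTR es:[rdi],rax`: eight bytes per iteration.
/// Returns the number of iterations taken.
fn rep_stosq(buf: &mut [u8], rax: u64) -> usize {
    let mut iterations = 0;
    for chunk in buf.chunks_exact_mut(8) {
        chunk.copy_from_slice(&rax.to_le_bytes());
        iterations += 1;
    }
    iterations
}

fn main() {
    // 100 elements of 8 bytes each: 800 bytes, or 0x320.
    const LEN: usize = size_of::<[i64; 100]>();
    let mut a = [0xffu8; LEN];
    let mut b = [0xffu8; LEN];
    println!("stosb iterations: {}", rep_stosb(&mut a, 0));
    println!("stosq iterations: {}", rep_stosq(&mut b, 0));
    // Both buffers end up fully zeroized.
    assert_eq!(a, b);
}
```

The model counts 800 iterations for the byte version and 100 for the quad-word version, and both leave the buffer in the same state.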

I don’t want to infer that the C-emitted code is faster than memset - just by looking at the number of iterations. Because we don’t know how hardware implements this looping and the stos instruction. The stos QWORD PTR es:[rdi], rax which copies 8 bytes at once - might be getting resolved to simpler microcode which does a byte-level copy - not sure. Once the main content of this post is over, let us look at how we can measure the two variants.

Moving forward, the following is the second half of main.

    838b:   48 8d 84 24 38 03 00    lea    rax,[rsp+0x338]                      
    8392:   00                                                                  
    8393:   48 89 c1                mov    rcx,rax                              
    8396:   48 8b 54 24 10          mov    rdx,QWORD PTR [rsp+0x10]             
    839b:   48 89 cf                mov    rdi,rcx                              
    839e:   48 89 d6                mov    rsi,rdx                              
    83a1:   48 8b 54 24 08          mov    rdx,QWORD PTR [rsp+0x8]              
    83a6:   48 89 04 24             mov    QWORD PTR [rsp],rax                  
    83aa:   e8 51 fc ff ff          call   8000 <memcpy@plt>                    
    83af:   48 8b 3c 24             mov    rdi,QWORD PTR [rsp]                  
    83b3:   e8 08 00 00 00          call   83c0 <_ZN5array5dummy17h343f70d551062a4dE>
    83b8:   48 81 c4 58 06 00 00    add    rsp,0x658                            
    83bf:   c3                      ret

That’s not all. Before dummy is called, a call to memcpy happens. What is that doing? If you read through the assembly code, you’ll see that a copy of the array is made and that copy is passed to dummy. This looks like a pass by value of an array - which is not directly possible in C. In C, structures can be passed by value, but not arrays. Arrays can only be passed by reference. Let us examine the code a bit more closely. Here, Rust is behaving like a high-level language: we can pass an array by value. But even then, at the assembly level, a reference to the copy is what gets passed to dummy. The following C code roughly mimics the Rust code.

int main()
{
    int64_t arr[100] = {0};
    int64_t arr_copy[100];      /* same element count as arr */

    memcpy(arr_copy, arr, sizeof(arr));   /* needs <string.h> */
    dummy(arr_copy);
    return 0;
}

Rust’s pass by value here is the same as making a copy of the array and passing a pointer to that copy, the way we do it in C.
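The copy can also be observed from safe Rust without reading any assembly. In this small sketch the callee modifies its array and the caller’s array stays untouched. The function name set_first_by_value is made up for the example.

```rust
/// Takes the array by value: the callee works on its own copy.
fn set_first_by_value(mut arr: [i64; 100]) -> i64 {
    arr[0] = 42;
    arr[0]
}

fn main() {
    let arr: [i64; 100] = [0; 100];
    let seen_by_callee = set_first_by_value(arr);
    // `[i64; 100]` is `Copy`, so `arr` is still usable here and unchanged.
    println!("callee saw {}, caller still has {}", seen_by_callee, arr[0]);
}
```

This prints "callee saw 42, caller still has 0", which matches the memcpy we saw in the disassembly.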

The following is the modified version of array.rs.

rust/Rust-C-experiments/primitive-types > cat array.rs
fn main ()
{
    let mut arr: [i64; 100] = [0; 100];
    dummy(&mut arr);
}

fn dummy (arr: &mut [i64; 100])
{   
    let mut i = 0;
    while i < 100
    {
        println!("{}", arr[i]);
        i += 1;
    }
}

With opt-level=0, the following code is generated.

0000000000008370 <_ZN5array4main17h6cec91fda9225fadE>:                          
    8370:   48 81 ec 28 03 00 00    sub    rsp,0x328                            
    8377:   48 8d 44 24 08          lea    rax,[rsp+0x8]                        
    837c:   31 f6                   xor    esi,esi                              
    837e:   48 89 c1                mov    rcx,rax                              
    8381:   48 89 cf                mov    rdi,rcx                              
    8384:   ba 20 03 00 00          mov    edx,0x320                            
    8389:   48 89 04 24             mov    QWORD PTR [rsp],rax                  
    838d:   e8 3e fc ff ff          call   7fd0 <memset@plt>                    
    8392:   48 8b 3c 24             mov    rdi,QWORD PTR [rsp]                  
    8396:   e8 15 00 00 00          call   83b0 <_ZN5array5dummy17hab521586f71a1064E>
    839b:   48 81 c4 28 03 00 00    add    rsp,0x328                            
    83a2:   c3                      ret                                         
    83a3:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]          
    83aa:   00 00 00                                                            
    83ad:   0f 1f 00                nop    DWORD PTR [rax]

Rust’s pass by reference here is the same as C’s pass by reference.
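Here is the matching sketch for the reference case: the callee sees the same address as the caller, and a write made through the reference is visible in the caller. The function names are made up for the example.

```rust
/// Writes through the reference, so the caller's array is modified.
fn set_first_by_ref(arr: &mut [i64; 100]) {
    arr[0] = 42;
}

/// Returns the address of the array as the callee sees it.
fn address_seen_by_callee(arr: &[i64; 100]) -> usize {
    arr.as_ptr() as usize
}

fn main() {
    let mut arr = [0i64; 100];
    let caller_address = arr.as_ptr() as usize;
    // No copy is made: both sides look at the same memory.
    assert_eq!(caller_address, address_seen_by_callee(&arr));
    set_first_by_ref(&mut arr);
    println!("arr[0] = {}", arr[0]);
}
```

This prints "arr[0] = 42".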

With opt-level=1, there is not much difference here - in the main function. With opt-level=2, the compiler inlined the dummy function into main itself, which did not happen with gcc.

I think this can be extended to arrays of i(8, 16, 32) and u(8, 16, 32, 64) declared inside a function - as a local variable.
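If that holds, the only thing that should change between element types is the length handed to memset, which is the size of the whole array in bytes. The following sketch prints that size for a few element types. The helper bytes_to_zero is made up for this post, and whether rustc really calls memset for each of these types is something to confirm in the disassembly.

```rust
use std::mem::size_of;

/// Number of bytes that have to be zeroed for a `[T; N]` local.
/// (Hypothetical helper for illustration.)
fn bytes_to_zero<T, const N: usize>() -> usize {
    size_of::<[T; N]>()
}

fn main() {
    println!("[u8; 100]  -> {} bytes", bytes_to_zero::<u8, 100>());
    println!("[i16; 100] -> {} bytes", bytes_to_zero::<i16, 100>());
    println!("[i32; 100] -> {} bytes", bytes_to_zero::<i32, 100>());
    // 800 bytes is the 0x320 we saw loaded before the call to memset.
    println!("[i64; 100] -> {} bytes", bytes_to_zero::<i64, 100>());
}
```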

Now comes the interesting, juicy part - array access.

3.2 Array access

In this section, we see how array access happens. Rust is supposed to be a safer language. It panics when an out-of-bounds array access is made. How does it work at runtime? Let’s explore.

Let’s take a look at the first few lines of the dummy function.

00000000000083b0 <_ZN5array5dummy17hab521586f71a1064E>:                         
    83b0:   48 81 ec 88 00 00 00    sub    rsp,0x88                             
    83b7:   48 c7 44 24 38 00 00    mov    QWORD PTR [rsp+0x38],0x0             
    83be:   00 00                                                               
    83c0:   48 89 7c 24 30          mov    QWORD PTR [rsp+0x30],rdi             
    83c5:   48 83 7c 24 38 64       cmp    QWORD PTR [rsp+0x38],0x64            
    83cb:   72 08                   jb     83d5 <_ZN5array5dummy17hab521586f71a1064E+0x25>
    83cd:   48 81 c4 88 00 00 00    add    rsp,0x88                             
    83d4:   c3                      ret                                         

The code is fairly simple. Some memory is allocated in the stack and the local variable is initialized to 0. The pointer to the array is stored in the stack. Then come the following instructions.

    83c5:   48 83 7c 24 38 64       cmp    QWORD PTR [rsp+0x38],0x64            
    83cb:   72 08                   jb     83d5 <_ZN5array5dummy17hab521586f71a1064E+0x25>
    83cd:   48 81 c4 88 00 00 00    add    rsp,0x88                             
    83d4:   c3                      ret

We are comparing i with 100(or 0x64). If it is not below/lesser, then we return. If it is below/lesser, the jb or jump-if-below instruction succeeds and it jumps to the instruction present after ret. So, it is safe to assume that this comparison is the one present as part of the while loop.

Let us take a look at the next couple of lines.

    83d5:   48 8b 05 7c d1 23 00    mov    rax,QWORD PTR [rip+0x23d17c]        # 245558 <__dso_handle+0x58>
    83dc:   48 8b 4c 24 38          mov    rcx,QWORD PTR [rsp+0x38]             
    83e1:   48 83 f9 64             cmp    rcx,0x64                             
    83e5:   0f 92 c2                setb   dl                                   
    83e8:   f6 c2 01                test   dl,0x1                               
    83eb:   48 89 44 24 28          mov    QWORD PTR [rsp+0x28],rax             
    83f0:   48 89 4c 24 20          mov    QWORD PTR [rsp+0x20],rcx             
    83f5:   75 05                   jne    83fc <_ZN5array5dummy17hab521586f71a1064E+0x4c>
    83f7:   e9 a8 00 00 00          jmp    84a4 <_ZN5array5dummy17hab521586f71a1064E+0xf4>

Do you notice something unusual? The local variable is again being compared with 100.

These lines use less common instructions like setb and test. Let us first understand what they do and then come back to why another check is present.

When the cmp instruction executes, it sets flags (in the eflags register) based on the result of the comparison. Based on which flags are set (=1) and not set (=0), we can take decisions - do we come out of the loop, should we come out of that if block and so on. cmp rcx,0x64 computes rcx - 100 and throws the result away; the Carry Flag is set (to 1) only when rcx is below 100 as an unsigned number. setb copies the Carry bit (either 0 or 1) into the dl sub-register. test performs a bitwise AND of its two operands and sets the Zero Flag when the result is 0. jne then jumps only when the Zero Flag is clear, that is, when dl is 1.

So when i is below 100, the Carry bit is 1, dl is 1, and the jne takes us to 0x83fc where the loop body continues. Why is the other path needed? Let us take the case when the Carry bit is 0. Then i is greater than or equal to 100, setb stores 0 in dl, the test sets the Zero Flag, the jne is not taken and we fall through to the jmp to the address 0x84a4. Let us see what is present at this address.

    84a4:   48 8d 15 b5 d0 23 00    lea    rdx,[rip+0x23d0b5]        # 245560 <__dso_handle+0x60>
    84ab:   48 8d 05 4e ad 02 00    lea    rax,[rip+0x2ad4e]        # 33200 <_ZN4core9panicking18panic_bounds_check17h2e8c50d2fb4877c0E>
    84b2:   be 64 00 00 00          mov    esi,0x64                             
    84b7:   48 8b 7c 24 20          mov    rdi,QWORD PTR [rsp+0x20]             
    84bc:   ff d0                   call   rax

There you go! core::panicking::panic_bounds_check gets called.
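Reading the cmp, setb, test and jne instructions one by one, the whole sequence can be modelled in a few lines of Rust. This is only a sketch of the flag logic; the function name bounds_check and the variables named after registers and flags are made up for the example.

```rust
/// Model of the compiler-inserted check for an array of length 100 (0x64).
/// Returns true when the access is in bounds.
fn bounds_check(rcx: u64) -> bool {
    // cmp rcx,0x64 : CF = 1 when rcx is below 0x64 (unsigned compare)
    let carry_flag = rcx < 0x64;
    // setb dl      : dl = CF
    let dl: u8 = carry_flag as u8;
    // test dl,0x1  : ZF = 1 when (dl & 1) == 0
    let zero_flag = (dl & 0x1) == 0;
    // jne          : taken when ZF == 0, which is the in-bounds path
    !zero_flag
}

fn main() {
    for i in [0u64, 99, 100, 200] {
        if bounds_check(i) {
            println!("index {}: jne taken, element is read", i);
        } else {
            println!("index {}: falls through to panic_bounds_check", i);
        }
    }
}
```

Indexes 0 and 99 take the jne, while 100 and 200 fall through to the panic path.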

We wrote only one check, the loop condition we encountered above. The second comparison is code added by the compiler to make sure the index is always in bounds. If you look at the function, you will notice that this check is done before every array access. Here, the loop runs 100 times, so 100 out-of-bounds checks are executed along with the loop-ending checks we wrote ourselves.

This is not apparent from our code - because our while loop is well behaved and ends when i hits 100. So both the cmps compare i with 100. Let us change the loop-ending check to 200 and come back to the assembly code. The first few lines look like the following.

00000000000083b0 <_ZN5array5dummy17hab521586f71a1064E>:                         
    83b0:   48 81 ec 88 00 00 00    sub    rsp,0x88                             
    83b7:   48 c7 44 24 38 00 00    mov    QWORD PTR [rsp+0x38],0x0             
    83be:   00 00                                                               
    83c0:   48 89 7c 24 30          mov    QWORD PTR [rsp+0x30],rdi             
    83c5:   48 81 7c 24 38 c8 00    cmp    QWORD PTR [rsp+0x38],0xc8            
    83cc:   00 00                                                               
    83ce:   72 08                   jb     83d8 <_ZN5array5dummy17hab521586f71a1064E+0x28>
    83d0:   48 81 c4 88 00 00 00    add    rsp,0x88                             
    83d7:   c3                      ret                                         
    83d8:   48 8b 05 79 d1 23 00    mov    rax,QWORD PTR [rip+0x23d179]        # 245558 <__dso_handle+0x58>
    83df:   48 8b 4c 24 38          mov    rcx,QWORD PTR [rsp+0x38]             
    83e4:   48 83 f9 64             cmp    rcx,0x64                             
    83e8:   0f 92 c2                setb   dl                                   
    83eb:   f6 c2 01                test   dl,0x1                               
    83ee:   48 89 44 24 28          mov    QWORD PTR [rsp+0x28],rax             
    83f3:   48 89 4c 24 20          mov    QWORD PTR [rsp+0x20],rcx             
    83f8:   75 05                   jne    83ff <_ZN5array5dummy17hab521586f71a1064E+0x4f>
    83fa:   e9 a8 00 00 00          jmp    84a7 <_ZN5array5dummy17hab521586f71a1064E+0xf7>

The first cmp is checking against 0xc8 (which is 200). The second cmp is checking 0x64/100 - which is the length of the array. The C-equivalent code would look like this.

while(i < 200)
{
    assert(i < 100);
    printf("%ld\n", arr[i]);
    i += 1;
}

If an assert fails, it will kill(or abort) the process with a SIGABRT.

I hope you are able to appreciate how simple the bound-checking code is - just what is needed.
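The same check-then-load shape can be written by hand in Rust. In this sketch the helper checked_read is made up for the post, and it returns None where the emitted code would call panic_bounds_check.

```rust
/// Read `arr[i]` the way the emitted code does: check first, then load.
fn checked_read(arr: &[i64; 100], i: usize) -> Option<i64> {
    if i < arr.len() {
        Some(arr[i])
    } else {
        None // the compiler-emitted code calls panic_bounds_check here
    }
}

fn main() {
    let arr = [7i64; 100];
    let mut i = 0;
    // Same loop bound as the modified program: 200, not 100.
    while i < 200 {
        if checked_read(&arr, i).is_none() {
            println!("index {} is out of bounds, stopping", i);
            break;
        }
        i += 1;
    }
}
```

This prints "index 100 is out of bounds, stopping". The standard library offers the same behaviour through the get method on arrays and slices, which returns an Option instead of panicking.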

Out-of-bound read/write can be deadly at times. It is illegal and it should not happen. You can take a look at this list of out-of-bound reads/writes found in various software.

Once this index is checked, the array element is printed.

There is one fundamental point to understand here. What are we trying to do here? We are iterating through an array with 100 elements. What did the compiler understand from our code? There is a loop which goes from 0 to 99. That loop has a body. That body has one line - which is an array access. Here, the looping and the array access are independent of each other. That is probably why it generated bound-checking code. But what does iterating over an array mean? It means I go over the array and definitely won’t go below/beyond it - that is the definition. We are not conveying that idea to the compiler with a while loop and an array access inside it. How can we explicitly tell the compiler that we will be iterating through the array, i.e., that we have no intention of going beyond the array? Consider the iter() abstraction. Change array.rs to use iter() instead of the while loop.

fn dummy (arr: &mut [i64; 100])
{   
    for val in arr.iter()
    {
        println!("{}", val);
    }
}

Run the program and make sure it’s working as intended.

Check out dummy’s code. You will see that there is no additional check, no panic-related code in the function.

But what about optimized versions of the code with while-loop? The optimized version had no bound-checking code when we looped from 0-99. I guess the compiler is intelligent enough. But I think it is good to convey what exactly you are doing, what you want by using the right constructs for the job, especially when the compiler understands it very well.
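To compare the two styles side by side, here is a small sketch with the same computation written once with an indexed while loop and once with iter(). The function names are made up for the example. Both return the same result; the difference is only in what we tell the compiler.

```rust
/// Indexed loop: every `arr[i]` is an access the compiler may bounds-check.
fn sum_with_while(arr: &[i64; 100]) -> i64 {
    let mut total = 0;
    let mut i = 0;
    while i < arr.len() {
        total += arr[i];
        i += 1;
    }
    total
}

/// Iterator loop: there is no index, so there is nothing to bounds-check.
fn sum_with_iter(arr: &[i64; 100]) -> i64 {
    let mut total = 0;
    for val in arr.iter() {
        total += *val;
    }
    total
}

/// Build the array 0, 1, 2, ..., 99.
fn filled() -> [i64; 100] {
    let mut arr = [0i64; 100];
    for (i, v) in arr.iter_mut().enumerate() {
        *v = i as i64;
    }
    arr
}

fn main() {
    let arr = filled();
    println!("while: {}", sum_with_while(&arr));
    println!("iter : {}", sum_with_iter(&arr));
}
```

Compiling this with different opt-level values and looking for panic_bounds_check in each function is a quick way to repeat the experiment above.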

That was a short introduction to arrays.

4. Tuple

Of all the primitive datatypes, I find the tuple to be the most interesting one. An array is a collection of objects of the same type, but a Rust tuple can have members of different datatypes. A structure can also have members of different datatypes; the difference is that in a tuple you access members by their position rather than by name. We also cannot iterate through a tuple the way we do for arrays. It will be interesting to see what a tuple looks like at the assembly level and how it works.

Let’s start with a simple program which initializes a tuple and prints it.

rust/Rust-C-experiments/primitive-types > cat tuple.rs
fn main()
{
    let tuple = ("Hello", 5, 'c');
    dummy(&tuple);
}

fn dummy(tuple: &(&str, i32, char))
{
    println!("({}, {}, {})", tuple.0, tuple.1, tuple.2);        
}

Compile it and run it. Make sure it gives the intended output. Also get its disassembly.

0000000000008420 <_ZN5tuple4main17hcb4c93a240c6cf54E>:                          
    8420:   48 83 ec 18             sub    rsp,0x18                             
    8424:   48 8d 05 35 02 03 00    lea    rax,[rip+0x30235]        # 38660 <_fini+0x10>
    842b:   48 89 04 24             mov    QWORD PTR [rsp],rax                  
    842f:   48 c7 44 24 08 05 00    mov    QWORD PTR [rsp+0x8],0x5              
    8436:   00 00                                                               
    8438:   c7 44 24 10 05 00 00    mov    DWORD PTR [rsp+0x10],0x5             
    843f:   00                                                                  
    8440:   c7 44 24 14 63 00 00    mov    DWORD PTR [rsp+0x14],0x63            
    8447:   00                                                                  
    8448:   48 89 e7                mov    rdi,rsp                              
    844b:   e8 10 00 00 00          call   8460 <_ZN5tuple5dummy17h36779743d02e81f4E>
    8450:   48 83 c4 18             add    rsp,0x18                             
    8454:   c3                      ret                                         
    8455:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]          
    845c:   00 00 00                                                            

The second instruction, lea rax,[rip+0x30235], is actually fetching the address of the string literal "Hello". Let us check out what is present at 0x38660.

0000000000038660 <str.3-0x1310>:                                                
   38660:   48                      rex.W                                       
   38661:   65 6c                   gs ins BYTE PTR es:[rdi],dx                 
   38663:   6c                      ins    BYTE PTR es:[rdi],dx                 
   38664:   6f                      outs   dx,DWORD PTR ds:[rsi]                
   38665:   28 2c 20                sub    BYTE PTR [rax+riz*1],ch              
   38668:   29 0a                   sub    DWORD PTR [rdx],ecx                  
   3866a:   00 00                   add    BYTE PTR [rax],al

Do not care about those assembly instructions - they don’t have any value. When objdump is run with the -D option, it disassembles the complete file - strings, tables, whatever is present in the binary. What we are seeing is the disassembly of a string - which does not make sense.

If you look at an ASCII conversion table, you will see that the first five bytes - 48 65 6c 6c 6f - spell out the Hello string.
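Instead of looking the bytes up in a table, we can let Rust decode them. The helper name decode is made up for this sketch.

```rust
/// Decode a run of bytes from the objdump listing as text.
fn decode(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    // The five bytes at 0x38660..=0x38664 in the listing.
    let hello: [u8; 5] = [0x48, 0x65, 0x6c, 0x6c, 0x6f];
    println!("{}", decode(&hello));
    // The same bytes are what a Rust string literal holds.
    assert_eq!("Hello".as_bytes(), &hello[..]);
}
```

This prints Hello.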

Coming back to the disassembly, right after the string literal’s address is stored, its size(= 5) is also stored. Then comes the integer 0x5 and then the character - ‘c’. A tuple of type (&str, i32, char) looks like the following in memory.

X   : String literal's address (8 bytes)
X+8 : String literal's size (8 bytes)
X+16: i32  (4 bytes)
X+20: char (4 bytes)

And then we are passing a reference to the tuple to the function. In the above example, X is what is passed to the function. If you have X (the starting address of the tuple) and the tuple’s prototype, you can access the tuple’s members from memory without any ambiguity.
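The sizes and offsets can also be printed from Rust itself. One caution: Rust does not guarantee the layout of a tuple, so the compiler is free to order the members differently from what we saw here. The sketch below therefore only prints what it finds on your machine. The type alias Tuple and the helper member_offsets are made up for this post.

```rust
use std::mem::size_of;

type Tuple = (&'static str, i32, char);

/// Byte offsets of the three members relative to the start of the tuple.
fn member_offsets(t: &Tuple) -> (usize, usize, usize) {
    let base = t as *const Tuple as usize;
    let first = &t.0 as *const &'static str as usize;
    let second = &t.1 as *const i32 as usize;
    let third = &t.2 as *const char as usize;
    (first - base, second - base, third - base)
}

fn main() {
    let tuple: Tuple = ("Hello", 5, 'c');
    // A &str is a pointer plus a length.
    println!("size of &str  : {}", size_of::<&str>());
    println!("size of i32   : {}", size_of::<i32>());
    println!("size of char  : {}", size_of::<char>());
    println!("size of tuple : {}", size_of::<Tuple>());
    let (a, b, c) = member_offsets(&tuple);
    println!("offsets of .0, .1, .2: {} {} {}", a, b, c);
}
```

On the 64-bit build examined in this post, the disassembly corresponds to offsets 0, 16 and 20 and a total size of 24 bytes (the 0x18 subtracted from rsp).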

This is how the elements are accessed in the dummy function.

0000000000008460 <_ZN5tuple5dummy17h36779743d02e81f4E>:                         
    8460:   48 81 ec c8 00 00 00    sub    rsp,0xc8                             
    8467:   48 8b 05 0a d1 23 00    mov    rax,QWORD PTR [rip+0x23d10a]        # 245578 <__dso_handle+0x78>
    846e:   48 89 f9                mov    rcx,rdi                              
    8471:   48 89 fa                mov    rdx,rdi                              
    8474:   48 81 c2 10 00 00 00    add    rdx,0x10                             
    847b:   48 81 c7 14 00 00 00    add    rdi,0x14                             
    8482:   48 89 8c 24 b0 00 00    mov    QWORD PTR [rsp+0xb0],rcx             
    8489:   00                                                                  
    848a:   48 89 94 24 b8 00 00    mov    QWORD PTR [rsp+0xb8],rdx             
    8491:   00                                                                  
    8492:   48 89 bc 24 c0 00 00    mov    QWORD PTR [rsp+0xc0],rdi             
    8499:   00                                                                  

The register rdi points to the tuple. tuple.0 should give you access to the string literal. Because rdi points to the tuple and because the string literal is the first element, rdi also points to tuple.0. That is loaded into rcx. Then rdx is loaded with rdi+0x10, which points to tuple.1 - the second tuple element. rdi itself is incremented by 0x14 (20) bytes and after the increment points to the tuple’s third member. The next few lines are related to println!.

The way the tuple is laid out in memory and the way its members are accessed is very straightforward. Let us take a look at the code emitted with opt-level=1. The following is main’s code.

0000000000008270 <_ZN5tuple4main17hcb4c93a240c6cf54E>:                          
    8270:   48 83 ec 18             sub    rsp,0x18                             
    8274:   48 8d 05 85 01 03 00    lea    rax,[rip+0x30185]        # 38400 <_fini+0x10>
    827b:   48 89 04 24             mov    QWORD PTR [rsp],rax                  
    827f:   48 c7 44 24 08 05 00    mov    QWORD PTR [rsp+0x8],0x5              
    8286:   00 00                                                               
    8288:   48 b8 05 00 00 00 63    movabs rax,0x6300000005                     
    828f:   00 00 00                                                            
    8292:   48 89 44 24 10          mov    QWORD PTR [rsp+0x10],rax             
    8297:   48 89 e7                mov    rdi,rsp                              
    829a:   e8 11 00 00 00          call   82b0 <_ZN5tuple5dummy17h36779743d02e81f4E>
    829f:   48 83 c4 18             add    rsp,0x18                             
    82a3:   c3                      ret                                         
    82a4:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]          
    82ab:   00 00 00                                                            
    82ae:   66 90                   xchg   ax,ax

The string literal’s address and its size are stored in the stack. The compiler has recognized that the second and third members are an integer and a char. The integer is 5, or 0x00000005. The char is 0x00000063. Both are combined into the 8-byte value 0x6300000005 and stored using a single mov instruction. There is no change in the memory layout of the tuple - only the way it is written has changed a little. The pointer to it is sent to dummy by loading it into rdi. The code emitted by opt-level=2 optimized out the dummy function.
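The combined constant is easy to reproduce. In this sketch the i32 goes in the low 4 bytes and the char in the high 4 bytes, which on a little-endian machine puts the i32 at the lower address. The function name pack is made up for the example.

```rust
/// Pack an i32 and a char into one 64-bit value, i32 in the low half.
fn pack(number: i32, letter: char) -> u64 {
    ((letter as u32 as u64) << 32) | (number as u32 as u64)
}

fn main() {
    let packed = pack(5, 'c');
    // The same constant that movabs loads into rax.
    println!("{:#x}", packed);
    // Byte order in memory on a little-endian machine.
    println!("{:?}", packed.to_le_bytes());
}
```

The first line prints 0x6300000005. In the byte array, 5 sits at index 0 (offset X+16 in the tuple) and 0x63 sits at index 4 (offset X+20).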

Tuples can be really long (but the length should be known at compile time), and there can be tuples of tuples - any combination.
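As a quick illustration of such combinations, here is a tuple that holds another tuple, an array and a third tuple. The type alias Nested and the helper second_of_first are made up for this sketch.

```rust
type Nested = ((u8, u16), [i32; 2], (&'static str, char));

/// Reach into the inner tuple: second member of the first member.
fn second_of_first(t: &Nested) -> u16 {
    (t.0).1
}

fn main() {
    let nested: Nested = ((1, 2), [3, 4], ("five", '6'));
    println!("{}", second_of_first(&nested)); // member of a nested tuple
    println!("{}", nested.1[1]);              // element of the array member
    println!("{}", (nested.2).0);             // string inside the last tuple
}
```

Looking at the disassembly of a program like this is a good follow-up exercise.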

With that, we end our exploration on primitive datatypes.

5. A couple of interesting things

5.1 Zeroization: memset vs. emitted-C code

When we were exploring arrays, we saw arrays were zeroized in C and in Rust. The C compiler made use of specific string-based x86 instructions to zeroize the array. Rust simply called memset. For an int64_t array, the following is the zeroization code emitted by gcc.

  40053f:   b8 00 00 00 00          mov    eax,0x0                              
  400544:   ba 64 00 00 00          mov    edx,0x64                             
  400549:   48 89 f7                mov    rdi,rsi                              
  40054c:   48 89 d1                mov    rcx,rdx                              
  40054f:   f3 48 ab                rep stos QWORD PTR es:[rdi],rax

The following is how memset zeroizes an array.

=> 0x00007ffff7df5a40 <+0>:     mov    rcx,rdx                                  
   0x00007ffff7df5a43 <+3>:     movzx  eax,sil                                  
   0x00007ffff7df5a47 <+7>:     mov    rdx,rdi                                  
   0x00007ffff7df5a4a <+10>:    rep stos BYTE PTR es:[rdi],al                   
   0x00007ffff7df5a4c <+12>:    mov    rax,rdx                                  
   0x00007ffff7df5a4f <+15>:    ret

We saw that the C version zeroized the entire array in batches of 8 bytes - basically copied the contents of the rax register into memory. But in memset, al - which is 1 byte in size - is copied. For an int64_t array of 100 elements (a total of 800 bytes), the C version would be done in 100 iterations, but memset would take 800 iterations - because it is zeroizing one byte at a time. Just by comparing the iteration counts, we cannot infer that the C version is faster. Because we don’t know how the processor implements the stos instruction, we need a better way to evaluate their speeds. To do this, let us write two assembly programs - both will be zeroizing an array of 100,000 bytes. The first will use the C version, the second will use the memset code. We won’t be calling memset; instead, let us use those instructions directly. We can then see how much time each program takes.

The following is the first program.

rust/Rust-C-experiments/primitive-types > cat zeroize_c.s                                                                                                                                                   
#
# zeroize_c.s  
#             
# C style zeroizing - in batches
#                              
    .global  _start          
    .text
_start:

    # Make a loop which will keep calling zeroize
    mov $1000000, %r15
z_loop:
    mov $100000, %rdi   # Buffer size in bytes
    call zeroize
    dec %r15
    cmp $0, %r15
    jnz z_loop

exit:
    mov $60, %rax       # exit's system call number
    mov $0, %rdi        # exit(0): the status goes in rdi
    syscall

zeroize:
    push %rbp
    mov %rsp, %rbp
    sub %rdi, %rsp      # Allocate memory on the stack
    # C style zeroization code
    mov $0, %rax        # Zeroize
    mov $12500, %rcx    # Number of iterations = number of bytes/8
    mov %rsp, %rdi      # Move the starting address into rdi
    rep stosq           # 8-bytes/quad-word at a time

    leave
    ret

All you need to focus on is the code under the zeroize label. The assembly syntax used here is the AT&T syntax, but what we saw in objdump is called the Intel syntax (-Mintel option). A stack buffer of 100,000 bytes is allocated and zeroized 8 bytes at a time. Calling the zeroize function just once got over in a couple of milliseconds. To compare time, either we have to increase the buffer size or keep calling zeroize again and again. Increasing the buffer size was not feasible. The second option is a feasible one. We call zeroize 1,000,000 times.

Let us compile and run it.

rust/Rust-C-experiments/primitive-types > gcc zeroize_c.s -o zeroize_c -nostdlib

Notice that we are not linking our program against libc because we don’t need it. Run it and measure the time.

rust/Rust-C-experiments/primitive-types > time ./zeroize_c

real    0m1.354s
user    0m1.353s
sys     0m0.001s

Now consider the memset version.

#
# zeroize_memset.s                                                                                                                                                                                                
# 
# memset style zeroizing - byte size zeroizing. 
#
    .global  _start                                                                                                                                                                               
    .text
_start:

    # Make a loop which will keep calling zeroize
    mov $1000000, %r15
z_loop:
    mov $100000, %rdi   # Buffer size in bytes
    call zeroize
    dec %r15
    cmp $0, %r15
    jnz z_loop

exit:
    mov $60, %rax       # exit's system call number
    mov $0, %rdi        # exit(0) - on x86-64 the first syscall argument goes in rdi
    syscall

zeroize:
    push %rbp
    mov %rsp, %rbp
    sub %rdi, %rsp      # Allocate memory on the stack

    # memset style zeroization code
    mov $0, %rax        # Zeroize
    mov %rdi, %rcx      # Number of iterations = number of bytes
    mov %rsp, %rdi      # Move the starting address into rdi
    rep stosb           # byte wise

    leave
    ret

You can see that the only difference is that here we are doing a byte-wise zeroization. Compile and time it.

rust/Rust-C-experiments/primitive-types > time ./zeroize_memset 

real    0m1.403s
user    0m1.401s
sys     0m0.002s

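Run both of them a couple of times. What I observed was that the times were very close. From this, I will assume that both run at the same speed, even though the memset version executes 8 times as many rep iterations. This makes me wonder how the rep stos instruction is implemented in the processor. A likely explanation, which this experiment does not confirm, is that x86 processors handle rep stos with "fast string" microcode that writes memory in chunks wider than the element size.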
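
The same comparison can also be tried from Rust using inline assembly. The following is only a sketch under a few assumptions: the function names zero_bytewise and zero_quadwise are made up for this illustration, the buffers live on the heap instead of the stack, and the round count is smaller than in the assembly programs. On targets other than x86-64 it falls back to a plain fill so that it still compiles, but then it no longer measures rep stos at all.

```rust
use std::time::Instant;

/// Zero a byte buffer with `rep stosb`: one byte per store.
#[cfg(target_arch = "x86_64")]
fn zero_bytewise(buf: &mut [u8]) {
    // SAFETY: rdi and rcx describe exactly the memory owned by `buf`.
    unsafe {
        std::arch::asm!(
            "rep stosb",
            inout("rcx") buf.len() => _,      // number of bytes to store
            inout("rdi") buf.as_mut_ptr() => _,
            in("al") 0u8,                     // value stored each iteration
            options(nostack, preserves_flags)
        );
    }
}

/// Zero a buffer of 64-bit words with `rep stosq`: eight bytes per store.
#[cfg(target_arch = "x86_64")]
fn zero_quadwise(buf: &mut [u64]) {
    // SAFETY: rdi and rcx describe exactly the memory owned by `buf`.
    unsafe {
        std::arch::asm!(
            "rep stosq",
            inout("rcx") buf.len() => _,      // number of quad-words to store
            inout("rdi") buf.as_mut_ptr() => _,
            in("rax") 0u64,
            options(nostack, preserves_flags)
        );
    }
}

// Portable fallbacks (an assumption for non-x86-64 targets; no rep stos here).
#[cfg(not(target_arch = "x86_64"))]
fn zero_bytewise(buf: &mut [u8]) {
    buf.fill(0);
}

#[cfg(not(target_arch = "x86_64"))]
fn zero_quadwise(buf: &mut [u64]) {
    buf.fill(0);
}

fn main() {
    const BYTES: usize = 100_000; // same buffer size as the assembly programs
    const ROUNDS: usize = 10_000; // fewer rounds than the assembly programs

    let mut bytes = vec![0xAAu8; BYTES];
    let mut quads = vec![u64::MAX; BYTES / 8];

    let start = Instant::now();
    for _ in 0..ROUNDS {
        zero_bytewise(&mut bytes);
    }
    println!("rep stosb: {:?}", start.elapsed());

    let start = Instant::now();
    for _ in 0..ROUNDS {
        zero_quadwise(&mut quads);
    }
    println!("rep stosq: {:?}", start.elapsed());

    // Both buffers must be fully zeroed at the end.
    assert!(bytes.iter().all(|&b| b == 0));
    assert!(quads.iter().all(|&q| q == 0));
}
```

Build it with optimizations enabled and run it several times; the absolute numbers depend on the processor, so only the ratio between the two timings is interesting.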

6. Conclusion

So far, we have explored a couple of primitive datatypes: how they are stored in memory when defined in a function, how they are accessed, and how the rustc-emitted code compares to the code emitted for C. What we have explored here is a very small portion of the possibilities. There are different integer datatypes, there are floating-point datatypes (which we didn’t explore), there are arrays of other primitive datatypes (including arrays of tuples), tuples of arrays, and any combination of these. With the basic analysis done so far, you should be able to get a rough idea of what the rustc-emitted code might look like, what a construct looks like in memory and what the corresponding emitted code looks like. The only way to explore more is to try out weird combinations and lots of examples: predict how something would look in memory, then check out the assembly code and compare your analysis with what is actually there.

I skipped a few topics, such as function calls, how arguments are passed, pass-by-value and pass-by-reference, even though we encountered them. Let us explore these along with other constructs and features in future posts.