Primitive Types
Rust offers a bunch of primitive datatypes - integer variants (signed, unsigned; 8/16/32/64/128-bit; the native-width `isize`/`usize`), `char`, boolean, string literals, arrays, tuples and slices.
In this post, we will be exploring a few of those types - the integer types, arrays and tuples.
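As a quick reference, here is a minimal sketch declaring one value of each of the types listed above (the variable names are mine):

```rust
fn main() {
    let i: i64 = -5;                     // signed integer
    let u: usize = 5;                    // native-width unsigned integer
    let b: bool = true;                  // boolean
    let c: char = 'R';                   // 4-byte Unicode scalar value
    let s: &str = "hello";               // string literal
    let arr: [u8; 3] = [1, 2, 3];        // fixed-size array
    let tup: (i32, bool) = (-1, false);  // tuple
    let sl: &[u8] = &arr[0..2];          // slice borrowed from the array
    println!("{} {} {} {} {} {:?} {:?} {:?}", i, u, b, c, s, arr, tup, sl);
}
```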
If you don’t have the Rust toolchain installed on your system, you can go through this to create a virtual environment and install the toolchain there.
All code used/written in this post is present here.
0. Code optimization
Before we explore the datatypes, let us look at the different optimization levels present in rustc as well as in gcc.
The Rust compiler offers multiple levels of optimization, the same way gcc does. Let us take an example and walk through three of them.
Consider the following C program.
rust/Rust-C-experiments/primitive-types > cat opt_example.c
#include <stdio.h>
#include <stdint.h>
int main()
{
uint32_t x = 10;
return 0;
}
I am using gcc-4.8.5. Let us compile it with no optimization (or optimization level 0) and get its disassembly.
rust/Rust-C-experiments/primitive-types > gcc opt_example.c -o opt_example_c0 -O0
rust/Rust-C-experiments/primitive-types > objdump -Mintel -D opt_example_c0 > opt_example_c0.obj
The following is `main`'s assembly code.
00000000004004ed <main>:
4004ed: 55 push rbp
4004ee: 48 89 e5 mov rbp,rsp
4004f1: c7 45 fc 0a 00 00 00 mov DWORD PTR [rbp-0x4],0xa
4004f8: b8 00 00 00 00 mov eax,0x0
4004fd: 5d pop rbp
4004fe: c3 ret
4004ff: 90 nop
The generated code goes well with the general understanding that a local variable is stored on the stack. The local variable `x` is placed on the stack at memory location `rbp-0x4`. Then 10 (or 0xa) is loaded into it, and the function returns. This assembly code is an exact equivalent of our C code.
Now, let us compile the code with -O1 optimization level and check.
00000000004004ed <main>:
4004ed: b8 00 00 00 00 mov eax,0x0
4004f2: c3 ret
4004f3: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
4004fa: 00 00 00
4004fd: 0f 1f 00 nop DWORD PTR [rax]
This code is almost nothing. It loads 0x0 into `eax` - the assembly equivalent of `return 0`. The local variable `x` is not seen here. Is this valid code?
Well, `x` is not used anywhere. Why should it be given memory, and why should it be initialized? That is the compiler's reasoning. This actually looks like the most optimized code possible. What else can be done?
Take a look at the assembly code generated by -O2 optimization level.
0000000000400400 <main>:
400400: 31 c0 xor eax,eax
400402: c3 ret
Is there an optimization here? The `mov` is replaced with an `xor` instruction. How is that an optimization? The end goal is to zeroize the `eax` register; how it is done is left to the compiler. `xor`-ing a register with itself is the standard zeroing idiom - it is at least as fast as the `mov`, and it wins on instruction size: the `mov` instruction is 5 bytes long whereas the `xor` instruction is 2 bytes long. You fetch fewer bytes to do the same work.
For gcc 4.8.5, even -O10 is accepted as an optimization level: gcc didn't return any error for -O10, but the emitted code was the same as with -O2.
Now that we have set up a frame of reference to measure the code emitted by the Rust compiler, let us take the following Rust program.
rust/Rust-C-experiments/primitive-types > cat opt_example.rs
fn main()
{
let x: u32 = 10;
}
Let us compile it with optimization level 0 and get its disassembly.
rust/Rust-C-experiments/primitive-types > rustc opt_example.rs -o opt_example_rs0 -C opt-level=0
warning: unused variable: `x`
--> opt_example.rs:3:9
|
3 | let x: u32 = 10;
| ^ help: if this is intentional, prefix it with an underscore: `_x`
|
= note: `#[warn(unused_variables)]` on by default
warning: 1 warning emitted
Get its disassembly.
rust/Rust-C-experiments/primitive-types > objdump -Mintel -D opt_example_rs0 > opt_example_rs0.obj
The following is the Rust `main`'s disassembly.
0000000000007cc0 <_ZN11opt_example4main17h268d7c689877f0faE>:
7cc0: c3 ret
7cc1: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
7cc8: 00 00 00
7ccb: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
This is the code emitted with opt-level=0. Even here, the variable `x` has been optimized out - there is a simple return and nothing else. For this `main` function, there is nothing left to optimize, so opt-level=1 and 2 emit the same code.
I conveniently ignored all the `nop` (No OPeration) instructions present after the `ret`. A function is done once it executes the `ret` instruction, so why are there `nop`s after it? Aligning functions and loops at a certain byte boundary like 16/32 bytes is also an optimization. This Stack Overflow answer gives a good explanation. This of course increases the size of the binary a little bit.
What we saw is a very naive, simple example of compiler optimization. Present-day compilers are extremely intelligent and perform many kinds of optimizations. As we move forward, we will write larger programs with different language constructs, where we get to see various other optimizations.
Optimizations can be so aggressive that the emitted code may not resemble the code you wrote. So, get ready for optimizations and surprises. When you see an optimized version, compare it with the code you wrote and check whether it still preserves your intention - even though the code is different.
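One practical tip for these experiments: if you want the compiler to keep a value you are not otherwise using, `std::hint::black_box` acts as an opaque use the optimizer cannot see through (it is in `std::hint` on recent stable Rust; on the older toolchain used in this post it was only available in nightly's `test` crate, so treat this as a sketch):

```rust
fn main() {
    // In an optimized build, an unused x would be deleted outright,
    // just like the -O1 gcc output above.
    let x: u32 = 10;

    // black_box returns its argument but is opaque to the optimizer,
    // so the value has to actually be materialized.
    let y = std::hint::black_box(x);
    println!("{}", y);
}
```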
With that, let us start with our first class of primitive types, the integer types.
1. Integer datatypes
Because we are analyzing code emitted by the Rust compiler, what do we compare it with? We need a frame of reference, and that will be the code emitted by gcc for equivalent C programs. We will first write the C program and look at the generated code, thereby setting up our frame of reference. Then we jump into the Rust code.
Let us consider the following C program.
rust/Rust-C-experiments/primitive-types > cat uint.c
#include <stdio.h>
#include <stdint.h>
void dummy (uint32_t i)
{
printf("%u\n", i);
}
int main ()
{
uint32_t i = 10;
dummy(i);
}
I want you to check how the `main` code looks when compiled with the three gcc optimization levels.
The following is with no optimization or optimization level 0.
000000000040054e <main>:
40054e: 55 push rbp
40054f: 48 89 e5 mov rbp,rsp
400552: 48 83 ec 10 sub rsp,0x10
400556: c7 45 fc 0a 00 00 00 mov DWORD PTR [rbp-0x4],0xa
40055d: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
400560: 89 c7 mov edi,eax
400562: e8 c6 ff ff ff call 40052d <dummy>
400567: c9 leave
400568: c3 ret
400569: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0]
This is the expected code. First, some memory is allocated on the stack - `sub rsp,0x10` - and 10 is stored there. The value is then loaded back and moved into `edi` - the register used to pass the first argument. Then `dummy` is called. Pretty straightforward.
Think about it for a moment: can this be optimized? Sure it can. The variable doesn't have to be stored on the stack - it can be loaded directly into the `edi` register, and the program will still run as intended. That brings us to the code emitted by `-O1`.
0000000000400547 <main>:
400547: 48 83 ec 08 sub rsp,0x8
40054b: bf 0a 00 00 00 mov edi,0xa
400550: e8 d8 ff ff ff call 40052d <dummy>
400555: 48 83 c4 08 add rsp,0x8
400559: c3 ret
40055a: 66 0f 1f 44 00 00 nop WORD PTR [rax+rax*1+0x0]
So that is exactly what -O1 does. Try to figure out what can be optimized before you look at the optimized code. One weird thing in the above code is that it is allocating 8 bytes on the stack - `sub rsp,0x8` - when it is not needed (this is likely for the 16-byte stack alignment the x86-64 ABI requires at call sites).
What else can be optimized? Take a look at code emitted with -O2 flag.
0000000000400440 <main>:
400440: be 0a 00 00 00 mov esi,0xa
400445: bf e0 05 40 00 mov edi,0x4005e0
40044a: 31 c0 xor eax,eax
40044c: e9 bf ff ff ff jmp 400410 <printf@plt>
The whole `dummy` function is optimized out! There is a direct tail-jump to `printf` in `main`. This is not how we wrote the program, but the program still works as intended. So, no worries.
We have set up the frame of reference. Let us jump into the Rust code.
rust/Rust-C-experiments/primitive-types > cat uint.rs
fn main()
{
let mut x: u32 = 10;
dummy(x);
}
fn dummy(mut x: u32)
{
println!("{}", x);
}
Note that the idea is to write Rust programs as close as possible to the C program. In C, everything is mutable by default; alternatively, we could have used `const` in C and removed `mut` in Rust. Compiling this will obviously give warnings.
rust/Rust-C-experiments/primitive-types > rustc uint.rs -o uint_rs0 -C opt-level=0
warning: variable does not need to be mutable
--> uint.rs:3:9
|
3 | let mut x: u32 = 10;
| ----^
| |
| help: remove this `mut`
|
= note: `#[warn(unused_mut)]` on by default
warning: variable does not need to be mutable
--> uint.rs:7:10
|
7 | fn dummy(mut x: u32)
| ----^
| |
| help: remove this `mut`
warning: 2 warnings emitted
Run the program and make sure it prints `x`. Then disassemble it and look at `main`'s code.
0000000000008280 <_ZN4uint4main17h8ca9df4d1f781c43E>:
8280: 50 push rax
8281: bf 0a 00 00 00 mov edi,0xa
8286: e8 05 00 00 00 call 8290 <_ZN4uint5dummy17hc21d4f4c11081246E>
828b: 58 pop rax
828c: c3 ret
828d: 0f 1f 00 nop DWORD PTR [rax]
This looks like the code emitted by gcc with `-O1`.
Take a look at the `dummy` function.
0000000000008290 <_ZN4uint5dummy17hc21d4f4c11081246E>:
8290: 48 83 ec 78 sub rsp,0x78
8294: 48 8d 35 f5 f4 02 00 lea rsi,[rip+0x2f4f5] # 37790 <_ZN4core3fmt3num3imp52_$LT$impl$u20$core..fmt..Display$u20$for$u20$u32$GT$3fmt17h2e6d06c5e3120a4fE>
829b: 89 7c 24 2c mov DWORD PTR [rsp+0x2c],edi
829f: 48 8b 05 d2 d2 23 00 mov rax,QWORD PTR [rip+0x23d2d2] # 245578 <__dso_handle+0x58>
82a6: 48 8d 4c 24 2c lea rcx,[rsp+0x2c]
82ab: 48 89 4c 24 70 mov QWORD PTR [rsp+0x70],rcx
82b0: 48 8b 7c 24 70 mov rdi,QWORD PTR [rsp+0x70]
82b5: 48 89 44 24 20 mov QWORD PTR [rsp+0x20],rax
82ba: e8 41 fe ff ff call 8100 <_ZN4core3fmt10ArgumentV13new17h10447e08ba50c839E>
82bf: 48 89 44 24 18 mov QWORD PTR [rsp+0x18],rax
82c4: 48 89 54 24 10 mov QWORD PTR [rsp+0x10],rdx
82c9: 48 8b 44 24 18 mov rax,QWORD PTR [rsp+0x18]
82ce: 48 89 44 24 60 mov QWORD PTR [rsp+0x60],rax
82d3: 48 8b 4c 24 10 mov rcx,QWORD PTR [rsp+0x10]
82d8: 48 89 4c 24 68 mov QWORD PTR [rsp+0x68],rcx
82dd: 48 8d 54 24 60 lea rdx,[rsp+0x60]
82e2: 48 8d 7c 24 30 lea rdi,[rsp+0x30]
82e7: 48 8b 74 24 20 mov rsi,QWORD PTR [rsp+0x20]
82ec: 41 b8 02 00 00 00 mov r8d,0x2
82f2: 48 89 54 24 08 mov QWORD PTR [rsp+0x8],rdx
82f7: 4c 89 c2 mov rdx,r8
82fa: 48 8b 4c 24 08 mov rcx,QWORD PTR [rsp+0x8]
82ff: 41 b8 01 00 00 00 mov r8d,0x1
8305: e8 46 fe ff ff call 8150 <_ZN4core3fmt9Arguments6new_v117hd291308d219e9dddE>
830a: 48 8d 7c 24 30 lea rdi,[rsp+0x30]
830f: ff 15 7b fb 23 00 call QWORD PTR [rip+0x23fb7b] # 247e90 <_GLOBAL_OFFSET_TABLE_+0x538>
8315: 48 83 c4 78 add rsp,0x78
8319: c3 ret
831a: 66 0f 1f 44 00 00 nop WORD PTR [rax+rax*1+0x0]
That is a lot of code. Remember that `println!` is a macro which expands to actual Rust code - it is not a simple call to `printf`. Having all this code in `main` would pollute it - we need to know exactly how variables are being stored and used. That is why it was pushed into a separate `dummy` function.
What else can be optimized here? We know that the `dummy` function can be optimized out. Let's take a look at the code emitted with opt-level=1.
0000000000008200 <_ZN4uint4main17h8ca9df4d1f781c43E>:
8200: 50 push rax
8201: e8 0a 00 00 00 call 8210 <_ZN4uint5dummy17hc21d4f4c11081246E>
8206: 58 pop rax
8207: c3 ret
8208: 0f 1f 84 00 00 00 00 nop DWORD PTR [rax+rax*1+0x0]
820f: 00
0000000000008210 <_ZN4uint5dummy17hc21d4f4c11081246E>:
8210: 53 push rbx
8211: 48 83 ec 50 sub rsp,0x50
8215: c7 44 24 0c 0a 00 00 mov DWORD PTR [rsp+0xc],0xa
821c: 00
821d: 48 8d 35 1c f4 02 00 lea rsi,[rip+0x2f41c] # 37640 <_ZN4core3fmt3num3imp52_$LT$impl$u20$core..fmt..Display$u20$for$u20$u32$GT$3fmt17h2e6d06c5e3120a4fE>
8224: 48 8d 7c 24 0c lea rdi,[rsp+0xc]
8229: e8 c2 fe ff ff call 80f0 <_ZN4core3fmt10ArgumentV13new17h10447e08ba50c839E>
822e: 48 89 44 24 10 mov QWORD PTR [rsp+0x10],rax
8233: 48 89 54 24 18 mov QWORD PTR [rsp+0x18],rdx
8238: 48 8d 5c 24 20 lea rbx,[rsp+0x20]
823d: 48 8d 74 24 10 lea rsi,[rsp+0x10]
8242: 48 89 df mov rdi,rbx
8245: e8 86 ff ff ff call 81d0 <_ZN4core3fmt9Arguments6new_v117hdd5eea781ba10264E>
824a: 48 89 df mov rdi,rbx
824d: ff 15 55 fc 23 00 call QWORD PTR [rip+0x23fc55] # 247ea8 <_GLOBAL_OFFSET_TABLE_+0x538>
8253: 48 83 c4 50 add rsp,0x50
8257: 5b pop rbx
8258: c3 ret
8259: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0]
Observe both functions. You will notice that the local variable we defined in `main` has been pushed into the `dummy` function. The `dummy` function now looks shorter: `x` is stored in `dummy`'s stack frame at `rsp+0xc`. The obvious next optimization is the removal of the `dummy` function, pushing all the code back into `main` itself. There are three functions being called in `dummy` - let us explore `println!` and macros in general in a future post.
The following is the code generated with opt-level=2.
0000000000008380 <_ZN4uint4main17h8ca9df4d1f781c43E>:
8380: 48 83 ec 48 sub rsp,0x48
8384: c7 44 24 04 0a 00 00 mov DWORD PTR [rsp+0x4],0xa
838b: 00
838c: 48 8d 44 24 04 lea rax,[rsp+0x4]
8391: 48 89 44 24 08 mov QWORD PTR [rsp+0x8],rax
8396: 48 8d 05 43 f4 02 00 lea rax,[rip+0x2f443] # 377e0 <_ZN4core3fmt3num3imp52_$LT$impl$u20$core..fmt..Display$u20$for$u20$u32$GT$3fmt17h2e6d06c5e3120a4fE>
839d: 48 89 44 24 10 mov QWORD PTR [rsp+0x10],rax
83a2: 48 8d 05 cf d1 23 00 lea rax,[rip+0x23d1cf] # 245578 <anon.97aceed8034fd8a7f2dff7cb65913bba.0.llvm.245903583275124266+0x30>
83a9: 48 89 44 24 18 mov QWORD PTR [rsp+0x18],rax
83ae: 48 c7 44 24 20 02 00 mov QWORD PTR [rsp+0x20],0x2
83b5: 00 00
83b7: 48 c7 44 24 28 00 00 mov QWORD PTR [rsp+0x28],0x0
83be: 00 00
83c0: 48 8d 44 24 08 lea rax,[rsp+0x8]
83c5: 48 89 44 24 38 mov QWORD PTR [rsp+0x38],rax
83ca: 48 c7 44 24 40 01 00 mov QWORD PTR [rsp+0x40],0x1
83d1: 00 00
83d3: 48 8d 7c 24 18 lea rdi,[rsp+0x18]
83d8: ff 15 ca fa 23 00 call QWORD PTR [rip+0x23faca] # 247ea8 <_GLOBAL_OFFSET_TABLE_+0x538>
83de: 48 83 c4 48 add rsp,0x48
83e2: c3 ret
83e3: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
83ea: 00 00 00
83ed: 0f 1f 00 nop DWORD PTR [rax]
There is just one `call` - `dummy` is optimized out. `x` is stored at `rsp+0x4`. Most of the other code should be part of `println!`.
This was a rather trivial example - an integer local variable. Most programming languages try to store as many variables as they can on the stack, because stack allocation/deallocation is really fast. That depends on the lifetime of a variable - for how long should it live? Rust also stored the local variable on the stack/in a register - no different from C. Note that `mut` in Rust and `const` in C are both compile-time constructs. Using `const` in C, or dropping `mut` in Rust, would have generated the same or very similar code. Make the changes and check out the emitted code.
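The Rust side of that exercise is a sketch like this - the same program with `mut` removed, which compiles without warnings and should emit the same or very similar code:

```rust
fn main() {
    // No `mut`: x is immutable, mirroring `const uint32_t` in C.
    let x: u32 = 10;
    dummy(x);
}

fn dummy(x: u32) {
    println!("{}", x);
}
```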
2. u32 Implementations
What is interesting is the set of functions available for the `u32` datatype. This page lists all the functions there are to play around with - swapping bytes, reversing bits, little-endian to big-endian conversion, getting an integer out of an array of bytes and many more.
In C, we have access to raw pointers, which makes many of those operations really easy to implement. In safe Rust, we can't access variables through raw pointers. These operations can be implemented without pointers, but that would be very inefficient. Let us take a couple of examples.
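To be precise, raw pointers do exist in Rust - dereferencing them is just confined to `unsafe` blocks. The C-style cast-and-dereference looks roughly like this in unsafe Rust (the function name is mine; `read_unaligned` is used because a byte array carries no 4-byte alignment guarantee):

```rust
fn from_ne_bytes_unsafe(arr: &[u8; 4]) -> u32 {
    // Equivalent of C's *(uint32_t *)arr, except read_unaligned is
    // used because [u8; 4] is only guaranteed 1-byte alignment.
    unsafe { std::ptr::read_unaligned(arr.as_ptr() as *const u32) }
}

fn main() {
    let arr = [0x12u8, 0x34, 0x56, 0x78];
    // Agrees with the standard library's safe implementation.
    assert_eq!(from_ne_bytes_unsafe(&arr), u32::from_ne_bytes(arr));
    println!("{:x}", from_ne_bytes_unsafe(&arr));
}
```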
Let us start with a simple function: `from_ne_bytes`. It takes an array of 4 bytes, considers it to be in native endianness (the endianness of your processor) and converts it into an integer. Let us write a C program that does this.
rust/Rust-C-experiments/primitive-types > cat uint_impl.c
#include <stdio.h>
#include <stdint.h>
void dummy(uint32_t x);
uint32_t from_ne_bytes(uint8_t *arr);
int main()
{
uint8_t arr[] = {0x12, 0x34, 0x56, 0x78};
uint32_t val = from_ne_bytes(arr);
dummy(val);
}
uint32_t from_ne_bytes (uint8_t *arr)
{
return *(uint32_t *)(arr);
}
void dummy (uint32_t x)
{
printf("%x\n", x);
}
Getting an integer from an array of bytes is that easy with pointers!
Let us take a look at the code emitted by gcc with -O0 level.
0000000000400560 <from_ne_bytes>:
400560: 55 push rbp
400561: 48 89 e5 mov rbp,rsp
400564: 48 89 7d f8 mov QWORD PTR [rbp-0x8],rdi
400568: 48 8b 45 f8 mov rax,QWORD PTR [rbp-0x8]
40056c: 8b 00 mov eax,DWORD PTR [rax]
40056e: 5d pop rbp
40056f: c3 ret
`rdi` has the pointer to the byte array. First it is stored on the stack (which can be optimized out). If you have seen assembly code before, you will recognize that the instruction `mov eax,DWORD PTR [rax]` is doing the job! Both the typecast and the dereference we wrote in C happen in this single instruction. The `DWORD PTR` asks the processor to treat the array pointer as a double-word pointer (1 word = 2 bytes, so a dword is 4 bytes - the size of `uint32_t`) - that is the typecasting bit. The square brackets around `rax` are the dereferencing bit.
One obvious optimization is that the byte array pointer passed as argument doesn't have to be stored on the stack. It can be used directly, like this: `mov eax,DWORD PTR [rdi]`. Let us take a look at the code emitted with the -O1 flag.
000000000040052d <from_ne_bytes>:
40052d: 8b 07 mov eax,DWORD PTR [rdi]
40052f: c3 ret
Let us continue and take a look at code generated with -O2 flag.
0000000000400440 <main>:
400440: be 12 34 56 78 mov esi,0x78563412
400445: bf f0 05 40 00 mov edi,0x4005f0
40044a: 31 c0 xor eax,eax
40044c: e9 bf ff ff ff jmp 400410 <printf@plt>
This is some insane optimization. It has optimized out both function calls, pre-calculated the integer, and is simply printing it.
That's all there is to it. This is our frame of reference - pretty competitive!
Let us write the rust equivalent now.
rust/Rust-C-experiments/primitive-types > cat u32_impl.rs
fn main()
{
let mut arr: [u8; 4] = [0x12, 0x34, 0x56, 0x78];
let mut val = u32::from_ne_bytes(arr);
dummy(val);
}
fn dummy(mut x: u32)
{
println!("{:X}", x);
}
Let's compile it with opt-level=0. First, let us take a look at `main`, which calls `from_ne_bytes()`.
00000000000082b0 <_ZN8u32_impl4main17ha82c952930c65191E>:
82b0: 48 83 ec 18 sub rsp,0x18
82b4: c6 44 24 10 12 mov BYTE PTR [rsp+0x10],0x12
82b9: c6 44 24 11 34 mov BYTE PTR [rsp+0x11],0x34
82be: c6 44 24 12 56 mov BYTE PTR [rsp+0x12],0x56
82c3: c6 44 24 13 78 mov BYTE PTR [rsp+0x13],0x78
82c8: 8b 44 24 10 mov eax,DWORD PTR [rsp+0x10]
82cc: 89 44 24 14 mov DWORD PTR [rsp+0x14],eax
82d0: 8b 7c 24 14 mov edi,DWORD PTR [rsp+0x14]
82d4: e8 27 fe ff ff call 8100 <_ZN4core3num21_$LT$impl$u20$u32$GT$13from_ne_bytes17haeb35db9bea809e2E>
82d9: 89 44 24 0c mov DWORD PTR [rsp+0xc],eax
82dd: 8b 7c 24 0c mov edi,DWORD PTR [rsp+0xc]
82e1: e8 0a 00 00 00 call 82f0 <_ZN8u32_impl5dummy17h41288951bda61844E>
82e6: 48 83 c4 18 add rsp,0x18
82ea: c3 ret
82eb: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
The code is fairly straightforward.
The array is stored on the stack - in the following instructions.
82b4: c6 44 24 10 12 mov BYTE PTR [rsp+0x10],0x12
82b9: c6 44 24 11 34 mov BYTE PTR [rsp+0x11],0x34
82be: c6 44 24 12 56 mov BYTE PTR [rsp+0x12],0x56
82c3: c6 44 24 13 78 mov BYTE PTR [rsp+0x13],0x78
Next comes the call to `from_ne_bytes`.
82c8: 8b 44 24 10 mov eax,DWORD PTR [rsp+0x10]
82cc: 89 44 24 14 mov DWORD PTR [rsp+0x14],eax
82d0: 8b 7c 24 14 mov edi,DWORD PTR [rsp+0x14]
82d4: e8 27 fe ff ff call 8100 <_ZN4core3num21_$LT$impl$u20$u32$GT$13from_ne_bytes17haeb35db9bea809e2E>
The first instruction in the above snippet actually does the job - it treats the byte array's address as a double-word pointer and copies the 4 bytes into the `eax` register. That integer is then stored back onto the stack, and from there it is passed to the `from_ne_bytes` function. So the first point to note is that the function we wrote took an array of 4 bytes as argument, and that has boiled down to the above four instructions.
If the first instruction already gets the integer we want, then what is the call to `from_ne_bytes` doing? Let us check out its code.
0000000000008100 <_ZN4core3num21_$LT$impl$u20$u32$GT$13from_ne_bytes17haeb35db9bea809e2E>:
8100: 48 83 ec 14 sub rsp,0x14
8104: 89 7c 24 08 mov DWORD PTR [rsp+0x8],edi
8108: 8b 44 24 08 mov eax,DWORD PTR [rsp+0x8]
810c: 89 44 24 04 mov DWORD PTR [rsp+0x4],eax
8110: 8b 44 24 04 mov eax,DWORD PTR [rsp+0x4]
8114: 89 44 24 0c mov DWORD PTR [rsp+0xc],eax
8118: 8b 44 24 0c mov eax,DWORD PTR [rsp+0xc]
811c: 89 44 24 10 mov DWORD PTR [rsp+0x10],eax
8120: 8b 44 24 10 mov eax,DWORD PTR [rsp+0x10]
8124: 89 04 24 mov DWORD PTR [rsp],eax
8127: 8b 04 24 mov eax,DWORD PTR [rsp]
812a: 48 83 c4 14 add rsp,0x14
812e: c3 ret
812f: 90 nop
If you observe the above function carefully, nothing significant is happening. The integer passed to it (present in the `edi` register) is stored and loaded a couple of times. Then it is finally loaded into `eax` - the register used to send back the return value - and returned. I am not sure what these redundant load-stores mean or what they are doing.
Let us take a look at code emitted with opt-level=1.
0000000000008200 <_ZN8u32_impl4main17ha82c952930c65191E>:
8200: 50 push rax
8201: e8 0a 00 00 00 call 8210 <_ZN8u32_impl5dummy17h41288951bda61844E>
8206: 58 pop rax
8207: c3 ret
8208: 0f 1f 84 00 00 00 00 nop DWORD PTR [rax+rax*1+0x0]
820f: 00
0000000000008210 <_ZN8u32_impl5dummy17h41288951bda61844E>:
8210: 53 push rbx
8211: 48 83 ec 50 sub rsp,0x50
8215: c7 44 24 0c 12 34 56 mov DWORD PTR [rsp+0xc],0x78563412
821c: 78
821d: 48 8d 35 7c f0 02 00 lea rsi,[rip+0x2f07c] # 372a0 <_ZN4core3fmt3num53_$LT$impl$u20$core..fmt..UpperHex$u20$for$u20$i32$GT$3fmt17h3127d8b68860fbfbE>
8224: 48 8d 7c 24 0c lea rdi,[rsp+0xc]
8229: e8 c2 fe ff ff call 80f0 <_ZN4core3fmt10ArgumentV13new17h3144ad7d58b84069E>
822e: 48 89 44 24 10 mov QWORD PTR [rsp+0x10],rax
8233: 48 89 54 24 18 mov QWORD PTR [rsp+0x18],rdx
8238: 48 8d 5c 24 20 lea rbx,[rsp+0x20]
823d: 48 8d 74 24 10 lea rsi,[rsp+0x10]
8242: 48 89 df mov rdi,rbx
8245: e8 86 ff ff ff call 81d0 <_ZN4core3fmt9Arguments6new_v117hdd5eea781ba10264E>
824a: 48 89 df mov rdi,rbx
824d: ff 15 55 fc 23 00 call QWORD PTR [rip+0x23fc55] # 247ea8 <_GLOBAL_OFFSET_TABLE_+0x538>
8253: 48 83 c4 50 add rsp,0x50
8257: 5b pop rbx
8258: c3 ret
8259: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0]
We have seen this type of optimization before. Everything is pushed into the `dummy` function. The array is optimized out; the integer is pre-calculated and then stored on the stack. This is definitely better than the unoptimized version. With level-2 optimization, everything is pushed into a single function.
The extra load-stores present in the unoptimized code still need to be investigated.
I want you to explore the `swap_bytes` function. Given the number 0x12345678, calling `swap_bytes` on it returns 0x78563412 - the bytes in reverse order. What do you think is the best way to do it? Rust 1.47 uses the `bswap` x86 instruction to get the job done - which I believe is the fastest way. In C, we could define a function with inline assembly doing the same thing.
There are many functions like this. I think we need to see how each is implemented, evaluate its cost and then use it. These functions not only provide an abstraction over pointers, they also provide safety.
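As a starting point for that exploration, the observable behaviour of `swap_bytes` - and its relation to the endian-conversion helpers - can be checked from safe Rust:

```rust
fn main() {
    let x: u32 = 0x12345678;

    // swap_bytes reverses the byte order.
    assert_eq!(x.swap_bytes(), 0x78563412);

    // to_be()/to_le() are built on the same operation: they swap
    // only when the target's native endianness differs.
    let expected_be = if cfg!(target_endian = "little") {
        x.swap_bytes()
    } else {
        x
    };
    assert_eq!(x.to_be(), expected_be);

    println!("{:x}", x.swap_bytes());
}
```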
This short analysis can be extrapolated to i16/i32/i64 and u16/u64. I am curious how the 128-bit integer types are implemented, because they are not native.
3. Arrays
This is going to be a really big subtopic - there is a lot to explore. We need to see how arrays are stored in memory, how they are accessed, how to iterate through them, and the functions around them.
3.1 What does it look like in memory?
Consider the following C program.
rust/Rust-C-experiments/primitive-types > cat array.c
#include <stdio.h>
#include <stdint.h>
#define ARR_SIZE 100
void dummy(int64_t *arr);
int main ()
{
int64_t arr[ARR_SIZE] = {0};
dummy(arr);
return 0;
}
void dummy (int64_t *arr)
{
int i = 0;
for(i = 0; i < ARR_SIZE; i++)
{
printf("%d ", arr[i]);
}
}
Compile it with lowest optimization and get its disassembly.
rust/Rust-C-experiments/primitive-types > gcc array.c -o array_c0 -O0
rust/Rust-C-experiments/primitive-types > objdump -Mintel -D array_c0 > array_c0.obj
The following is the `main` function.
000000000040052d <main>:
40052d: 55 push rbp
40052e: 48 89 e5 mov rbp,rsp
400531: 48 81 ec 20 03 00 00 sub rsp,0x320
400538: 48 8d b5 e0 fc ff ff lea rsi,[rbp-0x320]
40053f: b8 00 00 00 00 mov eax,0x0
400544: ba 64 00 00 00 mov edx,0x64
400549: 48 89 f7 mov rdi,rsi
40054c: 48 89 d1 mov rcx,rdx
40054f: f3 48 ab rep stos QWORD PTR es:[rdi],rax
400552: 48 8d 85 e0 fc ff ff lea rax,[rbp-0x320]
400559: 48 89 c7 mov rdi,rax
40055c: e8 07 00 00 00 call 400568 <dummy>
400561: b8 00 00 00 00 mov eax,0x0
400566: c9 leave
400567: c3 ret
Go through it and try to understand what each line does.
The `lea rsi,[rbp-0x320]` loads the array's starting address into the `rsi` register. The interesting part is the zeroizing of that array. In the C program, we initialized the whole array to 0 - `int64_t arr[100] = {0}`. In C, we simply write a single zero inside the curly braces and we are done. But how does it actually happen at runtime? It needs to zeroize 8 * 100 = 800 bytes of stack memory. Take a look at the following instructions.
40053f: b8 00 00 00 00 mov eax,0x0
400544: ba 64 00 00 00 mov edx,0x64
400549: 48 89 f7 mov rdi,rsi
40054c: 48 89 d1 mov rcx,rdx
40054f: f3 48 ab rep stos QWORD PTR es:[rdi],rax
These instructions get the job done. `rcx` is initialized to 100 (via `edx`), `rax` is loaded with 0, and `rdi` is loaded with the array's starting address. Then `rep stos QWORD PTR es:[rdi],rax` is executed. Let us check out what this instruction does. `stos`, the **Store String** instruction, stores the 8 bytes in `rax` into the 8 bytes pointed to by the address in `rdi`; then `rdi` is incremented by 8 and `rcx` is decremented by 1. `rep` stands for **repeat** and is a **prefix** to the `stos` instruction. This `rep` prefix brings in the loop functionality: the loop ends when `rcx` hits 0. The following is C-style code for the above 5 instructions.
int64_t rcx = 100;
int64_t *rdi = arr;
int64_t rax = 0;
while(rcx > 0)
{
*rdi = rax;
rdi += 1; /* Note that rdi is incremented by 8 and not 1 */
rcx -= 1;
}
Really cool, isn't it! Two things about this amaze me: one, that such a complex instruction exists in x64; and two, that gcc actually uses it. Using the underlying instruction set to get the job done is probably the fastest way to do it. Even with the -O2 flag, the exact same instructions are used to zeroize the array.
Now we have our frame of reference - again, very competitive.
Let's look at the Rust program.
rust/Rust-C-experiments/primitive-types > cat array.rs
fn main ()
{
let mut arr: [i64; 100] = [0; 100];
dummy(arr);
}
fn dummy (mut arr: [i64; 100])
{
let mut i = 0;
while i < 100
{
println!("{}", arr[i]);
i += 1;
}
}
rust/Rust-C-experiments/primitive-types > rustc array.rs -o array_rs0 -C opt-level=0
In this subsection, let us focus on how the array is stored, how it is zeroized, etc. Let's take a look at `main`'s disassembly.
0000000000008360 <_ZN5array4main17h6cec91fda9225fadE>:
8360: 48 81 ec 58 06 00 00 sub rsp,0x658
8367: 48 8d 44 24 18 lea rax,[rsp+0x18]
836c: 31 f6 xor esi,esi
836e: 48 89 c1 mov rcx,rax
8371: 48 89 cf mov rdi,rcx
8374: b9 20 03 00 00 mov ecx,0x320
8379: 48 89 ca mov rdx,rcx
837c: 48 89 44 24 10 mov QWORD PTR [rsp+0x10],rax
8381: 48 89 4c 24 08 mov QWORD PTR [rsp+0x8],rcx
8386: e8 25 fc ff ff call 7fb0 <memset@plt>
Damn! It is using `memset` to zeroize the array. How fast/slow is `memset` compared to the assembly implementation we saw before? Let us check out `memset`'s code.
gdb-peda$ disass memset
Dump of assembler code for function memset:
=> 0x00007ffff7df5a40 <+0>: mov rcx,rdx
0x00007ffff7df5a43 <+3>: movzx eax,sil
0x00007ffff7df5a47 <+7>: mov rdx,rdi
0x00007ffff7df5a4a <+10>: rep stos BYTE PTR es:[rdi],al
0x00007ffff7df5a4c <+12>: mov rax,rdx
0x00007ffff7df5a4f <+15>: ret
End of assembler dump.
I used gdb on a program and disassembled `memset` - which is part of libc. Look at its implementation! It also uses `rep stos`, but there are subtle differences between this and the code we saw in the C program - differences which matter when it comes to performance.
Notice the `stos` instruction: `rep stos BYTE PTR es:[rdi],al`. Here, zeroization is done byte by byte - `al` (which is one byte) is copied into the byte pointed to by `rdi`; `rdi` is incremented by 1 and `rcx` is decremented by 1. But in the gcc-emitted code above, zeroization happened in batches of 8 bytes: because the C compiler knows that `sizeof(int64_t)` is 8 bytes, it generated the instructions accordingly. In C, the entire zeroization happens in 100 iterations; here it takes 800.
I don't want to infer that the gcc-emitted code is faster than `memset` just by looking at the number of iterations, because we don't know how the hardware implements this looping and the `stos` instruction. The `stos QWORD PTR es:[rdi],rax` - which copies 8 bytes at once - might be resolved into simpler microcode that does a byte-level copy; I am not sure. Once the main content of this post is over, let us look at how we can measure the two variants.
Moving forward, the following is the second half of `main`.
838b: 48 8d 84 24 38 03 00 lea rax,[rsp+0x338]
8392: 00
8393: 48 89 c1 mov rcx,rax
8396: 48 8b 54 24 10 mov rdx,QWORD PTR [rsp+0x10]
839b: 48 89 cf mov rdi,rcx
839e: 48 89 d6 mov rsi,rdx
83a1: 48 8b 54 24 08 mov rdx,QWORD PTR [rsp+0x8]
83a6: 48 89 04 24 mov QWORD PTR [rsp],rax
83aa: e8 51 fc ff ff call 8000 <memcpy@plt>
83af: 48 8b 3c 24 mov rdi,QWORD PTR [rsp]
83b3: e8 08 00 00 00 call 83c0 <_ZN5array5dummy17h343f70d551062a4dE>
83b8: 48 81 c4 58 06 00 00 add rsp,0x658
83bf: c3 ret
That's not it. Before `dummy` is called, a call to `memcpy` happens. What is that doing? If you read through the assembly code, you'll see that a copy of the array is made, and that copy is passed to `dummy`. This looks like pass-by-value of an array - which is not directly possible in C. In C, structures can be passed by value, but arrays can only be passed by reference. Let us examine the code a bit more closely. Here, Rust is behaving like a high-level language: we can pass an array by value. But even then, a reference to the copied array is what is passed to `dummy`. The following C code roughly mimics the Rust code.
int main()
{
int64_t arr[100] = {0};
int64_t arr_copy[sizeof(arr)];
memcpy(arr_copy, arr, sizeof(arr_copy));
dummy(arr_copy);
}
Rust's pass-by-value here is the same as making a copy of the array and passing a reference to that copy - the way we do it in C.
The following is the modified version of `array.rs`.
rust/Rust-C-experiments/primitive-types > cat array.rs
fn main ()
{
let mut arr: [i64; 100] = [0; 100];
dummy(&mut arr);
}
fn dummy (arr: &mut [i64; 100])
{
let mut i = 0;
while i < 100
{
println!("{}", arr[i]);
i += 1;
}
}
With opt-level=0, the following code is generated.
0000000000008370 <_ZN5array4main17h6cec91fda9225fadE>:
8370: 48 81 ec 28 03 00 00 sub rsp,0x328
8377: 48 8d 44 24 08 lea rax,[rsp+0x8]
837c: 31 f6 xor esi,esi
837e: 48 89 c1 mov rcx,rax
8381: 48 89 cf mov rdi,rcx
8384: ba 20 03 00 00 mov edx,0x320
8389: 48 89 04 24 mov QWORD PTR [rsp],rax
838d: e8 3e fc ff ff call 7fd0 <memset@plt>
8392: 48 8b 3c 24 mov rdi,QWORD PTR [rsp]
8396: e8 15 00 00 00 call 83b0 <_ZN5array5dummy17hab521586f71a1064E>
839b: 48 81 c4 28 03 00 00 add rsp,0x328
83a2: c3 ret
83a3: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
83aa: 00 00 00
83ad: 0f 1f 00 nop DWORD PTR [rax]
Rust's pass-by-reference here is the same as C's pass-by-reference.
With opt-level=1, there is not much difference in the `main` function. With opt-level=2, the `dummy` function is optimized out and inlined into `main` itself - which did not happen with gcc.
I think this can be extended to arrays of i8/i16/i32 and u8/u16/u32/u64 declared inside a function - as local variables.
Now comes the interesting, juicy part - array access.
3.2 Array access
In this section, we see how array access happens. Rust is supposed to be a safer language: it panics when an out-of-bounds array access is made. How does this work at runtime? Let's explore.
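Before reading the assembly, here is the language-level behaviour in a nutshell: indexing with `[]` panics on an out-of-bounds index, while the checked alternative `get` returns an `Option` instead.

```rust
fn main() {
    let arr: [i64; 3] = [10, 20, 30];
    let i = 5;

    // Checked access: no panic, just an Option.
    assert_eq!(arr.get(i), None);
    assert_eq!(arr.get(1), Some(&20));

    // Direct indexing with i would panic at runtime:
    // "index out of bounds: the len is 3 but the index is 5"
    // let x = arr[i];

    println!("{:?}", arr.get(1));
}
```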
Lets take a look at first few lines of dummy
function.
00000000000083b0 <_ZN5array5dummy17hab521586f71a1064E>:
83b0: 48 81 ec 88 00 00 00 sub rsp,0x88
83b7: 48 c7 44 24 38 00 00 mov QWORD PTR [rsp+0x38],0x0
83be: 00 00
83c0: 48 89 7c 24 30 mov QWORD PTR [rsp+0x30],rdi
83c5: 48 83 7c 24 38 64 cmp QWORD PTR [rsp+0x38],0x64
83cb: 72 08 jb 83d5 <_ZN5array5dummy17hab521586f71a1064E+0x25>
83cd: 48 81 c4 88 00 00 00 add rsp,0x88
83d4: c3 ret
The code is fairly simple. Some memory is allocated on the stack and the local variable is initialized to 0. The pointer to the array is stored on the stack. Then come the following instructions.
83c5: 48 83 7c 24 38 64 cmp QWORD PTR [rsp+0x38],0x64
83cb: 72 08 jb 83d5 <_ZN5array5dummy17hab521586f71a1064E+0x25>
83cd: 48 81 c4 88 00 00 00 add rsp,0x88
83d4: c3 ret
We are comparing i with 100 (or 0x64). If it is not below, we return. If it is below, the jb (jump-if-below) branch is taken and we jump to the instruction right after the ret. So it is safe to assume that this comparison is the loop condition of our while loop.
Let us take a look at the next couple of lines.
83d5: 48 8b 05 7c d1 23 00 mov rax,QWORD PTR [rip+0x23d17c] # 245558 <__dso_handle+0x58>
83dc: 48 8b 4c 24 38 mov rcx,QWORD PTR [rsp+0x38]
83e1: 48 83 f9 64 cmp rcx,0x64
83e5: 0f 92 c2 setb dl
83e8: f6 c2 01 test dl,0x1
83eb: 48 89 44 24 28 mov QWORD PTR [rsp+0x28],rax
83f0: 48 89 4c 24 20 mov QWORD PTR [rsp+0x20],rcx
83f5: 75 05 jne 83fc <_ZN5array5dummy17hab521586f71a1064E+0x4c>
83f7: e9 a8 00 00 00 jmp 84a4 <_ZN5array5dummy17hab521586f71a1064E+0xf4>
Do you notice something unusual? The local variable is being compared with 100 again.
These lines use less common instructions like setb and test. Let us first understand what they do, then come back to why a second check is present.
When the cmp instruction executes, it sets flags (in the eflags register) based on the result of the comparison. Based on which flags are set (= 1) or cleared (= 0), we can take decisions - do we leave the loop, do we enter that if block, and so on. Here, cmp rcx,0x64 performs an unsigned comparison: the Carry Flag is set (to 1) when rcx - our iterator variable i - is below 100, and cleared when it is greater than or equal to 100. The setb (set-if-below) instruction copies the Carry Flag (either 0 or 1) into the dl sub-register. test ANDs its two operands and sets the Zero Flag based on the result. Why is all this needed? Consider the two cases.
If the Carry Flag is 1, then i is in bounds (below 100): dl is 1, test dl,0x1 produces a non-zero result, the jne is taken and we jump to 0x83fc, where the array access happens. If the Carry Flag is 0, then i is greater than or equal to 100 - an out-of-bounds index: dl is 0, the jne is not taken and we fall through to the jmp to the address 0x84a4. Let us see what is present at this address.
84a4: 48 8d 15 b5 d0 23 00 lea rdx,[rip+0x23d0b5] # 245560 <__dso_handle+0x60>
84ab: 48 8d 05 4e ad 02 00 lea rax,[rip+0x2ad4e] # 33200 <_ZN4core9panicking18panic_bounds_check17h2e8c50d2fb4877c0E>
84b2: be 64 00 00 00 mov esi,0x64
84b7: 48 8b 7c 24 20 mov rdi,QWORD PTR [rsp+0x20]
84bc: ff d0 call rax
There you go! core::panicking::panic_bounds_check gets called.
We wrote only one check - the loop condition we encountered above. The second comparison is code added by the compiler to make sure the index is always in bounds. If you look through the function, you will notice that this check is done before every array access. Here, the loop runs 100 times, so there are 100 out-of-bounds checks in addition to the loop-ending check we wrote.
This is not apparent from our code, because our while loop is well-behaved and ends when i hits 100 - so both cmps compare i with 100. Let us change the loop-ending check to 200 and come back to the assembly code. The first few lines look like the following.
00000000000083b0 <_ZN5array5dummy17hab521586f71a1064E>:
83b0: 48 81 ec 88 00 00 00 sub rsp,0x88
83b7: 48 c7 44 24 38 00 00 mov QWORD PTR [rsp+0x38],0x0
83be: 00 00
83c0: 48 89 7c 24 30 mov QWORD PTR [rsp+0x30],rdi
83c5: 48 81 7c 24 38 c8 00 cmp QWORD PTR [rsp+0x38],0xc8
83cc: 00 00
83ce: 72 08 jb 83d8 <_ZN5array5dummy17hab521586f71a1064E+0x28>
83d0: 48 81 c4 88 00 00 00 add rsp,0x88
83d7: c3 ret
83d8: 48 8b 05 79 d1 23 00 mov rax,QWORD PTR [rip+0x23d179] # 245558 <__dso_handle+0x58>
83df: 48 8b 4c 24 38 mov rcx,QWORD PTR [rsp+0x38]
83e4: 48 83 f9 64 cmp rcx,0x64
83e8: 0f 92 c2 setb dl
83eb: f6 c2 01 test dl,0x1
83ee: 48 89 44 24 28 mov QWORD PTR [rsp+0x28],rax
83f3: 48 89 4c 24 20 mov QWORD PTR [rsp+0x20],rcx
83f8: 75 05 jne 83ff <_ZN5array5dummy17hab521586f71a1064E+0x4f>
83fa: e9 a8 00 00 00 jmp 84a7 <_ZN5array5dummy17hab521586f71a1064E+0xf7>
The first cmp checks against 0xc8 (which is 200). The second cmp checks against 0x64 (100) - the length of the array. The C-equivalent code would look like this.
while(i < 200)
{
assert(i < 100);
printf("%ld\n", arr[i]);
i += 1;
}
If an assert fails, it kills (aborts) the process with a SIGABRT.
I hope you can appreciate how simple the bounds-checking code is - just what is needed.
Out-of-bounds reads and writes can be deadly; they are illegal and should not happen. You can take a look at this list of out-of-bounds reads/writes found in various software.
Once the index is checked, the array element is printed.
There is one fundamental point to understand here. What are we trying to do? We are iterating through an array of 100 elements. What did the compiler understand from our code? There is a loop that goes from 0 to 99; that loop has a body; that body has one line, which is an array access. Here, the looping and the array access are independent of each other - that is probably why the compiler generated bounds-checking code. But what does iterating over an array mean? It means we walk over the array and, by definition, never go below or beyond it. We are not conveying that idea to the compiler with a while loop and an array access inside it. How can we explicitly tell the compiler that we are iterating through the array, i.e., that we have no intention of going beyond it? Consider the iter() abstraction. Change array.rs to use iter() instead of the while loop.
fn dummy (arr: &mut [i64; 100])
{
for val in arr.iter()
{
println!("{}", val);
}
}
Run the program and make sure it's working as intended.
Check out dummy's code: you will see that there is no additional check and no panic-related code in the function.
But what about the optimized versions of the while-loop code? The optimized version had no bounds-checking code when we looped from 0 to 99, so the compiler is clearly intelligent enough. Still, I think it is good to convey exactly what you want by using the right construct for the job, especially when the compiler understands it so well.
That was a short introduction to arrays.
4. Tuple
Of all the primitive datatypes, I find the tuple to be the most interesting. An array is a collection of objects of the same type, but a Rust tuple can have members of different datatypes. A structure can also have members of different datatypes; the difference is that in a tuple you access members by position, and you cannot iterate through it the way we do with arrays. It will be interesting to see what a tuple looks like at the assembly level and how it works.
Let's start with a simple program that initializes a tuple and prints it.
rust/Rust-C-experiments/primitive-types > cat tuple.rs
fn main()
{
let tuple = ("Hello", 5, 'c');
dummy(&tuple);
}
fn dummy(tuple: &(&str, i32, char))
{
println!("({}, {}, {})", tuple.0, tuple.1, tuple.2);
}
Compile it and run it. Make sure it gives the intended output. Also get its disassembly.
0000000000008420 <_ZN5tuple4main17hcb4c93a240c6cf54E>:
8420: 48 83 ec 18 sub rsp,0x18
8424: 48 8d 05 35 02 03 00 lea rax,[rip+0x30235] # 38660 <_fini+0x10>
842b: 48 89 04 24 mov QWORD PTR [rsp],rax
842f: 48 c7 44 24 08 05 00 mov QWORD PTR [rsp+0x8],0x5
8436: 00 00
8438: c7 44 24 10 05 00 00 mov DWORD PTR [rsp+0x10],0x5
843f: 00
8440: c7 44 24 14 63 00 00 mov DWORD PTR [rsp+0x14],0x63
8447: 00
8448: 48 89 e7 mov rdi,rsp
844b: e8 10 00 00 00 call 8460 <_ZN5tuple5dummy17h36779743d02e81f4E>
8450: 48 83 c4 18 add rsp,0x18
8454: c3 ret
8455: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
845c: 00 00 00
The second instruction, lea rax,[rip+0x30235], fetches the address of the string literal "Hello". Let us check what is present at 0x38660.
0000000000038660 <str.3-0x1310>:
38660: 48 rex.W
38661: 65 6c gs ins BYTE PTR es:[rdi],dx
38663: 6c ins BYTE PTR es:[rdi],dx
38664: 6f outs dx,DWORD PTR ds:[rsi]
38665: 28 2c 20 sub BYTE PTR [rax+riz*1],ch
38668: 29 0a sub DWORD PTR [rdx],ecx
3866a: 00 00 add BYTE PTR [rax],al
Do not pay attention to those assembly instructions - they are meaningless. When objdump is given the -D option, it disassembles the complete file - strings, tables, whatever is present in the binary. What we are seeing is the "disassembly" of a string, which does not make sense as code.
If you look at an ASCII conversion table, you will see that the bytes spell out the Hello string.
Coming back to the disassembly: right after the string literal's address is stored, its size (= 5) is stored too. Then comes the integer 0x5, and then the character 'c'. A tuple of type (&str, i32, char) looks like the following in memory.
X : String literal's address (8 bytes)
X+8 : String literal's size (8 bytes)
X+16: i32 (4 bytes)
X+20: char (4 bytes)
Then we pass a reference to the tuple to the function - from the above example, X is passed. If you have X (the starting address of the tuple) and the tuple's type, you can access the tuple in memory without any ambiguity.
This is how the elements are accessed in the dummy function.
0000000000008460 <_ZN5tuple5dummy17h36779743d02e81f4E>:
8460: 48 81 ec c8 00 00 00 sub rsp,0xc8
8467: 48 8b 05 0a d1 23 00 mov rax,QWORD PTR [rip+0x23d10a] # 245578 <__dso_handle+0x78>
846e: 48 89 f9 mov rcx,rdi
8471: 48 89 fa mov rdx,rdi
8474: 48 81 c2 10 00 00 00 add rdx,0x10
847b: 48 81 c7 14 00 00 00 add rdi,0x14
8482: 48 89 8c 24 b0 00 00 mov QWORD PTR [rsp+0xb0],rcx
8489: 00
848a: 48 89 94 24 b8 00 00 mov QWORD PTR [rsp+0xb8],rdx
8491: 00
8492: 48 89 bc 24 c0 00 00 mov QWORD PTR [rsp+0xc0],rdi
8499: 00
The register rdi points to the tuple. tuple.0 should give you access to the string literal; because the string literal is the first element, rdi itself points to it. That address is loaded into rcx. Then rdx is loaded with rdi+0x10, which points to tuple.1 - the second tuple element. rdi itself is incremented by 0x14 (20 bytes) and, after the increment, points to the tuple's third member. The next few lines are related to println!.
The way the tuple is laid out in memory, and the way its members are accessed, is very straightforward. Let us take a look at the code emitted with opt-level=1. The following is main's code.
0000000000008270 <_ZN5tuple4main17hcb4c93a240c6cf54E>:
8270: 48 83 ec 18 sub rsp,0x18
8274: 48 8d 05 85 01 03 00 lea rax,[rip+0x30185] # 38400 <_fini+0x10>
827b: 48 89 04 24 mov QWORD PTR [rsp],rax
827f: 48 c7 44 24 08 05 00 mov QWORD PTR [rsp+0x8],0x5
8286: 00 00
8288: 48 b8 05 00 00 00 63 movabs rax,0x6300000005
828f: 00 00 00
8292: 48 89 44 24 10 mov QWORD PTR [rsp+0x10],rax
8297: 48 89 e7 mov rdi,rsp
829a: e8 11 00 00 00 call 82b0 <_ZN5tuple5dummy17h36779743d02e81f4E>
829f: 48 83 c4 18 add rsp,0x18
82a3: c3 ret
82a4: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
82ab: 00 00 00
82ae: 66 90 xchg ax,ax
The string literal's address and its size are stored on the stack. The compiler recognized that the second and third members are an integer and a char. The integer is 5, or 0x00000005; the char is 0x00000063. Both are combined and stored with a single movabs instruction (0x6300000005). Nothing changes in the memory layout of the tuple - only the way it is filled in. A pointer to it is passed to dummy by loading it into rdi. The code emitted at opt-level=2 optimized out the dummy function.
Tuples can be really long (but their length should be known at compile time), and there can be tuples of tuples - any combination.
With that, we end our exploration on primitive datatypes.
5. A couple of interesting things
5.1 Zeroization: memset vs. emitted-C code
When we were exploring arrays, we saw how arrays are zeroized in C and in Rust: the C compiler made use of x86 string instructions inline to zeroize the array, while Rust simply called memset. For an int64_t array, the following is the zeroization code emitted by gcc.
40053f: b8 00 00 00 00 mov eax,0x0
400544: ba 64 00 00 00 mov edx,0x64
400549: 48 89 f7 mov rdi,rsi
40054c: 48 89 d1 mov rcx,rdx
40054f: f3 48 ab rep stos QWORD PTR es:[rdi],rax
The following is how memset
zeroizes an array.
=> 0x00007ffff7df5a40 <+0>: mov rcx,rdx
0x00007ffff7df5a43 <+3>: movzx eax,sil
0x00007ffff7df5a47 <+7>: mov rdx,rdi
0x00007ffff7df5a4a <+10>: rep stos BYTE PTR es:[rdi],al
0x00007ffff7df5a4c <+12>: mov rax,rdx
0x00007ffff7df5a4f <+15>: ret
We saw that the C version zeroized the array in batches of 8 bytes - it copied the contents of the rax register into memory. But memset copies al, which is 1 byte in size. For an int64_t array of 100 elements (a total of 800 bytes), the C version is done in 100 iterations while memset takes 800 - it zeroizes one byte at a time. We cannot infer from the iteration counts alone that the C version is faster, because we don't know how the processor implements the stos instruction. We need a better way to evaluate their speeds. Let us write two assembly programs, both zeroizing a buffer of 100,000 bytes: the first uses the C-style code, the second the memset-style code. We won't call memset itself; instead we use the same instructions it uses. Then we can measure how much time each program takes.
The following is the first program.
rust/Rust-C-experiments/primitive-types > cat zeroize_c.s
#
# zeroize_c.s
#
# C style zeroizing - in batches
#
.global _start
.text
_start:
# Make a loop which will keep calling zeroize
mov $1000000, %r15
z_loop:
mov $100000, %rdi # Buffer size in bytes
call zeroize
dec %r15
cmp $0, %r15
jnz z_loop
exit:
mov $60, %rax # exit's system call number
mov $0, %rdi # exit(0) - first syscall argument goes in rdi
syscall
zeroize:
push %rbp
mov %rsp, %rbp
sub %rdi, %rsp # Allocate memory on the stack
# C style zeroization code - in batches of 8 bytes
mov $0, %rax # Zeroize
mov $12500, %rcx # Number of iterations = number of bytes/8
mov %rsp, %rdi # Move the starting address into rdi
rep stosq # 8-bytes/quad-word at a time
leave
ret
All you need to focus on is the code under the zeroize label. The assembly syntax used here is AT&T syntax, while what we saw in the objdump output was Intel syntax (the -Mintel option). A stack buffer of 100,000 bytes is allocated and zeroized 8 bytes at a time. A single call to zeroize got over in a couple of milliseconds, so to compare times we either have to increase the buffer size or keep calling zeroize again and again. Increasing the buffer size much further was not feasible, so we take the second option: we call zeroize 1,000,000 times.
Let us compile and run it.
rust/Rust-C-experiments/primitive-types > gcc zeroize_c.s -o zeroize_c -nostdlib
Notice that we are not linking our program against libc because we don’t need it. Run it and measure the time.
t/Rust-C-experiments/primitive-types > time ./zeroize_c
real 0m1.354s
user 0m1.353s
sys 0m0.001s
Now consider the memset version.
#
# zeroize_memset.s
#
# memset style zeroizing - byte size zeroizing.
#
.global _start
.text
_start:
# Make a loop which will keep calling zeroize
mov $1000000, %r15
z_loop:
mov $100000, %rdi # Buffer size in bytes
call zeroize
dec %r15
cmp $0, %r15
jnz z_loop
exit:
mov $60, %rax # exit's system call number
mov $0, %rdi # exit(0) - first syscall argument goes in rdi
syscall
zeroize:
push %rbp
mov %rsp, %rbp
sub %rdi, %rsp # Allocate memory on the stack
# memset style zeroization code
mov $0, %rax # Zeroize
mov %rdi, %rcx # Number of iterations = number of bytes
mov %rsp, %rdi # Move the starting address into rdi
rep stosb # byte wise
leave
ret
You can see that the only difference is that here we are doing a byte-wise zeroization. Compile and time it.
rust/Rust-C-experiments/primitive-types > time ./zeroize_memset
real 0m1.403s
user 0m1.401s
sys 0m0.002s
Run both of them a few times. What I observed was that the times were consistently very close, so I will assume both are roughly the same speed. This makes me wonder how the rep stos
instruction is implemented in the processor.
6. Conclusion
So far, we have explored a couple of primitive datatypes: how they are stored in memory when defined in a function, how they are accessed, and how the emitted code compares with gcc-emitted C code. What we have explored here is a very small portion of the possibilities. There are the different integer datatypes, the floating-point datatypes (which we didn't explore), arrays of other primitive datatypes (including arrays of tuples), tuples of arrays - any combination. With the basic analysis done so far, you should be able to get a rough idea of what rustc-emitted code might look like and how a construct is laid out in memory. The only way to explore more is to try out weird combinations - lots of examples: predict how something will look in memory, then check the assembly code and compare your analysis with what is actually there.
I ignored a couple of topics - function calls, how arguments are passed, pass-by-value vs. pass-by-reference - even though we encountered them. Let us explore these, along with other constructs and features, in future posts.