A teardown of Rust pointer types and functors

12 min read · May 12, 2023

Introduction

I wanted to come back to some of my tests of key structures and look in detail at what Rust does, so I made test cases analogous to the C++ versions I did a few weeks ago. I had a pretty good idea what to expect, but I was actually surprised: the code is better than I expected pretty much across the board.

Let’s look at the first program. It uses vanilla (not atomic) reference counted pointers. The codegen we’ll see is optimized for space.

Reference Counted Pointer Test Cases

use std::rc::Rc;

// hoist the formatting out so that it doesn't contaminate
// the cost of shared_ptr
#[inline(never)]
fn printf(s: &str, i:i32) {
  println!("{}{}", s, i);
}

// what we want to understand in detail
#[inline(never)]
fn shared_ptr() {
    // Create an Rc pointer to an i32 value
    let ptr1: Rc<i32> = Rc::new(42);
    let ptr2 = Rc::clone(&ptr1);

    // equivalent to a C style printf call (it isn't really varargs)
    printf("Value: ", *ptr1);
    printf("Cloned Value: ", *ptr2);
}

#[inline(never)]
fn main() {
    shared_ptr();
    println!("Hello, world!");
}

As we dig into this you’ll notice that the calling convention is different. That’s because I did these experiments with Linux to make things easy on myself. So the ABI is not the Windows ABI. This doesn’t make too much difference but you do see that frame setup is a bit lighter.

System V X64 ABI (Application Binary Interface):

General-purpose registers:

RDI: Used for the first function argument.
RSI: Used for the second function argument.
RDX: Used for the third function argument.
RCX: Used for the fourth function argument.
R8 and R9: Used for the fifth and sixth function arguments, respectively.
RAX: Used for the return value.
RBX, RBP, R12-R15: Callee-saved registers that must be preserved across function calls.

Floating-point and SIMD registers:

XMM0-XMM7: Used for passing and returning floating-point or SIMD values.
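
One detail is worth making concrete before reading the disassembly. A Rust &str is a pointer plus a length, which is why each printf call in the listings loads a string address into rdi and a length into rsi, with the i32 landing in rdx. Rust does not promise a stable ABI, so treat the register assignment as an observation about this build rather than a guarantee. A minimal sketch (the helper name str_parts is made up for illustration):

```rust
use std::mem::size_of;

// Hypothetical helper: splits a string slice into the two values
// that travel separately, its data pointer and its byte length.
fn str_parts(s: &str) -> (*const u8, usize) {
    (s.as_ptr(), s.len())
}

fn main() {
    // A &str is two machine words wide: pointer + length.
    println!("size_of::<&str>() = {}", size_of::<&str>());

    // The two literals used by the test programs.
    for s in ["Value: ", "Cloned Value: "] {
        let (ptr, len) = str_parts(s);
        println!("{:?} -> ptr {:p}, len {:#x}", s, ptr, len);
    }
}
```

The lengths printed here (0x7 and 0xe) are the same constants that show up as push/pop pairs feeding rsi in the disassembly.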

The main thing we have to look at is the codegen for the shared_ptr function.

hello`hello::shared_ptr::h770517f0dc95aff5:
    // stash rbx, we use it and it's preserved
    0x0c967 <+0>:   push   rbx

    // allocate room for ptr1 and ptr2
    0x0c968 <+1>:   sub    rsp, 0x10

    // alloc i32 shared ptr value 42 (rdi is arg1)
    0x0c96c <+5>:   push   0x2a
    0x0c96e <+7>:   pop    rdi

    // alloc::rc::Rc$LT$T$GT$::new::h6e5de24059a41191 at rc.rs:374
    0x0c96f <+8>:   call   0x0ca63            ; 

    // store returned Rc<i32> into ptr1
    0x0c974 <+13>:  mov    qword ptr [rsp], rax

    // increment the ref count (not atomic)
    // [inlined] alloc::rc::RcInnerPtr::inc_strong
    0x0c978 <+17>:  inc    qword ptr [rax]

    // this is a sanity check, the count should be non-zero   
    0x0c97b <+20>:  je     0x0c9c1            ; <+90>    

    // rbx is preserved
    0x0c97d <+22>:  mov    rbx, rax

    // save ptr2
    0x0c980 <+25>:  mov    qword ptr [rsp + 0x8], rax

    // fetch the value out of the shared pointer, this will be arg3
    0x0c985 <+30>:  mov    edx, dword ptr [rax + 0x10]

    // "Value:"
    0x0c988 <+33>:  lea    rdi, [rip + 0x36672]
    // length 7
    0x0c98f <+40>:  push   0x7
    0x0c991 <+42>:  pop    rsi

    // arg1 is rdi (string), arg2 is rsi (length), arg3 is rdx
    // hello::printf::hddab0e4c0f1b5399 at main.rs:5
    0x0c992 <+43>:  call   0x0c8fb

    // now load the value out of ptr2 which is really the same as ptr1
    // but now stored in rbx
    0x0c997 <+48>:  mov    edx, dword ptr [rbx + 0x10]

    // get the string into rdi
    0x0c99a <+51>:  lea    rdi, [rip + 0x36667]

    // length 14
    0x0c9a1 <+58>:  push   0xe
    0x0c9a3 <+60>:  pop    rsi

    // arg1 is rdi (string), arg2 is rsi (length), arg3 is rdx
    // hello::printf::hddab0e4c0f1b5399 at main.rs:5   
    0x0c9a4 <+61>:  call   0x0c8fb             

    // arg1 is ptr2, call teardown helper
    0x0c9a9 <+66>:  lea    rdi, [rsp + 0x8]   

    // drop reference count ptr2
    0x0c9ae <+71>:  call   0x0ca94
 
    // arg1 is ptr1, call teardown helper
    0x0c9b3 <+76>:  mov    rdi, rsp

    // drop reference count ptr1
    0x0c9b6 <+79>:  call   0x0ca94

    // clear the frame and we're done
    0x0c9bb <+84>:  add    rsp, 0x10
    0x0c9bf <+88>:  pop    rbx
    0x0c9c0 <+89>:  ret

    // assorted things for exception cases not in main flow
    0x0c9c1 <+90>:  ud2  // undefined opcode, for sure crash
    ----
    0x0c9c3 <+92>:  ud2
    ----
    // pointer teardown
    0x0c9c5 <+94>:  mov    rbx, rax
    0x0c9c8 <+97>:  lea    rdi, [rsp + 0x8]

    // drop ptr2
    0x0c9cd <+102>: call   0x0c8f6

    // drop ptr1
    0x0c9d2 <+107>: mov    rdi, rsp
    0x0c9d5 <+110>: call   0x0c8f6

    // _Unwind_Resume
    0x0c9da <+115>: mov    rdi, rbx
    0x0c9dd <+118>: call   0x0a040
    0x0c9e2 <+123>: ud2

    // drop helper for Rc, drops arg1 (rdi)
    // strong then weak
    // note not atomic!
    0x0ca94 <+0>:  mov    rdi, qword ptr [rdi]
    0x0ca97 <+3>:  dec    qword ptr [rdi]
    0x0ca9a <+6>:  jne    0x0caa2            ; <+14> at rc.rs:1604:6
    0x0ca9c <+8>:  dec    qword ptr [rdi + 0x8]
    0x0caa0 <+12>: je     0x0caa3            ; <+15> at rc.rs
    0x0caa2 <+14>: ret

    // dealloc if needed (size 0x18, align 0x8)
    0x0caa3 <+15>: push   0x18
    0x0caa5 <+17>: pop    rsi
    0x0caa6 <+18>: push   0x8
    0x0caa8 <+20>: pop    rdx
    0x0caa9 <+21>: jmp    qword ptr [rip + 0x47101]
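
The offsets in the listing above suggest a heap block with the strong count at +0x0, the weak count at +0x8, and the i32 at +0x10, released with size 0x18 and alignment 0x8. The sketch below checks that reading from the outside. RcBlockMirror is a hypothetical stand-in written for this example; it is not the standard library's private type, and the real layout is an implementation detail that could change.

```rust
use std::mem::{align_of, size_of};
use std::rc::Rc;

// Hypothetical mirror of the block seen in the disassembly.
// NOT the std definition, just the same three fields in order.
#[allow(dead_code)]
#[repr(C)]
struct RcBlockMirror {
    strong: usize, // +0x0 on a 64-bit target
    weak: usize,   // +0x8
    value: i32,    // +0x10
}

// Returns the strong count while two handles exist, then the count
// after one handle is dropped.
fn counts_after_clone() -> (usize, usize) {
    let ptr1 = Rc::new(42);
    let ptr2 = Rc::clone(&ptr1); // this is the plain `inc` in the listing
    let with_two = Rc::strong_count(&ptr1);
    drop(ptr2); // this is the call to the shared drop helper
    (with_two, Rc::strong_count(&ptr1))
}

fn main() {
    let (with_two, with_one) = counts_after_clone();
    println!("strong count: {} then {}", with_two, with_one);
    println!(
        "mirror block: size {:#x}, align {:#x}",
        size_of::<RcBlockMirror>(),
        align_of::<RcBlockMirror>()
    );
}
```

On a 64-bit target the mirror comes out at three machine words, matching the 0x18 passed to the deallocator.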

If we look at this again with the atomic version, Arc, the codegen is very similar.

use std::sync::Arc;

#[inline(never)]
fn printf(s: &str, i:i32) {
  println!("{}{}", s, i);
}

// what we want to understand in detail
#[inline(never)]
fn shared_ptr() {
    // Create an Arc pointer to an i32 value
    let ptr1: Arc<i32> = Arc::new(42);
    let ptr2 = Arc::clone(&ptr1);

    // equivalent to a C style printf call (it isn't really varargs)
    printf("Value: ", *ptr1);
    printf("Cloned Value: ", *ptr2);
}

#[inline(never)]
fn main() {
    shared_ptr();
    println!("Hello, world!");
}

The helper functions are the atomic ones, but the pattern is the same. The inline code uses the lock prefix now.

hello`hello::shared_ptr::h770517f0dc95aff5:
    // preserve rbx, space for ptr1 & ptr2
    0x0c98e <+0>:   push   rbx
    0x0c98f <+1>:   sub    rsp, 0x10

    // alloc args rsi, rdi
    0x0c993 <+5>:   push   0x18
    0x0c995 <+7>:   pop    rdi
    0x0c996 <+8>:   push   0x8
    0x0c998 <+10>:  pop    rsi
    //  alloc::sync::Arc$LT$T$GT$::new
    0x0c999 <+11>:  call   qword ptr [rip + 0x471a9]  

    // failed alloc case
    0x0c99f <+17>:  test   rax, rax
    0x0c9a2 <+20>:  je     0x0ca03    ; <+117> at main.rs

    // save ptr1 in rbx
    0x0c9a4 <+22>:  mov    rbx, rax
    0x0c9a7 <+25>:  push   0x1
    0x0c9a9 <+27>:  pop    rax

    // set ref count 1 in weak and strong count
    0x0c9aa <+28>:  mov    qword ptr [rbx], rax
    0x0c9ad <+31>:  mov    qword ptr [rbx + 0x8], rax

    // store the 42 into the value part
    0x0c9b1 <+35>:  mov    dword ptr [rbx + 0x10], 0x2a

    // store ptr1 on the stack, then atomic inc (note the lock prefix)
    0x0c9b8 <+42>:  mov    qword ptr [rsp], rbx
    0x0c9bc <+46>:  lock
    0x0c9bd <+47>:  inc    qword ptr [rbx]

    // fault check for negative strong ref count
    // panic on failure
    0x0c9c0 <+50>:  jle    0x0ca11

    // save ptr2, then set up the length 7 string "Value: "
    0x0c9c2 <+52>:  mov    qword ptr [rsp + 0x8], rbx
    0x0c9c7 <+57>:  lea    rdi, [rip + 0x36633]
    0x0c9ce <+64>:  push   0x7
    0x0c9d0 <+66>:  pop    rsi
    // i32 in edx, constant 42 folded all the way down
    0x0c9d1 <+67>:  push   0x2a
    0x0c9d3 <+69>:  pop    rdx

    // hello::printf::hddab0e4c0f1b5399 at main.rs:5
    0x0c9d4 <+70>:  call   0x0c922

    // length 14 string "Cloned Value:", i32 in edx
    0x0c9d9 <+75>:  mov    edx, dword ptr [rbx + 0x10]
    0x0c9dc <+78>:  lea    rdi, [rip + 0x36625]
    0x0c9e3 <+85>:  push   0xe
    0x0c9e5 <+87>:  pop    rsi

    // hello::printf::hddab0e4c0f1b5399 at main.rs:5
    0x0c9e6 <+88>:  call   0x0c922

    // downcount for ptr2
    0x0c9eb <+93>:  lea    rdi, [rsp + 0x8]
    0x0c9f0 <+98>:  call   0x0c913

    // downcount for ptr1
    0x0c9f5 <+103>: mov    rdi, rsp
    0x0c9f8 <+106>: call   0x0c913

    // restore the frame and we're done
    0x0c9fd <+111>: add    rsp, 0x10
    0x0ca01 <+115>: pop    rbx
    0x0ca02 <+116>: ret

    // exception code... out of memory (null allocated Arc)
    0x0ca03 <+117>: push   0x18
    0x0ca05 <+119>: pop    rdi
    0x0ca06 <+120>: push   0x8
    0x0ca08 <+122>: pop    rsi
    0x0ca09 <+123>: call   qword ptr [rip + 0x47529]
    0x0ca0f <+129>: ud2
    0x0ca11 <+131>: ud2
    0x0ca13 <+133>: ud2

    // more code associated with EH cleanup,
    // this is called by EH handlers
    0x0ca15 <+135>: mov    rbx, rax
    0x0ca18 <+138>: jmp    0x0ca27            ; <+153> at main.rs

    // unwind logic
    0x0ca1a <+140>: mov    rbx, rax

    // downcount ptr2
    0x0ca1d <+143>: lea    rdi, [rsp + 0x8]
    0x0ca22 <+148>: call   0x0c913

    // downcount ptr1
    0x0ca27 <+153>: mov    rdi, rsp
    0x0ca2a <+156>: call   0x0c913
    0x0ca2f <+161>: mov    rdi, rbx
    0x0ca32 <+164>: call   0x0a040            ; _Unwind_Resume
    0x0ca37 <+169>: ud2
    0x0ca39 <+171>: call   qword ptr [rip + 0x47309]
    0x0ca3f <+177>: ud2
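
The lock prefix on the count updates is what makes the following pattern sound, where several threads clone and drop handles to the same value concurrently. This is a small sketch rather than part of the measured test; the function name share_across_threads is made up for illustration.

```rust
use std::sync::Arc;
use std::thread;

// Hands one clone of the Arc to each of `n` threads, reads the value
// in every thread, and reports the sum plus the final strong count.
fn share_across_threads(n: usize) -> (i32, usize) {
    let shared = Arc::new(42);

    let handles: Vec<_> = (0..n)
        .map(|_| {
            // Atomic increment of the strong count.
            let local = Arc::clone(&shared);
            // `local` is dropped (atomic decrement) when the closure ends.
            thread::spawn(move || *local)
        })
        .collect();

    let sum: i32 = handles.into_iter().map(|h| h.join().unwrap()).sum();

    // Every thread has finished, so only `shared` is left.
    (sum, Arc::strong_count(&shared))
}

fn main() {
    let (sum, remaining) = share_across_threads(4);
    println!("sum = {}, strong count afterwards = {}", sum, remaining);
}
```

The same program written with Rc would not compile, because Rc is not Send; that compile-time check is what lets Rc get away with the plain inc and dec seen earlier.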

Comparable C++ code generated with clang for this ABI was 298 bytes, generally due to inlining choices, and that was without the EH stuff. So the Rust atomics are cheaper. Rust consistently uses helpers to downcount and release, which results in a smaller binary. Even counting the shared helpers we get about 220 bytes of Rust codegen.

For comparison, here’s the shared code for Arc pointer downcounts.

// down counts Arc
hello`core::ptr::drop_in_place$LT$alloc..sync..Arc$LT$i32$GT$$GT$::hf94c93c50ebb536d:

    // get the stored count address from the Arc pointer
    0x0c913 <+0>:  mov    rax, qword ptr [rdi]

    // decrement the strong count
    0x0c916 <+3>:  lock
    0x0c917 <+4>:  dec    qword ptr [rax]
    0x0c91a <+7>:  jne    0x0c921            ; <+14> at mod.rs:490:1

    // we take care of the weak counts if the strong count fell to zero
    // alloc::sync::Arc$LT$T$GT$::drop_slow at sync.rs:1271:26
    0x0c91c <+9>:  jmp    0x0c8b9            
 
    // this is our exit if the strong count is still positive
    0x0c921 <+14>: ret

// downcount weak count
hello`alloc::sync::Arc$LT$T$GT$::drop_slow::h7848796d67ee9ba2:
    0x0c8b9 <+0>:  mov    rdi, qword ptr [rdi]
    // rdi now holds the pointer to the heap block. If that pointer
    // is -1 (usize::MAX) there is no allocation behind it: that value
    // is the sentinel for a dangling Weak, so there is no weak count
    // to touch and we just return
    0x0c8bc <+3>:  cmp    rdi, -0x1
    0x0c8c0 <+7>:  je     0x0c8d5            ; <+28> at sync.rs:1272:6

    // decrement the weak count
    0x0c8c2 <+9>:  lock
    0x0c8c3 <+10>: dec    qword ptr [rdi + 0x8]
    0x0c8c7 <+14>: jne    0x0c8d5            ; <+28> at sync.rs:1272:6

    // when the weak count drops to zero we nuke it
    0x0c8c9 <+16>: push   0x18
    0x0c8cb <+18>: pop    rsi
    0x0c8cc <+19>: push   0x8
    0x0c8ce <+21>: pop    rdx

    // this is a tail call to the deallocator
    0x0c8cf <+22>: jmp    qword ptr [rip + 0x472db]

    // this is our exit if we didn't lower the weak count to zero
    0x0c8d5 <+28>: ret

The above is largely the same as the non-atomic case, but with the lock prefix.
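
The "strong then weak" order in both helpers has a visible effect at the source level: the value stops being reachable when the last strong handle goes away, while the block itself is only released once the weak count also reaches zero. A minimal sketch of that behavior follows (the function name strong_then_weak is made up for illustration). The last line relies on the documented fact that Weak::new() does not allocate; the comparison against -1 in drop_slow appears to be the check for that no-allocation case.

```rust
use std::sync::{Arc, Weak};

// Returns (strong count, weak count, upgradable before drop,
// upgradable after drop) for one Arc with one Weak observer.
fn strong_then_weak() -> (usize, usize, bool, bool) {
    let strong = Arc::new(42);
    let weak: Weak<i32> = Arc::downgrade(&strong);

    let strong_count = Arc::strong_count(&strong);
    let weak_count = Arc::weak_count(&strong);
    let before = weak.upgrade().is_some();

    // Strong count hits zero here; the value is gone, but the block
    // stays allocated until `weak` is dropped at the end of the function.
    drop(strong);
    let after = weak.upgrade().is_some();

    (strong_count, weak_count, before, after)
}

fn main() {
    println!("{:?}", strong_then_weak());

    // A Weak that never pointed at an allocation can never be upgraded.
    let dangling: Weak<i32> = Weak::new();
    println!("dangling upgrade: {:?}", dangling.upgrade());
}
```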

Functor Test Cases

I started with what I consider to be the most generic pattern with the functor on the heap, hence we use Box<dyn Fn(i32)> as the data type for the functor. This adds a little extra box processing but actually it ends up being pretty good.

#[inline(never)]
fn printf(s: &str, i:i32) {
  println!("{}{}", s, i);
}

#[inline(never)]
fn call_functor(f: Box<dyn Fn(i32)>) {
    f(64);
}

#[inline(never)]
fn main() {
    let a = 32;
    let b = 16;
    call_functor(
      Box::new( move |x| {
        printf("f1 ", a + x);
      })
    );
    call_functor(
      Box::new( move |x| {
        printf("f2 ", b + x);
      })
    );
}

The disassembly is very economical. The normal Rust convention of transferring ownership on call works very well, with the functor being dropped by shared code. It could have been stored but it wasn’t in this case. Had it been stored, that too would have been a normal ownership transfer.

This is the code to create two such functors and use them with call_functor.

hello`hello::main::hc72431543d354046:
    // allocate one temp 
    // note that rax has trash, this is just cheap stack reservation
    0x0ca3b <+0>:  push   rax

    // allocate boxed functor
    ; alloc::alloc::exchange_malloc::hb0a7f17b28ecb40b at alloc.rs:324
    0x0ca3c <+1>:  call   0x0c955

    // setup for the 1st functor call, capture 0x20 (32)
    0x0ca41 <+6>:  mov    dword ptr [rax], 0x20

    // rsi (arg2) is the metadata and rdi (arg1) is the data...
    0x0ca47 <+12>: lea    rsi, [rip + 0x4489a]
    0x0ca4e <+19>: mov    rdi, rax

    // invoke the functor, note that as usual it takes ownership
    // so it cleans up!
    // hello::call_functor::h5816b5e009ebdb16 at main.rs:9
    0x0ca51 <+22>: call   0x0c9f9

    // same code again... different functor, 
    // there are two of these to avoid inlining any constants
    // into call_functor
    // alloc::alloc::exchange_malloc::hb0a7f17b28ecb40b at alloc.rs:324
    0x0ca56 <+27>: call   0x0c955            ;

    // setup for the 2nd functor call, capture 0x10 (16)
    0x0ca5b <+32>: mov    dword ptr [rax], 0x10
    0x0ca61 <+38>: lea    rsi, [rip + 0x448b0]
    0x0ca68 <+45>: mov    rdi, rax

    // restore the stack
    0x0ca6b <+48>: pop    rax

    // tail call for the 2nd functor invocation
    // hello::call_functor::h5816b5e009ebdb16 at main.rs:9
    0x0ca6c <+49>: jmp    0x0c9f9

This is the code invoking an owned functor (hence call and drop).

hello`hello::call_functor::h5816b5e009ebdb16:
    // stash rbx, preserved reg and erect frame
    0x0c9f9 <+0>:  push   rbx
    0x0c9fa <+1>:  sub    rsp, 0x10

    // we want our metadata ptr in rax so we can dereference it later
    // with an offset at the call to get the address of the function
    0x0c9fe <+5>:  mov    rax, rsi

    // save rsi and rdi (the functor) for delete
    0x0ca01 <+8>:  mov    qword ptr [rsp], rdi
    0x0ca05 <+12>: mov    qword ptr [rsp + 0x8], rsi

    // arg 0x40 goes in rsi, normal arg (arg2)
    0x0ca0a <+17>: push   0x40
    0x0ca0c <+19>: pop    rsi

    // invoke the function
    // rdi still has our data pointer to captured state
    // rsi was set above to be actual argument
    0x0ca0d <+20>: call   qword ptr [rax + 0x28]

    // get the box ptr using stack pointer address
    0x0ca10 <+23>: mov    rdi, rsp

    // drop the boxed functor
    0x0ca13 <+26>: call   0x0c918

    // and we're done, restore frame and registers
    0x0ca18 <+31>: add    rsp, 0x10
    0x0ca1c <+35>: pop    rbx
    0x0ca1d <+36>: ret

    // all this is EH stuff that mostly doesn't run
    // but it's cleanup we might need.
    0x0ca1e <+37>: mov    rbx, rax
    0x0ca21 <+40>: mov    rdi, rsp

    // drop boxed functor
    0x0ca24 <+43>: call   0x0c918
    0x0ca29 <+48>: mov    rdi, rbx
    0x0ca2c <+51>: call   0x0a040            ; _Unwind_Resume
    0x0ca31 <+56>: ud2   
    0x0ca33 <+58>: call   qword ptr [rip + 0x4730f]
    0x0ca39 <+64>: ud2
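
The call-then-drop shape of call_functor can be observed from safe code. In the sketch below, DropFlag is a hypothetical type added only to count drops, and the closure returns its result instead of printing so that it can be checked; otherwise it follows the same pattern as the test program.

```rust
use std::cell::Cell;
use std::rc::Rc;

// Hypothetical helper type: bumps a shared counter when dropped.
struct DropFlag(Rc<Cell<u32>>);

impl Drop for DropFlag {
    fn drop(&mut self) {
        self.0.set(self.0.get() + 1);
    }
}

// Takes ownership of the boxed functor, so the box (and everything
// the closure captured) is dropped before this function returns.
#[inline(never)]
fn call_functor(f: Box<dyn Fn(i32) -> i32>) -> i32 {
    f(64)
}

// Returns the functor's result and how many drops the caller can
// see immediately after call_functor returns.
fn run() -> (i32, u32) {
    let drops = Rc::new(Cell::new(0));
    let flag = DropFlag(Rc::clone(&drops));
    let a = 32;

    let result = call_functor(Box::new(move |x: i32| {
        let _keep = &flag; // forces `flag` to be captured by the closure
        a + x
    }));

    (result, drops.get())
}

fn main() {
    let (result, drops) = run();
    println!("result = {}, drops seen by caller = {}", result, drops);
}
```

The caller never runs cleanup for the functor itself, which is consistent with main in the listing above tail-calling call_functor for the second invocation.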

There were vtables/metadata for the functors; the pointer to the called code is marked with ***. Note that I removed the high order bits from the address so the numbers look small. They were really at a large base address.

0x0a12e8: 0x000000000000c954 0x0000000000000004
0x0a12f8: 0x0000000000000004 0x000000000000c8f6
0x0a1308: 0x000000000000ca71 *** 0x00000000ca71

0x0a1318: 0x000000000000c954 0x0000000000000004
0x0a1328: 0x0000000000000004 0x000000000000c907
0x0a1338: 0x000000000000ca7f *** 0x00000000ca7f
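
The two 0x4 entries in each table are consistent with the size and alignment of a closure that captures a single i32, and the *** slot sits at offset 0x28, which matches the call qword ptr [rax + 0x28] in call_functor. The layout of trait-object vtables is not a documented guarantee, so the sketch below only checks the parts that can be observed through stable APIs. The function name closure_footprint is made up for illustration.

```rust
use std::mem::{align_of_val, size_of, size_of_val};

// Returns (size of the erased closure, its alignment, size of the
// Box<dyn Fn> handle). The first two are read through the trait
// object at run time, i.e. from its metadata.
fn closure_footprint() -> (usize, usize, usize) {
    let a: i32 = 32;
    let boxed: Box<dyn Fn(i32) -> i32> = Box::new(move |x: i32| a + x);

    (
        size_of_val(&*boxed),
        align_of_val(&*boxed),
        size_of::<Box<dyn Fn(i32) -> i32>>(),
    )
}

fn main() {
    let (size, align, handle) = closure_footprint();
    println!("closure: size {:#x}, align {:#x}", size, align);
    println!("Box<dyn Fn> handle: {} bytes (data ptr + metadata ptr)", handle);
}
```

The handle being two pointers wide is why the functor arrives in two registers, rdi for the data and rsi for the metadata.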

The actual code for the lambdas is very tight.

hello`hello::main::_$u7b$$u7b$closure$u7d$$u7d$::hfb85f8fef4d880bd:
    // do the add, rdi has our captured data
    // esi had our arg2, it becomes the new arg2
    0x0ca71 <+0>: add    esi, dword ptr [rdi]

    // we get the string literal as arg1
    0x0ca73 <+2>: lea    rdi, [rip + 0x36587]

    // tail call to printf
    0x0ca7a <+9>: jmp    0x0c98e

The other functor is identical apart from the string literal.

hello`hello::main::_$u7b$$u7b$closure$u7d$$u7d$::ha4e36a648bf249c7:
    0x0ca7f <+0>: add    esi, dword ptr [rdi]
    0x0ca81 <+2>: lea    rdi, [rip + 0x3657c]
    0x0ca88 <+9>: jmp    0x0c98e

And lastly, the simplest functor setup, where we borrow a functor on the stack. This is the one that is most analogous to the minimal old-school C “use a function pointer and a void *” approach. But this is typesafe.

The generated code is staggeringly good. If the metadata were smaller, it would beat even the bespoke C code I wrote as a floor for this approach.

#[inline(never)]
fn printf(s: &str, i:i32) {
  println!("{}{}", s, i);
}

#[inline(never)]
fn call_functor(f: &dyn Fn(i32) -> ()) {
    f(64);
}

#[inline(never)]
fn main() {
    let a = 32;
    let b = 16;
    call_functor(
      & move|x| {
          printf("f1 ", a + x);
      }
    );
    call_functor(
      & move |x| {
        printf("f2 ", b + x);
      }
    );
}

It’s still a universal functor call but the logic is super simple!

hello`hello::call_functor::h9b919e25747ab248:
    // we don't own the memory this time so it's just a call. 
    // We compute the target address from the metadata
    0x0c984 <+0>: mov    rax, qword ptr [rsi + 0x28]

    // rdi has the captured state, it flows like a "this" pointer
    // the new arg is in rsi and we tail call to the function
    0x0c988 <+4>: push   0x40
    0x0c98a <+6>: pop    rsi

    // tail call the lambda!
    0x0c98b <+7>: jmp    rax
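
To make the comparison with the old-school approach concrete, here is a sketch that puts a hand-rolled "function pointer plus void *" functor next to the borrowed &dyn Fn version. RawFunctor, add_captured and call_raw are hypothetical names written for this example, and the closures return their result instead of printing so the two styles can be compared directly.

```rust
use std::ffi::c_void;

// Hypothetical C-style functor: a code pointer plus untyped state.
struct RawFunctor {
    func: unsafe fn(*const c_void, i32) -> i32,
    state: *const c_void,
}

// The "lambda body" for the raw version. Nothing checks that `state`
// really points at an i32; the caller has to get that right.
unsafe fn add_captured(state: *const c_void, x: i32) -> i32 {
    unsafe { *(state as *const i32) + x }
}

fn call_raw(f: &RawFunctor) -> i32 {
    // Sound only because both_styles builds `f` from a live i32.
    unsafe { (f.func)(f.state, 64) }
}

// The typesafe version: same two words (data, metadata), but the
// compiler builds and checks them.
fn call_functor(f: &dyn Fn(i32) -> i32) -> i32 {
    f(64)
}

fn both_styles() -> (i32, i32) {
    let a: i32 = 32;

    let raw = RawFunctor {
        func: add_captured,
        state: &a as *const i32 as *const c_void,
    };
    let closure = move |x: i32| a + x;

    (call_raw(&raw), call_functor(&closure))
}

fn main() {
    println!("{:?}", both_styles());
}
```

Both calls pass the captured state as a hidden first argument; the difference is that the raw version needs two unsafe blocks and a cast that nothing verifies.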

The setup for the call is likewise super simple!

hello`hello::main::hc72431543d354046:
    // allocate 8 bytes of temp storage on the stack (rax is junk)
    0x0c98d <+0>:  push   rax

    // rdi is our storage pointer (on the stack)
    0x0c98e <+1>:  mov    rdi, rsp

    // capture the value 32 (note this is a dword)
    0x0c991 <+4>:  mov    dword ptr [rdi], 0x20
    // pass the metadata
    0x0c997 <+10>: lea    rsi, [rip + 0x4494a]
    // rdi, rsi have the functor data and metadata
    // now use call_functor above
    0x0c99e <+17>: call   0x0c984

    // same thing again, this time we use rsp+4 for storage and
    // capture the value 16
    0x0c9a3 <+22>: lea    rdi, [rsp + 0x4]
    0x0c9a8 <+27>: mov    dword ptr [rdi], 0x10
    0x0c9ae <+33>: lea    rsi, [rip + 0x44963]

    // again rdi, rsi is the functor
    0x0c9b5 <+40>: call   0x0c984
    0x0c9ba <+45>: pop    rax
    0x0c9bb <+46>: ret

The closure code is not shown, it’s exactly the same as the previous case — only the relative rip offset to call printf changed because things slid around a little.

The code above is very competitive with the best-case C code!

In C++ using minimal old-school void * and captured state on the stack:

main                 86-28=58 (28 bytes was security cookie)
call_func_ptr              10
lambda1                    13
lambda2                    13
                        -----
                           94
                        =====

In Rust, type safe, captured state on the stack also:

main                       47
call_functor                9
lambda1                    14
lambda2                    14
                        -----
                           84
                        =====

However, Rust does have some metadata — 48 bytes of tables for each lambda. That’s 96 total metadata bytes. C had no metadata cost. With metadata the Rust cost is 180 bytes.

The simplest C++ functors test was:

main           89 + 112 = 201
call_functor              113
lambda1                    19
lambda2                    19
conditional dealloc        14
this calc                   5
destructor                 44
copy constructor           19
                        -----
                          434
                        =====
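
For anyone who wants to re-check the sums, the byte counts from the three tables add up as follows. This is plain arithmetic on the numbers quoted above; the helper name total is made up for illustration.

```rust
// Sums a list of per-function byte counts.
fn total(parts: &[u32]) -> u32 {
    parts.iter().sum()
}

fn main() {
    // C++ with void * and stack state (main excludes the 28-byte cookie).
    let cpp_void_ptr = total(&[86 - 28, 10, 13, 13]);

    // Rust with a borrowed functor, code only, then with two 48-byte tables.
    let rust_code = total(&[47, 9, 14, 14]);
    let rust_with_metadata = rust_code + 2 * 48;

    // Simplest C++ functor test.
    let cpp_functor = total(&[89 + 112, 113, 19, 19, 14, 5, 44, 19]);

    println!("C++ void *:          {}", cpp_void_ptr);
    println!("Rust code only:      {}", rust_code);
    println!("Rust with metadata:  {}", rust_with_metadata);
    println!("C++ functors:        {}", cpp_functor);
}
```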

Closing Thoughts

Clearly a lot of thinking went into the Rust runtime model, and many choices were made for economy. In my view these will translate directly to speed.

The C++ choices feel bloated by comparison. I have to say I was very disappointed by the choices. There is no law that says C++ has to have massive functors and indeed you could make your own frugal::function but I continue to find that idiomatic "modern" C++ results in what I can only call self-defeating bloat.


Written by Rico Mariani
