A teardown of Rust pointer types and functors

Rico Mariani
12 min read · May 12, 2023

Introduction

I wanted to come back to some of my tests of key structures and look in detail at what Rust does, so I made test cases analogous to the C++ versions I did a few weeks ago. I had a pretty good idea what to expect, but I was actually surprised: the code is better than I expected pretty much across the board.

Let’s look at the first program, which uses vanilla (not atomic) reference-counted pointers. The codegen we’ll see is optimized for space.

Reference Counted Pointer Test Cases

use std::rc::Rc;

// hoist the formatting out so that it doesn't contaminate
// the cost of shared_ptr
#[inline(never)]
fn printf(s: &str, i: i32) {
    println!("{}{}", s, i);
}

// what we want to understand in detail
#[inline(never)]
fn shared_ptr() {
    // Create an Rc pointer to an i32 value
    let ptr1: Rc<i32> = Rc::new(42);
    let ptr2 = Rc::clone(&ptr1);

    // equivalent to a C style printf call (it isn't really varargs)
    printf("Value: ", *ptr1);
    printf("Cloned Value: ", *ptr2);
}

#[inline(never)]
fn main() {
    shared_ptr();
    println!("Hello, world!");
}
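
To watch the counts move without reading any assembly, a minimal sketch like the following reports them with Rc::strong_count and Rc::ptr_eq. The observe_counts helper is a hypothetical name used only for illustration; it is not part of the test program above.

```rust
use std::rc::Rc;

/// Returns the strong count seen after `Rc::new`, after `Rc::clone`,
/// and after dropping the clone again.
fn observe_counts() -> (usize, usize, usize) {
    let ptr1: Rc<i32> = Rc::new(42);
    let after_new = Rc::strong_count(&ptr1);

    // Rc::clone copies the pointer and bumps the (non-atomic) strong count.
    let ptr2 = Rc::clone(&ptr1);
    let after_clone = Rc::strong_count(&ptr1);

    // Both handles refer to the same heap allocation.
    assert!(Rc::ptr_eq(&ptr1, &ptr2));

    // Dropping the clone lowers the count; the value stays alive via ptr1.
    drop(ptr2);
    let after_drop = Rc::strong_count(&ptr1);

    (after_new, after_clone, after_drop)
}

fn main() {
    let (a, b, c) = observe_counts();
    println!("after new: {a}, after clone: {b}, after drop: {c}");
}
```

The clone never copies the i32; it only produces a second pointer to the same block, which is why a single increment is all the clone costs.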

As we dig into this you’ll notice that the calling convention is different from the Windows one. That’s because I did these experiments on Linux to make things easy on myself, so the ABI is System V rather than the Windows ABI. This doesn’t make too much difference, but you do see that frame setup is a bit lighter.

System V X64 ABI (Application Binary Interface):

General-purpose registers:

  • RDI: Used for the first function argument.
  • RSI: Used for the second function argument.
  • RDX: Used for the third function argument.
  • RCX: Used for the fourth function argument.
  • R8 and R9: Used for the fifth and sixth function arguments, respectively.
  • RAX: Used for the return value.
  • RBX, RBP, R12-R15: Callee-saved registers that must be preserved across function calls.

Floating-point and SIMD registers:

  • XMM0-XMM7: Used for passing floating-point or SIMD arguments; XMM0 and XMM1 carry floating-point or SIMD return values.

The main thing we have to look at is the codegen for the shared_ptr function.

hello`hello::shared_ptr::h770517f0dc95aff5:
// stash rbx, we use it and it's preserved
0x0c967 <+0>: push rbx

// allocate room for ptr1 and ptr2
0x0c968 <+1>: sub rsp, 0x10

// alloc i32 shared ptr value 42 (rdi is arg1)
0x0c96c <+5>: push 0x2a
0x0c96e <+7>: pop rdi

// alloc::rc::Rc$LT$T$GT$::new::h6e5de24059a41191 at rc.rs:374
0x0c96f <+8>: call 0x0ca63 ;

// store returned Rc<i32> into ptr1
0x0c974 <+13>: mov qword ptr [rsp], rax

// increment the ref count (not atomic)
// [inlined] alloc::rc::RcInnerPtr::inc_strong
0x0c978 <+17>: inc qword ptr [rax]

// this is a sanity check, the count should be non-zero
0x0c97b <+20>: je 0x0c9c1 ; <+90>

// rbx is preserved
0x0c97d <+22>: mov rbx, rax

// save ptr2
0x0c980 <+25>: mov qword ptr [rsp + 0x8], rax

// fetch the value out of the shared pointer, this will be arg3
0x0c985 <+30>: mov edx, dword ptr [rax + 0x10]

// "Value:"
0x0c988 <+33>: lea rdi, [rip + 0x36672]
// length 7
0x0c98f <+40>: push 0x7
0x0c991 <+42>: pop rsi

// arg1 is rdi (string), arg2 is rsi (length), arg3 is rdx
// hello::printf::hddab0e4c0f1b5399 at main.rs:5
0x0c992 <+43>: call 0x0c8fb

// now load the value out of ptr2 which is really the same as ptr1
// but now stored in rbx
0x0c997 <+48>: mov edx, dword ptr [rbx + 0x10]

// get the string into rdi
0x0c99a <+51>: lea rdi, [rip + 0x36667]

// length 14
0x0c9a1 <+58>: push 0xe
0x0c9a3 <+60>: pop rsi

// arg1 is rdi (string), arg2 is rsi (length), arg3 is rdx
// hello::printf::hddab0e4c0f1b5399 at main.rs:5
0x0c9a4 <+61>: call 0x0c8fb

// arg1 is ptr2, call teardown helper
0x0c9a9 <+66>: lea rdi, [rsp + 0x8]

// drop reference count ptr2
0x0c9ae <+71>: call 0x0ca94

// arg1 is ptr1, call teardown helper
0x0c9b3 <+76>: mov rdi, rsp

// drop reference count ptr1
0x0c9b6 <+79>: call 0x0ca94

// clear the frame and we're done
0x0c9bb <+84>: add rsp, 0x10
0x0c9bf <+88>: pop rbx
0x0c9c0 <+89>: ret
    // assorted things for exception cases not in main flow
0x0c9c1 <+90>: ud2 // undefined opcode, for sure crash
----
0x0c9c3 <+92>: ud2
----
// pointer teardown
0x0c9c5 <+94>: mov rbx, rax
0x0c9c8 <+97>: lea rdi, [rsp + 0x8]
// drop ptr2
0x0c9cd <+102>: call 0x0c8f6
// drop ptr1
0x0c9d2 <+107>: mov rdi, rsp
0x0c9d5 <+110>: call 0x0c8f6
0x0c9da <+115>: mov rdi, rbx
// _Unwind_Resume
0x0c9dd <+118>: call 0x0a040
0x0c9e2 <+123>: ud2
// drop helper for Rc, drops arg1 (rdi)
// strong then weak
// note not atomic!
0x0ca94 <+0>: mov rdi, qword ptr [rdi]
0x0ca97 <+3>: dec qword ptr [rdi]
0x0ca9a <+6>: jne 0x0caa2 ; <+14> at rc.rs:1604:6
0x0ca9c <+8>: dec qword ptr [rdi + 0x8]
0x0caa0 <+12>: je 0x0caa3 ; <+15> at rc.rs
0x0caa2 <+14>: ret
// dealloc if needed
0x0caa3 <+15>: push 0x18
0x0caa5 <+17>: pop rsi
0x0caa6 <+18>: push 0x8
0x0caa8 <+20>: pop rdx
0x0caa9 <+21>: jmp qword ptr [rip + 0x47101]
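
The offsets in the listing suggest a heap block with the strong count at +0, the weak count at +8 and the i32 at +0x10, 0x18 bytes in all with 8-byte alignment. A rough model of that block, assuming a 64-bit target, might look like this. RcBlockModel and its methods are hypothetical names; the real type is private to the standard library and may differ in detail.

```rust
use std::cell::Cell;
use std::mem::{align_of, size_of};

// Hypothetical stand-in for the heap block behind an Rc<i32>.
// It only mirrors the offsets visible in the disassembly.
#[repr(C)]
struct RcBlockModel {
    strong: Cell<usize>, // +0x00
    weak: Cell<usize>,   // +0x08
    value: i32,          // +0x10
}

impl RcBlockModel {
    fn new(value: i32) -> Self {
        // Both counts start at 1: the strong handles share one weak reference.
        RcBlockModel { strong: Cell::new(1), weak: Cell::new(1), value }
    }

    // Models the inlined `inc`: bump the count and abort if it wrapped to zero.
    fn inc_strong(&self) {
        let n = self.strong.get().wrapping_add(1);
        self.strong.set(n);
        if n == 0 {
            std::process::abort();
        }
    }

    // Models the drop helper: strong first, then weak.
    // Returns true when the block would be deallocated.
    fn dec_strong(&self) -> bool {
        self.strong.set(self.strong.get() - 1);
        if self.strong.get() != 0 {
            return false;
        }
        self.weak.set(self.weak.get() - 1);
        self.weak.get() == 0
    }
}

// Byte offset of `value` inside the block, measured from live addresses.
fn value_offset() -> usize {
    let block = RcBlockModel::new(42);
    let base = &block as *const RcBlockModel as usize;
    let field = &block.value as *const i32 as usize;
    field - base
}

fn main() {
    println!("size = {:#x}, align = {}", size_of::<RcBlockModel>(), align_of::<RcBlockModel>());
    println!("value offset = {:#x}", value_offset());

    let block = RcBlockModel::new(42);
    block.inc_strong(); // the clone
    println!("free after first drop?  {}", block.dec_strong());
    println!("free after second drop? {}", block.dec_strong());
}
```

On a 64-bit target the size and alignment line up with the 0x18 and 0x8 constants pushed just before the tail call to the deallocator.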

If we repeat this with Arc, the atomic version, the codegen is very similar.

use std::sync::Arc;

#[inline(never)]
fn printf(s: &str, i: i32) {
    println!("{}{}", s, i);
}

// what we want to understand in detail
#[inline(never)]
fn shared_ptr() {
    // Create an Arc pointer to an i32 value
    let ptr1: Arc<i32> = Arc::new(42);
    let ptr2 = Arc::clone(&ptr1);

    // equivalent to a C style printf call (it isn't really varargs)
    printf("Value: ", *ptr1);
    printf("Cloned Value: ", *ptr2);
}

#[inline(never)]
fn main() {
    shared_ptr();
    println!("Hello, world!");
}

The helper functions are the atomic ones, but the pattern is the same. The inline code uses the lock prefix now.
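
To make the difference concrete, here is a small sketch of a strong count built on AtomicUsize, where each update is a single atomic read-modify-write (the kind of operation that typically becomes a lock-prefixed instruction on x86-64). StrongCount, acquire, release and hammer are hypothetical names for illustration, and this is a model, not the standard library's implementation.

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Hypothetical model of just the strong count.
struct StrongCount(AtomicUsize);

impl StrongCount {
    // Clone side: atomic increment, returns the new count.
    fn acquire(&self) -> usize {
        self.0.fetch_add(1, Ordering::Relaxed) + 1
    }

    // Drop side: atomic decrement, returns true if this was the last reference.
    fn release(&self) -> bool {
        if self.0.fetch_sub(1, Ordering::Release) != 1 {
            return false;
        }
        // Synchronize with every earlier release before tearing down.
        fence(Ordering::Acquire);
        true
    }
}

// Several threads take and release references concurrently.
// With atomic updates the count always returns to its starting value of 1.
fn hammer(threads: usize, iters: usize) -> usize {
    let count = Arc::new(StrongCount(AtomicUsize::new(1)));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&count);
            thread::spawn(move || {
                for _ in 0..iters {
                    c.acquire();
                    c.release();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    count.0.load(Ordering::SeqCst)
}

fn main() {
    println!("final count: {}", hammer(4, 10_000));
}
```

A plain Cell-based counter could not be shared across threads like this at all; Rust rejects it at compile time, which is why Rc gets away with the cheaper non-atomic instructions.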

hello`hello::shared_ptr::h770517f0dc95aff5:
// preserve rbx, space for ptr1 & ptr2
0x0c98e <+0>: push rbx
0x0c98f <+1>: sub rsp, 0x10

// alloc args rsi, rdi
0x0c993 <+5>: push 0x18
0x0c995 <+7>: pop rdi
0x0c996 <+8>: push 0x8
0x0c998 <+10>: pop rsi
// alloc::sync::Arc$LT$T$GT$::new
0x0c999 <+11>: call qword ptr [rip + 0x471a9]

// failed alloc case
0x0c99f <+17>: test rax, rax
0x0c9a2 <+20>: je 0x0ca03 ; <+117> at main.rs

// save ptr1 in rbx
0x0c9a4 <+22>: mov rbx, rax
0x0c9a7 <+25>: push 0x1
0x0c9a9 <+27>: pop rax

// set ref count 1 in weak and strong count
0x0c9aa <+28>: mov qword ptr [rbx], rax
0x0c9ad <+31>: mov qword ptr [rbx + 0x8], rax

// store the 42 into the value part
0x0c9b1 <+35>: mov dword ptr [rbx + 0x10], 0x2a

// store rbx into ptr1, then do the clone's increment; note the lock prefix, this inc is atomic
0x0c9b8 <+42>: mov qword ptr [rsp], rbx
0x0c9bc <+46>: lock
0x0c9bd <+47>: inc qword ptr [rbx]

// fault check for negative strong ref count
// panic on failure
0x0c9c0 <+50>: jle 0x0ca11

// store ptr2, then set up the length 7 string "Value: "
0x0c9c2 <+52>: mov qword ptr [rsp + 0x8], rbx
0x0c9c7 <+57>: lea rdi, [rip + 0x36633]
0x0c9ce <+64>: push 0x7
0x0c9d0 <+66>: pop rsi
// i32 in edx, constant 42 folded all the way down
0x0c9d1 <+67>: push 0x2a
0x0c9d3 <+69>: pop rdx

// hello::printf::hddab0e4c0f1b5399 at main.rs:5
0x0c9d4 <+70>: call 0x0c922

// length 14 string "Cloned Value:", i32 in edx
0x0c9d9 <+75>: mov edx, dword ptr [rbx + 0x10]
0x0c9dc <+78>: lea rdi, [rip + 0x36625]
0x0c9e3 <+85>: push 0xe
0x0c9e5 <+87>: pop rsi

// hello::printf::hddab0e4c0f1b5399 at main.rs:5
0x0c9e6 <+88>: call 0x0c922

// downcount for ptr2
0x0c9eb <+93>: lea rdi, [rsp + 0x8]
0x0c9f0 <+98>: call 0x0c913

// downcount for ptr1
0x0c9f5 <+103>: mov rdi, rsp
0x0c9f8 <+106>: call 0x0c913

// restore the frame and we're done
0x0c9fd <+111>: add rsp, 0x10
0x0ca01 <+115>: pop rbx
0x0ca02 <+116>: ret
    // exception code... out of memory (null allocated Arc)
0x0ca03 <+117>: push 0x18
0x0ca05 <+119>: pop rdi
0x0ca06 <+120>: push 0x8
0x0ca08 <+122>: pop rsi
0x0ca09 <+123>: call qword ptr [rip + 0x47529]
0x0ca0f <+129>: ud2
0x0ca11 <+131>: ud2
0x0ca13 <+133>: ud2
// more code associated with EH cleanup,
// this is called by EH handlers
0x0ca15 <+135>: mov rbx, rax
0x0ca18 <+138>: jmp 0x0ca27 ; <+153> at main.rs
// unwind logic
0x0ca1a <+140>: mov rbx, rax
// downcount ptr2
0x0ca1d <+143>: lea rdi, [rsp + 0x8]
0x0ca22 <+148>: call 0x0c913
// downcount ptr1
0x0ca27 <+153>: mov rdi, rsp
0x0ca2a <+156>: call 0x0c913
0x0ca2f <+161>: mov rdi, rbx
0x0ca32 <+164>: call 0x0a040 ; _Unwind_Resume
0x0ca37 <+169>: ud2
0x0ca39 <+171>: call qword ptr [rip + 0x47309]
0x0ca3f <+177>: ud2

Comparable C++ code generated with clang for this ABI was 298 bytes, generally due to inlining choices, and that was without the EH stuff. So the Rust atomics are cheaper in code size. Rust consistently uses helpers to downcount and release, which results in a smaller binary. Even counting the shared helpers we get about 220 bytes of Rust codegen.

For comparison, here’s the shared code for Arc pointer downcounts.

// down counts Arc
hello`core::ptr::drop_in_place$LT$alloc..sync..Arc$LT$i32$GT$$GT$::hf94c93c50ebb536d:

// get the stored count address from the Arc pointer
0x0c913 <+0>: mov rax, qword ptr [rdi]

// decrement the strong count
0x0c916 <+3>: lock
0x0c917 <+4>: dec qword ptr [rax]
0x0c91a <+7>: jne 0x0c921 ; <+14> at mod.rs:490:1

// we take care of the weak counts if the strong count fell to zero
// alloc::sync::Arc$LT$T$GT$::drop_slow at sync.rs:1271:26
0x0c91c <+9>: jmp 0x0c8b9

// this is our exit if the strong count is still positive
0x0c921 <+14>: ret

// downcount weak count
hello`alloc::sync::Arc$LT$T$GT$::drop_slow::h7848796d67ee9ba2:
0x0c8b9 <+0>: mov rdi, qword ptr [rdi]
// if the pointer itself is -1 (usize::MAX) there is no allocation
// behind it; as far as I can tell that value is the dangling sentinel
// a Weak made by Weak::new() carries, so there is nothing to free
0x0c8bc <+3>: cmp rdi, -0x1
0x0c8c0 <+7>: je 0x0c8d5 ; <+28> at sync.rs:1272:6

// decrement the weak count
0x0c8c2 <+9>: lock
0x0c8c3 <+10>: dec qword ptr [rdi + 0x8]
0x0c8c7 <+14>: jne 0x0c8d5 ; <+28> at sync.rs:1272:6

// when the weak count drops to zero we nuke it
0x0c8c9 <+16>: push 0x18
0x0c8cb <+18>: pop rsi
0x0c8cc <+19>: push 0x8
0x0c8ce <+21>: pop rdx

// this is a tail call to the deallocator
0x0c8cf <+22>: jmp qword ptr [rip + 0x472db]

// this is our exit if we didn't lower the weak count to zero
0x0c8d5 <+28>: ret

The above is largely the same as the non-atomic case, but with the lock prefix.
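
The strong-then-weak ordering is visible from safe code too. In the listing both counters start at 1 even though no Weak handle exists yet; as far as I know the strong handles collectively hold one implicit weak reference, which is why the public Arc::weak_count reports 0 for a fresh Arc. The sketch below, with the hypothetical helper weak_lifecycle, walks through that.

```rust
use std::sync::{Arc, Weak};

/// Returns (strong count, weak count before downgrade, weak count after
/// downgrade, upgradable while a strong handle lives, upgradable afterwards).
fn weak_lifecycle() -> (usize, usize, usize, bool, bool) {
    let strong: Arc<i32> = Arc::new(42);
    let s0 = Arc::strong_count(&strong);
    // Only explicit Weak handles are reported here.
    let w0 = Arc::weak_count(&strong);

    let weak: Weak<i32> = Arc::downgrade(&strong);
    let w1 = Arc::weak_count(&strong);

    // The temporary Arc from upgrade() is dropped at the end of the statement.
    let alive_before = weak.upgrade().is_some();

    // Strong count reaches zero: the value is dropped, but the block
    // itself stays allocated until the last Weak goes away.
    drop(strong);
    let alive_after = weak.upgrade().is_some();

    (s0, w0, w1, alive_before, alive_after)
}

fn main() {
    println!("{:?}", weak_lifecycle());
}
```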

Functor Test Cases

I started with what I consider to be the most generic pattern: the functor lives on the heap, so the data type for the functor is Box<dyn Fn(i32)>. This adds a little extra box processing, but it actually ends up being pretty good.

#[inline(never)]
fn printf(s: &str, i: i32) {
    println!("{}{}", s, i);
}

#[inline(never)]
fn call_functor(f: Box<dyn Fn(i32)>) {
    f(64);
}

#[inline(never)]
fn main() {
    let a = 32;
    let b = 16;
    call_functor(
        Box::new(move |x| {
            printf("f1 ", a + x);
        })
    );
    call_functor(
        Box::new(move |x| {
            printf("f2 ", b + x);
        })
    );
}
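
Before reading the disassembly it helps to know how big these things are. The sketch below measures a closure that captures a single i32 and the Box<dyn Fn> handle that owns it. It uses a closure that returns its result (Fn(i32) -> i32) purely so the outcome can be checked; that signature and the functor_sizes helper are illustrative choices, not the test program above.

```rust
use std::mem::{size_of, size_of_val};

/// Returns (bytes captured by the closure, bytes reported through the
/// trait object, machine words in the Box handle).
fn functor_sizes() -> (usize, usize, usize) {
    let a: i32 = 32;
    let closure = move |x: i32| a + x;

    // The closure's only state is the captured i32.
    let captured = size_of_val(&closure);

    let boxed: Box<dyn Fn(i32) -> i32> = Box::new(closure);

    // Asked through the trait object, the size comes from the metadata.
    let through_dyn = size_of_val(&*boxed);

    // The handle is a data pointer plus a metadata pointer.
    let handle_words = size_of::<Box<dyn Fn(i32) -> i32>>() / size_of::<usize>();

    assert_eq!(boxed(64), 96);
    (captured, through_dyn, handle_words)
}

fn main() {
    let (captured, through_dyn, words) = functor_sizes();
    println!("captured: {captured} bytes, via dyn: {through_dyn} bytes, handle: {words} words");
}
```

The two-word handle is what travels in rdi and rsi, and the 4-byte capture is why a single dword store is enough to initialize each functor.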

The disassembly is very economical. The normal Rust convention of transferring ownership on call works very well, with the functor being dropped by shared code. It could have been stored, but it wasn’t in this case. Had it been stored, that too would have been a normal ownership transfer.

This is the code to create two such functors and use them with call_functor.

hello`hello::main::hc72431543d354046:
// allocate one temp
// note that rax has trash, this is just cheap stack reservation
0x0ca3b <+0>: push rax

// allocate boxed functor
// alloc::alloc::exchange_malloc::hb0a7f17b28ecb40b at alloc.rs:324
0x0ca3c <+1>: call 0x0c955

// setup for the 1st functor call, capture 0x20 (32)
0x0ca41 <+6>: mov dword ptr [rax], 0x20

// rsi (arg2) is the metadata and rdi (arg1) is the data...
0x0ca47 <+12>: lea rsi, [rip + 0x4489a]
0x0ca4e <+19>: mov rdi, rax

// invoke the functor, note that as usual it takes ownership
// so it cleans up!
// hello::call_functor::h5816b5e009ebdb16 at main.rs:9
0x0ca51 <+22>: call 0x0c9f9

// same code again... different functor,
// there are two of these to avoid inlining any constants
// into call_functor
// alloc::alloc::exchange_malloc::hb0a7f17b28ecb40b at alloc.rs:324
0x0ca56 <+27>: call 0x0c955 ;

// setup for the 2nd functor call, capture 0x10 (16)
0x0ca5b <+32>: mov dword ptr [rax], 0x10
0x0ca61 <+38>: lea rsi, [rip + 0x448b0]
0x0ca68 <+45>: mov rdi, rax

// restore the stack
0x0ca6b <+48>: pop rax

// tail call for the 2nd functor invocation
// hello::call_functor::h5816b5e009ebdb16 at main.rs:9
0x0ca6c <+49>: jmp 0x0c9f9

This is the code invoking an owned functor (hence call and drop).

hello`hello::call_functor::h5816b5e009ebdb16:
// stash rbx, preserved reg and erect frame
0x0c9f9 <+0>: push rbx
0x0c9fa <+1>: sub rsp, 0x10

// we want our metadata ptr in rax so we can dereference it later
// with an offset at the call to get the address of the function
0x0c9fe <+5>: mov rax, rsi

// save rsi and rdi (the functor) for delete
0x0ca01 <+8>: mov qword ptr [rsp], rdi
0x0ca05 <+12>: mov qword ptr [rsp + 0x8], rsi

// arg 0x40 goes in rsi, normal arg (arg2)
0x0ca0a <+17>: push 0x40
0x0ca0c <+19>: pop rsi

// invoke the function
// rdi still has our data pointer to captured state
// rsi was set above to be actual argument
0x0ca0d <+20>: call qword ptr [rax + 0x28]

// get the box ptr using stack pointer address
0x0ca10 <+23>: mov rdi, rsp

// drop the boxed functor
0x0ca13 <+26>: call 0x0c918

// and we're done, restore frame and registers
0x0ca18 <+31>: add rsp, 0x10
0x0ca1c <+35>: pop rbx
0x0ca1d <+36>: ret
    // all this is EH stuff that mostly doesn't run
// but it's cleanup we might need.
0x0ca1e <+37>: mov rbx, rax
0x0ca21 <+40>: mov rdi, rsp
// drop boxed functor
0x0ca24 <+43>: call 0x0c918
0x0ca29 <+48>: mov rdi, rbx
0x0ca2c <+51>: call 0x0a040 ; _Unwind_Resume
0x0ca31 <+56>: ud2
0x0ca33 <+58>: call qword ptr [rip + 0x4730f]
0x0ca39 <+64>: ud2

Each functor has a vtable (its metadata). The pointer to the called code, the entry at offset 0x28 that call_functor dereferences, is marked with ***. Note that I removed the high-order bits from the addresses so the numbers look small; they were really at a large base address.

0x0a12e8: 0x000000000000c954 0x0000000000000004
0x0a12f8: 0x0000000000000004 0x000000000000c8f6
0x0a1308: 0x000000000000ca71 *** 0x00000000ca71

0x0a1318: 0x000000000000c954 0x0000000000000004
0x0a1328: 0x0000000000000004 0x000000000000c907
0x0a1338: 0x009999000000ca7f *** 0x00000000ca7f
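
The mechanism behind those tables can be imitated by hand: a data pointer for the captured state plus a pointer to a static table holding size, alignment and an entry point. The sketch below is only an illustration. Meta, Functor, add_captured and borrow_adder are hypothetical names, the table has fewer entries than the real one, and the layout rustc actually emits is an implementation detail. It also returns the sum instead of printing so the result can be checked.

```rust
use std::marker::PhantomData;

// Hypothetical metadata table: size and alignment of the captured state
// plus one entry point.
struct Meta {
    size: usize,
    align: usize,
    call: unsafe fn(*const (), i32) -> i32,
}

// A borrowed functor as two words: captured state and its metadata.
struct Functor<'a> {
    data: *const (),
    meta: &'static Meta,
    _state: PhantomData<&'a i32>,
}

// The "lambda": `data` plays the role of rdi, `x` the role of esi.
unsafe fn add_captured(data: *const (), x: i32) -> i32 {
    // SAFETY: callers only pass pointers produced by `borrow_adder`,
    // which always point at a live i32.
    let captured = unsafe { *(data as *const i32) };
    captured + x
}

static ADD_META: Meta = Meta { size: 4, align: 4, call: add_captured };

fn borrow_adder(captured: &i32) -> Functor<'_> {
    Functor {
        data: captured as *const i32 as *const (),
        meta: &ADD_META,
        _state: PhantomData,
    }
}

// Load the entry point from the metadata and call it with the data pointer.
fn call_functor(f: &Functor<'_>) -> i32 {
    // SAFETY: `f.data` is valid for the lifetime carried by `Functor`.
    unsafe { (f.meta.call)(f.data, 64) }
}

fn main() {
    let (a, b) = (32, 16);
    println!("f1 {}", call_functor(&borrow_adder(&a)));
    println!("f2 {}", call_functor(&borrow_adder(&b)));
    println!("captured state: {} bytes, align {}", ADD_META.size, ADD_META.align);
}
```

With dyn Fn the compiler builds the table and checks the types for you, so none of the unsafe code here is needed in practice.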

The actual code for the lambdas is very tight.

hello`hello::main::_$u7b$$u7b$closure$u7d$$u7d$::hfb85f8fef4d880bd:
// do the add, rdi has our captured data
// esi had our arg2, it becomes the new arg2
0x0ca71 <+0>: add esi, dword ptr [rdi]

// we get the string literal as arg1
0x0ca73 <+2>: lea rdi, [rip + 0x36587]

// tail call to printf
0x0ca7a <+9>: jmp 0x0c98e

The other functor is identical, it’s just a different string literal.

hello`hello::main::_$u7b$$u7b$closure$u7d$$u7d$::ha4e36a648bf249c7:
0x0ca7f <+0>: add esi, dword ptr [rdi]
0x0ca81 <+2>: lea rdi, [rip + 0x3657c]
0x0ca88 <+9>: jmp 0x0c98e

And lastly, the simplest functor setup, where we borrow a functor on the stack. This is the one that is most analogous to the minimal old-school C “use a function pointer and a void *” approach. But this is typesafe.

The generated code is staggeringly good. If the metadata were smaller, it would beat even the bespoke C code I wrote as a floor for this approach.

#[inline(never)]
fn printf(s: &str, i: i32) {
    println!("{}{}", s, i);
}

#[inline(never)]
fn call_functor(f: &dyn Fn(i32) -> ()) {
    f(64);
}

#[inline(never)]
fn main() {
    let a = 32;
    let b = 16;
    call_functor(
        &move |x| {
            printf("f1 ", a + x);
        }
    );
    call_functor(
        &move |x| {
            printf("f2 ", b + x);
        }
    );
}
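
A checkable variant of the same borrowing pattern, for anyone who wants to poke at it: instead of printing, each closure records its result in a Cell, so nothing is allocated on the heap and the values can be asserted. The run helper and the Cell-based recording are illustrative additions, not part of the test program above.

```rust
use std::cell::Cell;
use std::mem::size_of;

// Same shape as the article's call_functor: a borrowed trait object.
fn call_functor(f: &dyn Fn(i32)) {
    f(64);
}

/// Calls two stack-allocated closures through `&dyn Fn(i32)` and
/// returns what each one computed.
fn run() -> (i32, i32) {
    let a = 32;
    let b = 16;
    let seen_a = Cell::new(0);
    let seen_b = Cell::new(0);

    // Each closure lives in the caller's frame; only a borrow is passed.
    call_functor(&|x| seen_a.set(a + x));
    call_functor(&|x| seen_b.set(b + x));

    (seen_a.get(), seen_b.get())
}

fn main() {
    println!("{:?}", run());
    println!(
        "&dyn Fn(i32) is {} words",
        size_of::<&dyn Fn(i32)>() / size_of::<usize>()
    );
}
```

Because the callee only borrows the functor, there is nothing for it to drop, which is what lets call_functor shrink to a load and a tail call.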

It’s still a universal functor call but the logic is super simple!

hello`hello::call_functor::h9b919e25747ab248:
// we don't own the memory this time so it's just a call.
// We compute the target address from the metadata
0x0c984 <+0>: mov rax, qword ptr [rsi + 0x28]

// rdi has the captured state, it flows like a "this" pointer
// the new arg is in rsi and we tail call to the function
0x0c988 <+4>: push 0x40
0x0c98a <+6>: pop rsi

// tail call the lambda!
0x0c98b <+7>: jmp rax

The setup for the call is likewise super simple!

hello`hello::main::hc72431543d354046:
// allocate 8 bytes of temp storage on the stack (rax is junk)
0x0c98d <+0>: push rax

// rdi is our storage pointer (on the stack)
0x0c98e <+1>: mov rdi, rsp

// capture the value 32 (note this is a dword)
0x0c991 <+4>: mov dword ptr [rdi], 0x20
// pass the metadata
0x0c997 <+10>: lea rsi, [rip + 0x4494a]
// rdi, rsi have the functor data and metadata
// now use call_functor above
0x0c99e <+17>: call 0x0c984

// same thing again, this time we use rsp+4 for storage and
// capture the value 16
0x0c9a3 <+22>: lea rdi, [rsp + 0x4]
0x0c9a8 <+27>: mov dword ptr [rdi], 0x10
0x0c9ae <+33>: lea rsi, [rip + 0x44963]

// again rdi, rsi is the functor
0x0c9b5 <+40>: call 0x0c984
0x0c9ba <+45>: pop rax
0x0c9bb <+46>: ret

The closure code is not shown, it’s exactly the same as the previous case — only the relative rip offset to call printf changed because things slid around a little.

The code above is very competitive with the best-case C code!

In C++, using minimal old-school void * and captured state on the stack:

main            86-28=58 (28 bytes was security cookie)
call_func_ptr   10
lambda1         13
lambda2         13
                -----
                94
                =====

In Rust, type safe, captured state on the stack also:

main            47
call_functor     9
lambda1         14
lambda2         14
                -----
                84
                =====

However, Rust does have some metadata — 48 bytes of tables for each lambda. That’s 96 total metadata bytes. C had no metadata cost. With metadata the Rust cost is 180 bytes.
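
For anyone who wants to re-run the tallies, the arithmetic is small enough to write down. The byte counts are the ones quoted in this article; the total helper is a hypothetical convenience.

```rust
/// Sums a column of per-function code sizes in bytes.
fn total(sizes: &[u32]) -> u32 {
    sizes.iter().sum()
}

fn main() {
    // main (minus the 28-byte security cookie), call_func_ptr, two lambdas
    let cpp_minimal = total(&[86 - 28, 10, 13, 13]);

    // main, call_functor, two lambdas
    let rust_code = total(&[47, 9, 14, 14]);

    // 48 bytes of tables for each of the two lambdas
    let rust_metadata = 2 * 48;
    let rust_total = rust_code + rust_metadata;

    println!("C++ void * version: {cpp_minimal} bytes");
    println!("Rust code only:     {rust_code} bytes");
    println!("Rust with metadata: {rust_total} bytes");
}
```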

The simplest C++ functors test was:

main                  89 + 112 = 201
call_functor          113
lambda1                19
lambda2                19
conditional dealloc    14
this calc               5
destructor             44
copy constructor       19
                      -----
                      434
                      =====

Closing Thoughts

Clearly a lot of thought has gone into the Rust runtime model, and many choices were made for economy. In my view these will translate directly to speed.

The C++ choices feel bloated by comparison. I have to say I was very disappointed by the choices. There is no law that says C++ has to have massive functors and indeed you could make your own frugal::function but I continue to find that idiomatic "modern" C++ results in what I can only call self-defeating bloat.

Written by Rico Mariani

I’m an Architect at Microsoft; I specialize in software performance engineering and programming tools.
