std::pointer types — a tear down and discussion

Rico Mariani
31 min readApr 5, 2023

--

I was wanting to get a more complete understanding of exactly what code gets generated when we compile some typical smart pointer patterns in C++ so I created a sample app that did some of the things. I then looked at the disassembly and annotated it heavily. Now we should keep in mind that this is just one compiler, MSC, and it's on Windows, so the code would be different on Linux if only because the ABI is different and clang/gcc are likely to make some different choices overall. But I think it's still super interesting to dig into the output at this level.

Below is the program I started with. Later I go into an "overtime" section where I added a few more small functions to look at some patterns. The source for those is included below but they are very small fragments so it isn't much anyway.

Keep in mind that some of what is happening below is done just to ensure inlining doesn't make it all vanish, so there are some non-idiomatic patterns. I built with /O2 (prefer speed) which is likely to make different choices also. And I didn't enable any advanced security features or anything like that.

So YMMV. But here it is.

// Trivia: this code was generated by ChatGPT and lightly edited
#include <memory>

using namespace std;

__declspec(noinline)
void sharedPtrFunction() {
shared_ptr<int> ptr1 = make_shared<int>(10);
printf("value of ptr1: %d\n", *ptr1);
printf("Use count before ptr2: %d\n", ptr1.use_count());

shared_ptr<int> ptr2 = ptr1;
printf("value of ptr2: %d\n", *ptr2);
printf("Use count before exit: %d\n", ptr1.use_count());
}

__declspec(noinline)
void uniquePtrFunction() {
unique_ptr<int> uptr(new int(20));
printf("value of uptr: %d\n", *uptr);
}

__declspec(noinline)
void weakPtrFunction() {
shared_ptr<int> sptr = make_shared<int>(30);
weak_ptr<int> wptr = sptr;
printf("Use count before lock: %d\n", sptr.use_count());
if (auto sptr2 = wptr.lock()) {
printf("value of wptr: %d\n", *sptr2);
printf("Use count during lock: %d\n", sptr.use_count());
}
else {
printf("not reached in this test\n");
}
printf("Use count after lock: %d\n", sptr.use_count());
}

__declspec(noinline)
void oldSchoolFunction() {
int* ptr = new int(5);
printf("value of ptr: %d\n", *ptr);
delete ptr;
}

int main() {
sharedPtrFunction();
uniquePtrFunction();
weakPtrFunction();
oldSchoolFunction();
return 0;
}

Looking at the first function, it does a few things, and we can use it to tear down some typical things that happen in the prolog as well as some shared pointer patterns. It also does some unusual things. The use_count call will cause atypical code patterns to appear in the output. However, that's no big deal, we'll have other cases later that don't have those and the extra code helps to show where things are stored.

Let's review the code:

void sharedPtrFunction() {
shared_ptr<int> ptr1 = make_shared<int>(10);
printf("value of ptr1: %d\n", *ptr1);
printf("Use count before ptr2: %d\n", ptr1.use_count());

shared_ptr<int> ptr2 = ptr1;
printf("value of ptr2: %d\n", *ptr2);
printf("Use count before exit: %d\n", ptr1.use_count());
}

Looking at the first function, it does a few things, and we can use it to tear down some typical things that happen in the prolog as well as some shared pointer patterns. It also does some unusual things. The use_count call will cause atypical code patterns to appear in the output. However, that's no big deal, we'll have other cases later that don't have those and the extra code helps to show where things are stored.

And now the disassembly…

void sharedPtrFunction() {// The x64 Application Binary Interface (ABI) uses a 
// four-register fast-call calling convention by default.
// Space is allocated on the call stack as a "home storage"
// for callees to save those registers.
// Note: this is Windows ABI only. Linux is different, it has
// home storage and up to 6 registers used for arguments.
//
// What follows is the frame after the preamble has run.
//
// I have addresses in increasing order going down putting
// the top of the stack at the top. On x64 esp points to the last
// thing that was pushed, not the next free thing. i.e.
// * push is "*(--rsp) = value"
// * pop is "result = *(rsp++)";
//
// rsp - nnn -> anything pushed will go here
// rsp -> new "home storage" 0 (our callees use this space)
// rsp + 08h -> new "home storage" 1
// rsp + 10h -> new "home storage" 2
// rsp + 18h -> new "home storage" 3
// rsp + 20h -> saved rdi
// rsp + 28h -> the return address (at entry rsp pointed here)
// rsp + 30h -> "home storage" slot 0 control block address here
// rsp + 38h -> "home storage" slot 1 rbx saved here (was rsp+10h)
// rsp + 40h -> "home storage" slot 2 unused (rdi could be here?)
// rsp + 48h -> "home storage" slot 3 unused (not needed)
// use "home storage" to save rbx
01080 mov qword ptr [rsp+10h],rbx
// preserve rdi the old school way
01085 push rdi
// allocate "home storage" for functions we call
01086 sub rsp,20h
shared_ptr<int> ptr1 = make_shared<int>(10);
// allocate room for 3 qwords, for the control block plus the integer
0108A mov ecx,18h
0108F call operator new (013F0h)
// stash the smart pointer value in rbx
01094 mov rbx,rax
// and also stash the control block pointer value on the stack
01097 mov qword ptr [rsp+30h],rax
// zero xmm0 128 bits of zeros (16 bytes), used shortly
0109C xorps xmm0,xmm0
printf("value of ptr1: %d\n", *ptr1);// load the value of the string in pointer rcx
// this will be the first arg to printf
// we're actualy interspersing code for the next statement
// we're still also construting the shared pointer
0109F lea rcx,[string "value of ptr1: %d\n" (03320h)]
// get the value of the integer 10 which we plan to store
// put it in edx as arg2, where printf needs it
010A6 mov edx,0Ah
shared_ptr<int> ptr1 = make_shared<int>(10);// blast two qwords to zeros at rax, the control block for ptr1
010AB movups xmmword ptr [rax],xmm0
// set its strong use count to 1 (32 bits)
010AE mov dword ptr [rax+8],1
// set its weak use count to 1 (32 bits)
010B5 mov dword ptr [rax+0Ch],1
// load the address of the control block vtable
010BC lea rax,[std::_Ref_count_obj2<int>::`vftable']
// store the vtable address in the control block,
// remember we stored the address of the control block in rbx
// so until the above rax and rbx were the same.
// rbx is preseved across calls so we'll continue to use it later...
010C3 mov qword ptr [rbx],rax
// finally, store 10, control block offset rbx+10h is the payload
010C6 mov dword ptr [rbx+10h],0Ah
printf("value of ptr1: %d\n", *ptr1);
// now we already have rcx and rdx holding the format string and
// the value; we can print
010CD call printf (01010h)
// we can get the use count (strong) from the control block pointer
// which is still in rbx. rdx will hold the count, arg2 to printf
printf("Use count before ptr2: %d\n", ptr1.use_count());
010D2 mov edx,dword ptr [rbx+8]
// and the format string
010D5 lea rcx,[string "Use count before ptr2: %d\n"]
// and now we can print the use count (rcx, rdx args as usual)
010DC call printf (01010h)
// now we're going to make a copy of the pointer shared_ptr<int> ptr2 = ptr1;// the minimum thing we need to do is interlocked increment the
// strong count in this case we could have optimized out everything
// else pretty much but not this, well I guess this too but then we
// would have to be pretty omniscient about the use count to print
// it correctly. All the printfs are forcing normal housekeeping.
// Note we didn't actually copy the pointer here, that was optimized
// out. We could have done that copy with two 64-bit ops or one xmm
// 128 bit op. But, instead, we don't even bother.
// We're left with this:
010E1 lock inc dword ptr [rbx+8] printf("value of ptr2: %d\n", *ptr2);// as before we read the value right from rbx+10,
// we didn't actually use ptr2, the whole thing is gone
010E5 mov edx,dword ptr [rbx+10h]
010E8 lea rcx,[string "value of ptr2: %d\n" (03358h)]
010EF call printf (01010h)
printf("Use count before exit: %d\n", ptr1.use_count());// we're getting the count again, note this is not thread safe...
// there's no memory barrier it's a boring read...
// so really, don't do this :D
// note that we often generate a null check before reading the
// control block to get the use count but in this case we've
// inferred that it can't be null so we're reading the memory
// directly. The norm is to check for a null control block and
// if null use zero as the count.
010F4 mov edx,dword ptr [rbx+8]
010F7 lea rcx,[string "Use count before exit: %d\n"]
010FE call printf (01010h)
}
// now we have to do the teardown, load -1 into edi and eax
01103 mov edi,0FFFFFFFFh
01108 mov eax,edi
// interlocked exchange add (-1) to atomically downcount the strong count
// after this eax will hold what the count *was* before we changed it
0110A lock xadd dword ptr [rbx+8],eax
// if the strong count was not 1, we're done, there are still refs
// if it was 1, then it is now zero, hence time to destruct.
// in our case we will never take this branch because we know
// there is a 2nd shared pointer. We'll always get a 2 in eax here.
0110F cmp eax,1
01112 jne sharedPtrFunction+0B1h (01131h)
// load the vtable pointer from the control block
01114 mov rax,qword ptr [rbx]
// recall the layout of the control block:
// * vtable ptr (64 bits)
// * strong count (32 bits)
// * weak count (32 bits)
// setup rcx as the "this" argument for the destructor
01117 mov rcx,rbx
// call the first thing in the vtable, the dtor
// in our case this is a no-op because int doesn't need destruction
0111A call qword ptr [rax]
// Now we're going to downcount the weak count, all the strong
// pointers together share one weak count. We need to do this
// because we can't free the control block until both weak and
// strong references are gone. The last strong ref does one weak
// downcount, that means strong refs don't have to upcount both
// counts and only downcount twice rarely.
0111C mov eax,edi // -1
0111E lock xadd dword ptr [rbx+0Ch],eax
01123 cmp eax,1
01126 jne sharedPtrFunction+0B1h (01131h)
// if the weak count was 1, it's now zero and it's time to destroy
01128 mov rax,qword ptr [rbx]
0112B mov rcx,rbx
0112E call qword ptr [rax+8] // control block destructor
// we have two shared pointers this is the same code for ptr2
// this time we will take the cleanup path
01131 mov eax,edi
01133 lock xadd dword ptr [rbx+8],eax
01138 cmp eax,1
// we will never take the branch in this code
0113B jne sharedPtrFunction+0D8h (01158h)
// we will for sure call the destructor here
0113D mov rax,qword ptr [rbx]
01140 mov rcx,rbx
01143 call qword ptr [rax] // destroy the int
// now the weak pointer... for sure we will do this
01145 lock xadd dword ptr [rbx+0Ch],edi
0114A cmp edi,1
// we never take this branch, this is the last weak ref
0114D jne sharedPtrFunction+0D8h (01158h)
// this code always runs, here we clean up the control block
0114F mov rax,qword ptr [rbx] // get vtable
01152 mov rcx,rbx // load "this"
01155 call qword ptr [rax+8] // destroy the control block
// finally restore the saved rbx and fix the frame pointer
01158 mov rbx,qword ptr [rsp+38h]
0115D add rsp,20h
01161 pop rdi
01162 ret
--- include\memory
// destructor for the control block itself, this is a tail call
delete this;
// deleting a null control block pointer is a no-op
01360 test rcx,rcx
01363 je 01371h
// we get the vtable pointer from rcx which points to the object
// hence rcx is already set up to be the this pointer for the call
01365 mov rax,qword ptr [rcx]
// rdx carries the directive to do the dealloc (it's a bool)
01368 mov edx,1
// this is the virtual destructor for the control block, below as
// std::_Ref_count_obj2<int>::`scalar deleting destructor'
// tail call optimized
0136D jmp qword ptr [rax+10h]
}
// only use if the we were in the null case
01371 ret
// normal destructor for the smart pointer storage type,
// in this case its int so there is nothing to do
_Destroy_in_place(_Storage._Value);
}
01380 ret 0
// this is the actual control block destructorstd::_Ref_count_obj2<int>::`scalar deleting destructor':
// save rbx, set up a frame (didn't use home storage, wonder why)
01390 push rbx
01392 sub rsp,20h
// we get our vtable pointer directly
01396 lea rax,[std::_Ref_count_obj2<int>::`vftable']
0139D mov rbx,rcx
// we change the vtable type to this base class,
// note that when destructing vtables unwind to base types!
013A0 mov qword ptr [rcx],rax
// Now we check to see if a delete was requested (in our case
// it will be) Note that test dl,1 is an AND of dl with 1 and
// sets the flags accordingly, it's not a compare.
// If dl == 1 as we expect then the result of the AND will be 1
// and the Z flag is not set, hence je is not taken.
// we will never take this branch in our sample program
013A3 test dl,1
013A6 je 013B2h
// delete the allocated storage (control block and data)
013A8 mov edx,18h
013AD call operator delete (0142Ch)
// we're going to return "this", it's in rbx, having been moved there
// from rcx, we need it in rax
013B2 mov rax,rbx
// restore rbx and the frame pointers and we're outta here
013B5 add rsp,20h
013B9 pop rbx
013BA ret

In the above we get a good look at a bunch of things. The normal construction pattern for a smart pointer, which is pretty economical, and the teardown, which requires up to two interlocked decrements. The teardown pattern seems to take between 45 and 50 bytes depending on which registers happen to hold the pointer in question. The code pattern is always the same,

// decrements are interlocked
// this is 45 to 50 bytes of code
if (strong_refs-- == 1) {
destroy_object();
if (weak_refs-- == 1) {
destroy_control_block();
}
}

The unique pointer version of this code is really simple by comparison. Let’s have a look.

Recall the unique pointer version looked like this, and the disassembly follows.

void uniquePtrFunction() {
unique_ptr<int> uptr(new int(20));
printf("value of uptr: %d\n", *uptr);
}
void uniquePtrFunction() {
// preserve rbx and sert up a frame
01170 push rbx
01172 sub rsp,20h
unique_ptr<int> uptr(new int(20));
// allocate 4 bytes, for the int, that's it...
01176 mov ecx,4
0117B call operator new (013F0h)
printf("value of uptr: %d\n", *uptr);// the compiler has figured out that the int holds 20 so it loads
// it in a register directly
01180 mov edx,14h
// we're setting format string for printf
01185 lea rcx,[string "value of uptr: %d\n"]
// before we make the call, we save allocated memory pointer in rbx
// note rbx is preserved across calls (we had to save it too, above)
0118C mov rbx,rax
// and we store the value 20 in there
0118F mov dword ptr [rax],14h
// now we print, we have the args in rcx, rdx as usual
// we can't tail call becaue we have cleanup to do
01195 call printf (01010h)
}
// cleanup is simple, we need to free 4 bytes 100% of the time
0119A mov edx,4
0119F mov rcx,rbx
// we clean up the stack, restore rbx
011A2 add rsp,20h
011A6 pop rbx
// and tail call to delete rcx (pointer), rdx (length) set up above
011A7 jmp operator delete (0142Ch)=

The code above was basically optimal for this problem. I guess a truly psychic compiler might have figured out that it didn’t need to do the allocations at all and just printed the correct value but that’s probably asking too much.

Now let’s look at the old school version of all this, by an astonishing coincidence it’s identical except that it stores 5 in the int instead of 20. There is was no overhead in unique_ptr<>, it was the same as doing it manually (as expected). Let’s refresh our memory of the old school function:

void oldSchoolFunction() {
int* ptr = new int(5);
printf("value of ptr: %d\n", *ptr);
delete ptr;
}

Digging in…

void oldSchoolFunction() {
// same frame setup
01300 push rbx
01302 sub rsp,20h
// same alloc
int* ptr = new int(5);
01306 mov ecx,4
0130B call operator new (013F0h)
// same print (but using 5 not 20)
printf("value of ptr: %d\n", *ptr);
01310 mov edx,5
01315 lea rcx,[string "value of ptr: %d\n" (03440h)]
// same storing the allocated address in rbx
0131C mov rbx,rax
0131F mov dword ptr [rax],5
01325 call printf (01010h)
delete ptr;
// same cleanup on exit, length in rdx, pointer in rcx
0132A mov edx,4
0132F mov rcx,rbx
}
01332 add rsp,20h
01336 pop rbx
delete ptr;
// same tail call optimization
01337 jmp operator delete (0142Ch)

Can’t improve on perfection, right?

Now lets move on the the weak pointer function, it’s more interesting. There is an allocation of a shared pointer, then a weak pointer, then an upgrade to a strong reference, then the upgraded ref is destroyed then the weak reference is destroyed, and, finally, the strong reference is destroyed. It’s a lot of code…

void weakPtrFunction() {
shared_ptr<int> sptr = make_shared<int>(30);
weak_ptr<int> wptr = sptr;
printf("Use count before lock: %d\n", sptr.use_count());
if (auto sptr2 = wptr.lock()) {
printf("value of wptr: %d\n", *sptr2);
printf("Use count during lock: %d\n", sptr.use_count());
}
else {
printf("not reached in this test\n");
}
printf("Use count after lock: %d\n", sptr.use_count());
}

Here we go…

void weakPtrFunction() {
// we're gonna set up the frame like we in the strong pointer sample
// and again we'll use our home storage for our own purposes.
// rbx and rsi will be stored in the home space, another slot will
// be used to store the control block we're going to allocate but
// for reasons I do not understand rdi didn't get stored in the home
// storage in the remaining 64-bit slot. Like many other cases we
// push it old-school. Maybe that's smaller code, I dunno.
// after the preamble the frame looks like this:// rsp -> new "home storage" 0 our callees get this space
// rsp + 8 -> new "home storage" 1
// rsp + 10 -> new "home storage" 2
// rsp + 18 -> new "home storage" 3
// rsp + 20 -> saved rdi
// rsp + 28 -> the return address
// rsp + 30 -> "home storage" slot 0 control block ptr saved here
// rsp + 38 -> "home storage" slot 1 rbx saved here
// rsp + 40 -> "home storage" slot 2 rsi saved here
// rsp + 48 -> "home storage" slot 3 rdi could have gone here?
// save rbx and rsi into home storage, push rdi, alloc home storage
// we don't need space for our locals because home storage has space
011B0 mov qword ptr [rsp+10h],rbx
011B5 mov qword ptr [rsp+18h],rsi
011BA push rdi
011BB sub rsp,20h
// "home storage" is now at rsp+30, previously it was at rsp+8// we want a zero in edi for math later
011BF xor edi,edi
shared_ptr<int> sptr = make_shared<int>(30);// I have no idea why we put rdi into the calculation...
// 0+18h is still 18h. Why it was done like this, who knows...
// Maybe the 18h constant is smaller in this encoding.
// Anyhow alloc the control block like usual.
011C1 lea ecx,[rdi+18h]
011C4 call operator new (013F0h)
// stash the allocated pointer in rbx
011C9 mov rbx,rax
// save the control block on the stack for the strong pointer
// not that even though nominally there are two pointers in a shared_ptr
// one of them is not stored, it's going to go in rsi below.
011CC mov qword ptr [rsp+30h],rax
// zero out xmm0 so we can hammer in 16 nulls later
011D1 xorps xmm0,xmm0
// save the int pointer in rsi, this is where our int is stored
// this is the second pointer of the shared_ptr
011D4 lea rsi,[rbx+10h]
// clear the control block and set the strong and weak counts to 1
// then store the vtable pointer, this is standard control block
// construction.
011D8 movups xmmword ptr [rax],xmm0
011DB mov dword ptr [rax+8],1
011E2 mov dword ptr [rax+0Ch],1
011E9 lea rax,[std::_Ref_count_obj2<int>::`vftable' (03460h)]
011F0 mov qword ptr [rbx],rax
// store the value 30 in the integer
011F3 mov dword ptr [rsi],1Eh
weak_ptr<int> wptr = sptr;// we have a new weak pointer to construct, increment the weak count
// rbx is the control block, the weak count is at offet +0Ch
011F9 lock inc dword ptr [rbx+0Ch]
printf("Use count before lock: %d\n", sptr.use_count());// pull the strong use count out of the control block and print it
011FD mov edx,dword ptr [rbx+8]
01200 lea rcx,[string "Use count before lock: %d\n" (033A8h)]
01207 call printf (01010h)
// now we're trying to upgrade from weak to strong
if (auto sptr2 = wptr.lock()) {
// First we check the strong count, with no interlock, if it is
// zero then we cannot lock, the object is already destroyed.
// This will never happen in our code, it is the fast-out path for
// when Elvis has already left the building.
0120C mov eax,dword ptr [rbx+8]
0120F test eax,eax
01211 je weakPtrFunction+82h (01232h) // can't lock
// burn space (for alignment reasons?)
01213 nop dword ptr [rax]
01217 nop word ptr [rax+rax]
// now we try to upcount... this is a rendezvous of sorts with
// other threads that may be trying to downcount.
// We have the lock count that we saw in rax, we add 1 to it
// and store it in ecx, this what we want to store.
01220 lea ecx,[rax+1]
// Now we try to switch ecx into the count but only if the count
// is still equal to eax. note that eax is an implicit argument
// in this opcode.
// Basically, with one ATOMIC instruction we do all of this magic!
//
// if ([rbx+8] == eax) {
// eax preserved
// [rbx+8] = ecx; // exchange success
// set zero flag; // je will succeed
// }
// else {
// eax = [rbx+8]; // no exchange case
// clear zero flag; // je will not succeed
// }
//
01223 lock cmpxchg dword ptr [rbx+8],ecx [eax implied]
01228 je weakPtrFunction+11Ch (012CCh) // do lock body!
// If we didn't take the branch then eax is the value
// that was in [rbx+8]. If it is now zero, we can't lock,
// the object is gone. Otherwise, we try again upcounting from
// the new, updated, value.
0122E test eax,eax
01230 jne weakPtrFunction+70h (01220h) // try again
// don't take the lock path
else {
printf("not reached in this test\n");
// this code never runs in our case, it means the weak pointer
// cannot be upgraded in our code we can always get the strong
// pointer and we never race because we only have one thread.
01232 lea rcx,[string "not reached in this test\n"]
01239 call printf (01010h)
}
// we're done with the lock, the temp strong pointer goes away here
0123E mov esi,0FFFFFFFFh
// now if we never actually took the lock and we're on the early
// out path we don't do the cleanup... recall we put 0 in rdi
// near the top. Once we create the lock, rdi will be non-zero
// as we will see below. In fact rdi points to the control block
// of the temp shared_ptr.
01243 test rdi,rdi
// skip lock destruction if lock never acquired
01246 je weakPtrFunction+0C1h (01271h)
// ok now we're going to do the usual downcount operations
// esi has -1 in it so we're adding -1 like usual.
01248 mov eax,esi
0124A lock xadd dword ptr [rdi+8],eax
0124F cmp eax,1
01252 jne weakPtrFunction+0C1h (01271h)
// If needed, get the vtable pointer out rdi, which holds the
// control block and call the destructor.
01254 mov rax,qword ptr [rdi]
01257 mov rcx,rdi
0125A call qword ptr [rax]
// Like usual, we now now reduct the weak pointer count, the last
// strong pointer went away. We put the -1 back in eax,
// do the subtract (add -1) and then test
0125C mov eax,esi
0125E lock xadd dword ptr [rdi+0Ch],eax
01263 cmp eax,1
01266 jne weakPtrFunction+0C1h (01271h)
// if that was the last weak pointer (it won't be) then we are going
// to do the control block cleanup
01268 mov rax,qword ptr [rdi]
0126B mov rcx,rdi
0126E call qword ptr [rax+8]
// In any case we're going to print the current strong count
// after the lock has been released. We load it from the control
// block still in rbx even after all this time and get the count
// out of it.
printf("Use count after lock: %d\n", sptr.use_count());
01271 mov edx,dword ptr [rbx+8]
01274 lea rcx,[string "Use count after lock: %d\n" (03420h)]
0127B call printf (01010h)
}
// Finally it's all done, this is the normal cleanup, now we're
// gonna clean up our weak pointer and then the strong pointer.
// These are the usual patterns.
// Put -1 back in eax yet again, still here in esi
// this is the weak pointer count going down for weak pointer
// destruction. Weak count is at rbx+0Ch
01280 mov eax,esi
01282 lock xadd dword ptr [rbx+0Ch],eax
01287 cmp eax,1
0128A jne weakPtrFunction+0E5h (01295h)
// Recall that all the strong pointers share one weak count
// so if the weak count goes to zero it's ALL done. We tear down
// the control block right there. This won't happen in this
// particular code because we have a strong pointer outstanding.
0128C mov rax,qword ptr [rbx]
0128F mov rcx,rbx
01292 call qword ptr [rax+8]
// Now the strong pointer goes away... first down count the strong
// count and then the weak count as before, -1 still in esi.
01295 mov eax,esi
01297 lock xadd dword ptr [rbx+8],eax
0129C cmp eax,1
0129F jne weakPtrFunction+10Ch (012BCh)
// strong count went to zero, destruct the object (an int in this case)
// this code will run
012A1 mov rax,qword ptr [rbx]
012A4 mov rcx,rbx
012A7 call qword ptr [rax] // int destruction (no-op)
// now try the weak count (again)
012A9 lock xadd dword ptr [rbx+0Ch],esi
012AE cmp esi,1
012B1 jne weakPtrFunction+10Ch (012BCh)
// if the weak count went to zero then we nuke the control block
// this always runs in our case
012B3 mov rax,qword ptr [rbx]
012B6 mov rcx,rbx
012B9 call qword ptr [rax+8]
// restore rbx, and rsi, fix the frame, restore rdi and we're done
012BC mov rbx,qword ptr [rsp+38h]
012C1 mov rsi,qword ptr [rsp+40h]
012C6 add rsp,20h
012CA pop rdi
012CB ret
// the "I got the lock" body was compiled out of line if (auto sptr2 = wptr.lock()) {
// if we acquire the lock, we store the pointer in rdi,
// otherwise rdi stays null. rdi will serve as a flag that
// we did the upcounts.
012CC mov rdi,rbx
// Now here's something weird... rsi has in it a pointer to the
// actual storage for the int but we have this redundant looking
// test to make sure it's not null... if it's null then we pretend
// we didn't get the lock... only we did... sort of. But in the
// null case we will run the code that says we didn't get the lock
// but we will in fact also run the code that downcounts the
// temporary shared pointer, which is correct.
// We have to do this because at this point we already did the
// upcounts. This is kind of weird: if we're here rsi can't be
// null... I have to assume the codegen is more general for other
// cases where the data pointer might be null.
012CF test rsi,rsi
012D2 je weakPtrFunction+82h (01232h)
printf("value of wptr: %d\n", *sptr2);// rsi points to the int, not the control block
// we set it up in rdx and the format string in rcx like usual
// and call printf
012D8 mov edx,dword ptr [rsi]
012DA lea rcx,[string "value of wptr: %d\n"]
012E1 call printf (01010h)
printf("Use count during lock: %d\n", sptr.use_count());// we're still holding the lock, we print use the use count,
// this will be 2.
012E6 mov edx,dword ptr [rbx+8]
012E9 lea rcx,[string "Use count during lock: %d\n"]
012F0 call printf (01010h)
}
// now we go back to the main flow
// this is where the lock value well be torn down as we saw
012F5 jmp weakPtrFunction+8Eh (0123Eh)

Phew, that one was a bit tricky! The weak to strong upgrade is one of the most expensive patterns because there are repeated interlocks and many tests to handle all the boundary cases. Not for the faint of heart.

We can take quick look at main, it’s dead simple. It’s basically just calling the functions we defined, in order, the only magic thing it has to do is create room for the home storage.

int main() {
// create home storage and 8 extra bytes? Why 8 extra? I dunno...
01340 sub rsp,28h
sharedPtrFunction();
01344 call sharedPtrFunction (01080h)
uniquePtrFunction();
01349 call uniquePtrFunction (01170h)
weakPtrFunction();
0134E call weakPtrFunction (011B0h)
oldSchoolFunction();
01353 call oldSchoolFunction (01300h)
return 0;
// get a zero in eax for our return
01358 xor eax,eax
0135A add rsp,28h
0136E ret
}

At this point we’re done with the main program but I went into “overtime” a little and added some pretty simple sequences to see costs more directly. I’ll show the source for each one. I want to see the cost of the call and the cost of the function in question in most cases. Note that most of these disassemblies were harvested in different runs at this point so the addresses are not consistent between functions, but the are consistent within a single snip. If the code offsets look weird, that’s why.

The most important thing was I wanted to see what happens when you pass a const shared_ptr<int> & so I wrote this:

void printStuffRef(const shared_ptr<int> & p) {
printf("value:%d, count:%d\n", *p, p.use_count());
}

A suitable pointer was created and stored in the p1 variable, then we made this call:

printStuffRef(p1);
0149B lea rcx,[p1]
014A0 call printStuffRef

The call is super simple, literally getting the address of the p1 variable and passing it as the first argument. This is exactly what you would expect. Let's look at how printStuffRef develops.

void printStuffRef(const shared_ptr<int> & p) {
printf("value:%d, count:%d\n", *p, p.use_count());
// rcx points to shared pointer, which points to the control block
// and the object, we get the control block pointer and store in rax
011D0 mov rax,qword ptr [rcx+8]
011D4 test rax,rax
// we're checking for a null control block, if it's null we'll use 0
// for the use_count() rather than looking in the control block
011D7 je printStuffRef+1Eh (011EEh)
// r8 gets the count (r8d is the 32 bit version of r8)
011D9 mov r8d,dword ptr [rax+8]
// rax gets the int pointer, this will be used to load rdx for arg2
011DD mov rax,qword ptr [rcx]
// get the format string
011E0 lea rcx,[string "value:%d, count:%d\n"]
// and here we get arg2, reading the int value
011E7 mov edx,dword ptr [rax]
// tail call to printf having used the actual stored use_count
011E9 jmp printf (01010h)
// alterate choice, the control block is nil and use zero for the count
// the code is otherwise the same.
// r8d gets zero, all else is the same.
011EE xor r8d,r8d
011F1 mov rax,qword ptr [rcx]
011F4 lea rcx,[string "value:%d, count:%d\n"]
011FB mov edx,dword ptr [rax]
011FD jmp printf (01010h)

Because the use_count() call added so much junk due to its control block test, I made another example that doesn’t print the use count so I could more accurately see the const ref pattern.

void printValueOnlyRef(const shared_ptr<int>& p) {
printf("value:%d\n", *p);
}

The call pattern is identical, get the address of p1 and then call.

printValueOnlyRef(p1);
014ED lea rcx,[p1]
014F2 call printValueOnlyRef (01290h)
void printValueOnlyRef(const shared_ptr<int>& p) {  
printf("value:%d\n", *p);
// normal access is much simpler, no funky control block checks
// just get the int and go
// get the int pointer into rax (remember we have a ref)
00001290 mov rax,qword ptr [rcx]
// format string in rcx
00001293 lea rcx,[string "value:%d\n" (033A8h)]
// get the int in rdx
0000129A mov edx,dword ptr [rax]
// tail call
0000129C jmp printf (01010h)

This code is really quite good. Any cost is going to be associate with possible non-locality. The first read into rax might miss because [rcx] could be uncached. Likewise, the second read might miss because [rax] could be uncached.

The next thing I wanted to was compare the cost of calling basically the same function by value. This is of course not what you’re supposed to do but we should look at how bad it is. It’s the same it’s just missing the &.

void printStuffVal(const shared_ptr<int> p) {
printf("value:%d, count:%d\n", *p, p.use_count());
}

Now looking at the call site, already we see cost.

printStuffVal(p1);
// basically we're about to construct a temp shared pointer and pass
// it by ref. Now rsp+38 is p1's control word, if it's null,
// we don't haveto do anything (it won't be null in our case).
// If it's not null we have to upcount the shared_ref count because
// of the temp. If it's a null pointer we don't upcount it.
// We don't/can't/wouldn't-want-to keep
// counts on nulls.
014A5 mov rbx,qword ptr [rsp+38h]
014AA test rbx,rbx
014AD je main+58h (014B8h)
// ok normal case, upcount the strong count
014AF lock inc dword ptr [rbx+8]
// reload rbx (seems redundant)
014B3 mov rbx,qword ptr [rsp+38h]
// copying the two pointers from p1 into xmm0
014B8 movaps xmm0,xmmword ptr [p1]
// get address of temp shared pointer, we need this for the call too
014BD lea rcx,[rsp+20h]
// store the copyed pointers into the temp shared_ptr
// apparently rsp+offset is as good as rcx with no offset as target
014C2 movdqa xmmword ptr [rsp+20h],xmm0
014C8 call printStuffVal (01210h)
// there is no post call cleanup
// the callee cleans up the temp object

That was just the call! Let’s look at the callee side. Recall:

void printStuffVal(const shared_ptr<int> p) {
printf("value:%d, count:%d\n", *p, p.use_count());
}
void printStuffVal(const shared_ptr<int> p) {
//
// normal frame setup
01210 push rbx
01212 sub rsp,20h
printf("value:%d, count:%d\n", *p, p.use_count());
// The first part is logically the same as the ref case because
// as before we have a shared_ptr ref in rcx. The code will
// be the same including the control block guard as before.
// In this case we share the call to printf because there is
// no benefit in creating to tail calls. The difference is
// after the call to printf.
01216 mov rax,qword ptr [rcx+8]
0121A mov rbx,rcx
0121D test rax,rax
01220 je printStuffVal+18h (01228h)
// get the use count, safe to use
01222 mov r8d,dword ptr [rax+8]
01226 jmp printStuffVal+1Bh (0122Bh)
// use zero for the count if there is no control block!
01228 xor r8d,r8d
0122B mov rax,qword ptr [rcx]
0122E lea rcx,[string "value:%d, count:%d\n"]
01235 mov edx,dword ptr [rax]
01237 call printf (01010h)
}
// Now we have to do cleanup! First check for a control block
// A null control block doesn't need destruction, so we skip all
// this and just exit the function in that case.
0123C mov rbx,qword ptr [rbx+8]
01240 test rbx,rbx
01243 je printStuffVal+6Bh (0127Bh)
// if control block is present, downcount as usual
01245 mov qword ptr [rsp+30h],rdi
0124A mov edi,0FFFFFFFFh
0124F mov eax,edi
01251 lock xadd dword ptr [rbx+8],eax
01256 cmp eax,1
01259 jne printStuffVal+66h (01276h)
// main destruction if count 1
0125B mov rax,qword ptr [rbx]
0125E mov rcx,rbx
01261 call qword ptr [rax]
// optional weak pointer decrement all strongs share one weak count
01263 lock xadd dword ptr [rbx+0Ch],edi
01268 cmp edi,1
0126B jne printStuffVal+66h (01276h)
// optional control block destruction if weak count reached zero
0126D mov rax,qword ptr [rbx]
01270 mov rcx,rbx
01273 call qword ptr [rax+8]
// restore rdi, rbx, and stack... we're done
01276 mov rdi,qword ptr [rsp+30h]
0127B add rsp,20h
0127F pop rbx
01280 ret

That is a LOT of extra code. Definitely use the const ref pattern over the value pattern. The destruction is the usual 50ish bytes and the setup was only a little less than the usual construction cost but it had an interlocked increment so in some sense in was more cost.

Next I wanted to see how effective the move pattern was if you used it to capture a returned shared pointer. I wrote a function that did just that.

shared_ptr<int> makeSharedPtr() {
return make_shared_ptr<int>(20);
}

And then I called it, like so:

auto p1 = makeSharedPtr();

Here’s what you get for this code:

// blast nulls into p1 then make the call,
// note that the return storage goes in rcx just like it's arg1
//(which it is)
//
// 16 bytes of zeroes
01489 xorps xmm0,xmm0
0148C lea rcx,[p1]
// blast ho!
01491 movups xmmword ptr [p1],xmm0
01496 call makeSharedPtr (01170h)

This is pretty good. The call side works perfectly, the control block and storage is allocated in the usual way, the only change from normal is that the storage in rcx is used to hold the data pointer and control block pointer, so it just goes to a different place on the stack.

std::shared_ptr<int> makeShared() {
// normal frame setup
011A0 push rbx
011A2 sub rsp,30h
// stash incoming arg for storage
011A6 mov rbx,rcx
011A9 mov qword ptr [rsp+28h],rcx
return std::make_shared<int>(5);
// same alloc as always
011AE mov ecx,18h
011B3 call operator new
// same blast control block and fill with vtable and 1, 1 counts
011B8 xorps xmm0,xmm0
011BB mov qword ptr [rsp+28h],rax
011C0 lea rcx,[...::`vftable']
011C7 movups xmmword ptr [rax],xmm0
011CA mov qword ptr [rax],rcx
011CD lea rcx,[rax+10h]
011D1 mov dword ptr [rax+8],1
011D8 mov dword ptr [rax+0Ch],1
// store payload 5 in main storage
011DF mov dword ptr [rcx],5
// blast shared pointer out storage (redundant)
011E5 movups xmmword ptr [rbx],xmm0
// zero out the control block pointer (extra redundant)
// it's really zero now, double zero, like so zero.. omg zero that zero!
011E8 mov qword ptr [rbx+8],0
// store control block pointer, (we were only kidding about the zero)
011F0 mov qword ptr [rbx+8],rax
// we'll return the same address we were given
011F4 mov rax,rbx
// store the data pointer before we exit completing the storage
011F7 mov qword ptr [rbx],rcx
}
011FA add rsp,30h
011FE pop rbx
011FF ret

Other than some redundant null stores it’s pretty perfect. So the move pattern is really working.

There are two more patterns I wanted to look at. The first is an anti-pattern, what if I try to make a shared pointer using a byref arg old school, like take the address of the pointer. Is this really bad? It turns out it is.

The call side actually looks pretty good:

makeSharedPtrByRef(&p1);
// this looks just like the const ref call, how bad could it be?
// cover your eyes...
01587 lea rcx,[p1]
0158C call makeSharedPtrByRef (011D0h)
01591 nop

Here is the beast, never do this… we have to do nearly-general smart pointer management to get this right. The problem is, we can’t assume anything about the out arg in this context. It might be null, it might not. So we may or may not need to downcount before we store. The rest is basically the same.

void makeSharedPtrByRef(shared_ptr<int> *p) {
*p = make_shared<int>(50);
}
void makeSharedPtrByRef(shared_ptr<int> *p) {
// stash rbx in home storage, construct a frame
011D0 mov qword ptr [rsp+10h],rbx
011D5 push rdi
011D6 sub rsp,30h
// save the target pointer in rbx, very similar to the move case
011DA mov rbx,rcx
// no move semantics, it might have to downcount... total cost to initialize
// a trivial shared_ptr is 66 bytes including the alloc call
*p = make_shared<int>(50);
011DD mov ecx,18h
011E2 call operator new (016D0h)
// vtable for the block in rcx
011E7 48 8D 0D AA 22 00 00 lea rcx,[_Ref_count_obj2<int>::`vftable']
// store the allocated object, the control block pointer
011EE mov qword ptr [rsp+20h],rax
// clear out the control block storage, this parts is 44 bytes
011F3 xorps xmm0,xmm0
011F6 movups xmmword ptr [rax],xmm0
// rcx has the vtable pointer
011F9 mov qword ptr [rax],rcx
// rcx now pointer to the int storage
011FC lea rcx,[rax+10h]
// store the 50 into the int storage
01200 mov dword ptr [rcx],32h
// and now the strong
01206 C7 40 08 01 00 00 00 mov dword ptr [rax+8],1
// and weak count
0120D C7 40 0C 01 00 00 00 mov dword ptr [rax+0Ch],1
// now we have to store the pointers into the smart pointer
// first stash the existing control block into rdi
01214 mov rdi,qword ptr [rbx+8]
01218 mov qword ptr [rbx],rcx // stored data
0121B mov qword ptr [rbx+8],rax // control block
// now we have the OLD control block pointer in rdi
// if there was nothing there we don't have to do anything
0121F test rdi,rdi
01222 je (01250h)
// otherwise we do the usual downcounts strong and optionally weak
// this is a 46 byte sequence including test above
01224 mov ebx,0FFFFFFFFh
01229 mov eax,ebx
0122B lock xadd dword ptr [rdi+8],eax
01230 cmp eax,1
01233 jne (01250h)
// destroy object if strong count to zero
01235 mov rax,qword ptr [rdi]
01238 mov rcx,rdi
0123B call qword ptr [rax]
0123D lock xadd dword ptr [rdi+0Ch],ebx
01242 cmp ebx,1
01245 jne (01250h)
// destroy control block if weak count went to zero
01247 mov rax,qword ptr [rdi]
0124A mov rcx,rdi
0124D call qword ptr [rax+8]
}
// we're done... restore the registers and the frame
01250 mov rbx,qword ptr [rsp+48h]
01255 add rsp,30h
01259 pop rdi
0125A ret

That was a lot of work, but we only got a taste of the general pointer assignment. This will be our final overtime sample. I made a global shared pointer pq and we assign to it from p1. Now in general when you do an assignment you have to:

  • upcount the right hand side if it’s not null
  • downcount the left hand side if it’s not null
  • finally store the right into the left

So that’s a lot of checks… let’s look at one:

// 45 bytes to do the upcount...  50 for the downcount?
// So a pointer assignment is ... 95 bytes... give or take
// register fluke? Plus 2 or 3 interlocked operations.
pq = p1;
// start by checking if p1 is null, if null don't upcount it
05C6 mov rbx,qword ptr [rsp+38h]
05CB test rbx,rbx
05CE je main+99h (05D9h)
// bump the count strong count
05D0 lock inc dword ptr [rbx+8]
// save the old control block, for pq, we may need to downcount it
05D4 mov rbx,qword ptr [rsp+38h]
// slam the shared pointer values into the target pq
05D9 mov rax,qword ptr [p1]
05DE mov qword ptr [pq],rax
05E5 mov rdi,qword ptr [pq+8h]
05EC mov qword ptr [pq+8h],rbx
// the downcount pattern in this case is 50 bytes, it varies
// depending on what registers are available, it was 46 bytes above.
// this is the usual stuff, get -1 in esi and then use it to
// downcount
05F3 mov esi,0FFFFFFFFh
// we don't do any of this if the pointer was null, we don't
// downcount nulls
05F8 test rdi,rdi
05FB je main+0EBh (062Bh)
// once we're here the downcount is happening...
05FD mov eax,esi
05FF lock xadd dword ptr [rdi+8],eax
0604 cmp eax,1
0607 jne main+0E6h (0626h)
0609 mov rax,qword ptr [rdi]
060C mov rcx,rdi
060F call qword ptr [rax]
0611 mov eax,esi
0613 lock xadd dword ptr [rdi+0Ch],eax
0618 cmp eax,1
061B jne main+0E6h (0626h)
061D mov rax,qword ptr [rdi]
0620 mov rcx,rdi
0623 call qword ptr [rax+8]

Wow, so that’s a lot of code. 95 bytes maybe. And there were likely to be 2 interlocked operations and if we’re unlucky it’ll be three. Those operations, well they can be fast, maybe as low as 5ns if the relevant machine words are already in l1 cache and exclusive but if there’s a cache miss that could be maybe 100ns or so, maybe 250 instructions? Not cheap. We might get 3 of those… Of course if we take a page fault then it all goes totally sideways but just the cache misses can increase the cost from 100ish instructions to 500ish instructions.

Similarly, that weak pointer to strong pointer business is not cheap. We have to do repeated trial upgrades using interlocked compare exchange. That’s the classic LOAD and CAS (check and store) operation pattern. We could miss the cache on the initial load and the any number of times on the CAS loop. But hopefully not. Then we still have 1 or 2 interlocked decrements to clean up temporary strong pointer. This is another 100+ byte pattern. It looks small, but it isn’t.

Another thing this really gets you thinking about this this kind of pattern…

class baz {
shared_ptr<foo> p1; // strong possibility of false sharing...
shared_ptr<bar> p2;
// p3, p4, ... I've seen cases with half a dozen of these...
};

When you see this business you have to ask yourself, do p1 and p2 really have totally independent lifetime in the context of a baz? Because probably they don't and if those were unique_ptr types whose lifetime is control by baz (which maybe gets its own shared_ptr) you'll get a lot more economy. Internal unique pointers give you automatic cleanup for basically no extra cost (compared to manual).

What really can kill you though is if there’s threading going on. Having contiguous pointers like that means likely the allocations are also contiguous, that sounds good until you get any contention. Then you’ll have multiple control blocks on the same cache line with different threads contending on them. That’s a nightmare of cache misses…

Remember if p1 and p2 are allocated together then their data is more likely to end up nearby on the heap (but maybe it still won’t).

You could save yourself a lot of aggrevation if p1 and p2 were unique pointers and only the outer object had ref counts.

And here we end the teardown.

The main conclusions are:

  • unique_ptr is pretty optimal
  • shared_ptr is about as good as it could be, there seems to be room for a few minor optimizations but nothing crazy, however “as good as it could be” is not “cheap” — the pattern to upcount a shared pointer is about 45 bytes and a downcount is about 50 bytes. Upcount has one locked operation, downcount has 2, only one of which usually runs except for the last downcount which does 2. Each locked operation can take the equivalent of several hundred instructions if it doesn’t already have exclusive ownership of the cache line that the control block is on
  • weak pointer has a single locked upcount for each create and a single downcount for each destruction; however upgrading from weak to strong requires a complex lock cmpxchg loop (this isn’t unexpected) and this can be very costly indeed if the shared pointer is in fact shared
  • false sharing can be a real issue, so putting shared_ptrs close to each other is usually a bad idea

Be careful out there, these things have sharp edges.

Postscript:

Redditor ReDucTor points out:

// for reasons I do not understand rdi didn’t get stored in the home
// storage in the remaining 64-bit slot. Like many other cases we
// push it old-school. Maybe that’s smaller code, I dunno.

// rsp + 48 -> “home storage” slot 3 rdi could have gone here?

The extra 8 bytes are for alignment, the MS ABI defines that the stack will always be 16-byte aligned so the 28h + 8h for return address on call instructions, this ensures when the call occurs its frame also starts on the 16-byte alignment.

This alignment requirement might also be why you see things like in the weakPtrFunction the rdi being pushed onto the stack instead of saved getting saved into a home slot, it already needed to have that extra alignment bit of alignment so why not push it on which is likely smaller encoding then doing a mov into a home storage.

--

--

Rico Mariani
Rico Mariani

Written by Rico Mariani

I’m an Architect at Microsoft; I specialize in software performance engineering and programming tools.