std::pointer types — a tear down and discussion

31 min readApr 5, 2023

I was wanting to get a more complete understanding of exactly what code gets generated when we compile some typical smart pointer patterns in C++ so I created a sample app that did some of the things. I then looked at the disassembly and annotated it heavily. Now we should keep in mind that this is just one compiler, MSC, and it's on Windows, so the code would be different on Linux if only because the ABI is different and clang/gcc are likely to make some different choices overall. But I think it's still super interesting to dig into the output at this level.

Below is the program I started with. Later I go into an "overtime" section where I added a few more small functions to look at some patterns. The source for those is included below but they are very small fragments so it isn't much anyway.

Keep in mind that some of what is happening below is done just to ensure inlining doesn't make it all vanish, so there are some non-idiomatic patterns. I built with /O2 (prefer speed) which is likely to make different choices also. And I didn't enable any advanced security features or anything like that.

So YMMV. But here it is.

// Trivia: this code was generated by ChatGPT and lightly edited
#include <memory>

using namespace std;

__declspec(noinline)
void sharedPtrFunction() {
    shared_ptr<int> ptr1 = make_shared<int>(10);
    printf("value of ptr1: %d\n", *ptr1);
    printf("Use count before ptr2: %d\n", ptr1.use_count());

    shared_ptr<int> ptr2 = ptr1;
    printf("value of ptr2: %d\n", *ptr2);
    printf("Use count before exit: %d\n", ptr1.use_count());
}

__declspec(noinline)
void uniquePtrFunction() {
    unique_ptr<int> uptr(new int(20));
    printf("value of uptr: %d\n", *uptr);
}

__declspec(noinline)
void weakPtrFunction() {
    shared_ptr<int> sptr = make_shared<int>(30);
    weak_ptr<int> wptr = sptr;
    printf("Use count before lock: %d\n", sptr.use_count());
    if (auto sptr2 = wptr.lock()) {
        printf("value of wptr: %d\n", *sptr2);
        printf("Use count during lock: %d\n", sptr.use_count());
    }
    else {
        printf("not reached in this test\n");
    }
    printf("Use count after lock: %d\n", sptr.use_count());
}

__declspec(noinline)
void oldSchoolFunction() {
    int* ptr = new int(5);
    printf("value of ptr: %d\n", *ptr);
    delete ptr;
}

int main() {
    sharedPtrFunction();
    uniquePtrFunction();
    weakPtrFunction();
    oldSchoolFunction();
    return 0;
}

Looking at the first function, it does a few things, and we can use it to tear down some typical things that happen in the prolog as well as some shared pointer patterns. It also does some unusual things. The use_count call will cause atypical code patterns to appear in the output. However, that's no big deal, we'll have other cases later that don't have those and the extra code helps to show where things are stored.

Let's review the code:

void sharedPtrFunction() {
    shared_ptr<int> ptr1 = make_shared<int>(10);
    printf("value of ptr1: %d\n", *ptr1);
    printf("Use count before ptr2: %d\n", ptr1.use_count());

    shared_ptr<int> ptr2 = ptr1;
    printf("value of ptr2: %d\n", *ptr2);
    printf("Use count before exit: %d\n", ptr1.use_count());
}

And now the disassembly…

void sharedPtrFunction() {// The x64 Application Binary Interface (ABI) uses a 
// four-register fast-call calling convention by default. 
// Space is allocated on the call stack as a "home storage"
// for callees to save those registers.
// Note: this is Windows ABI only.  Linux is different, it has
// home storage and up to 6 registers used for arguments.
//
// What follows is the frame after the preamble has run.
//
// I have addresses in increasing order going down putting
// the top of the stack at the top. On x64 esp points to the last
// thing that was pushed, not the next free thing. i.e.
//  * push is "*(--rsp) = value"
//  * pop is "result = *(rsp++)";  
//
// rsp - nnn -> anything pushed will go here
// rsp       -> new "home storage" 0  (our callees use this space)
// rsp + 08h -> new "home storage" 1
// rsp + 10h -> new "home storage" 2
// rsp + 18h -> new "home storage" 3
// rsp + 20h -> saved rdi
// rsp + 28h -> the return address  (at entry rsp pointed here)
// rsp + 30h -> "home storage" slot 0 control block address here
// rsp + 38h -> "home storage" slot 1 rbx saved here (was rsp+10h)
// rsp + 40h -> "home storage" slot 2 unused (rdi could be here?)
// rsp + 48h -> "home storage" slot 3 unused (not needed)// use "home storage" to save rbx
01080  mov         qword ptr [rsp+10h],rbx  
// preserve rdi the old school way
01085  push        rdi 
// allocate "home storage" for functions we call
01086  sub         rsp,20h  
    shared_ptr<int> ptr1 = make_shared<int>(10);// allocate room for 3 qwords, for the control block plus the integer
0108A  mov         ecx,18h  
0108F  call        operator new (013F0h)  // stash the smart pointer value in rbx
01094  mov         rbx,rax  // and also stash the control block pointer value on the stack
01097  mov         qword ptr [rsp+30h],rax  // zero xmm0 128 bits of zeros (16 bytes), used shortly
0109C  xorps       xmm0,xmm0      printf("value of ptr1: %d\n", *ptr1);// load the value of the string in pointer rcx
// this will be the first arg to printf
// we're actualy interspersing code for the next statement
// we're still also construting the shared pointer
0109F  lea         rcx,[string "value of ptr1: %d\n" (03320h)]  // get the value of the integer 10 which we plan to store
// put it in edx as arg2, where printf needs it
010A6  mov         edx,0Ah    shared_ptr<int> ptr1 = make_shared<int>(10);// blast two qwords to zeros at rax, the control block for ptr1
010AB  movups      xmmword ptr [rax],xmm0  
// set its strong use count to 1  (32 bits)
010AE  mov         dword ptr [rax+8],1  
// set its weak use count to 1 (32 bits)
010B5  mov         dword ptr [rax+0Ch],1  
// load the address of the control block vtable
010BC  lea         rax,[std::_Ref_count_obj2<int>::`vftable']  // store the vtable address in the control block,
// remember we stored the address of the control block in rbx
// so until the above rax and rbx were the same.
// rbx is preseved across calls so we'll continue to use it later...
010C3  mov         qword ptr [rbx],rax  // finally, store 10, control block offset rbx+10h is the payload
010C6  mov         dword ptr [rbx+10h],0Ah  
    printf("value of ptr1: %d\n", *ptr1);// now we already have rcx and rdx holding the format string and
// the value; we can print
010CD  call        printf (01010h)  // we can get the use count (strong) from the control block pointer
// which is still in rbx. rdx will hold the count, arg2 to printf
    printf("Use count before ptr2: %d\n", ptr1.use_count());
010D2  mov         edx,dword ptr [rbx+8]  // and the format string
010D5  lea         rcx,[string "Use count before ptr2: %d\n"]  // and now we can print the use count (rcx, rdx args as usual)
010DC  call        printf (01010h) // now we're going to make a copy of the pointer    shared_ptr<int> ptr2 = ptr1;// the minimum thing we need to do is interlocked increment the
// strong count in this case we could have optimized out everything
// else pretty much but not this, well I guess this too but then we
// would have to be pretty omniscient about the use count to print
// it correctly.  All the printfs are forcing normal housekeeping.
// Note we didn't actually copy the pointer here, that was optimized
// out.  We could have done that copy with two 64-bit ops or one xmm
// 128 bit op.  But, instead, we don't even bother.
// We're left with this:010E1  lock inc    dword ptr [rbx+8]      printf("value of ptr2: %d\n", *ptr2);// as before we read the value right from rbx+10, 
// we didn't actually use ptr2, the whole thing is gone
010E5  mov         edx,dword ptr [rbx+10h]  
010E8  lea         rcx,[string "value of ptr2: %d\n" (03358h)]  
010EF  call        printf (01010h)      printf("Use count before exit: %d\n", ptr1.use_count());// we're getting the count again, note this is not thread safe...
// there's no memory barrier it's a boring read...
// so really, don't do this :D
// note that we often generate a null check before reading the
// control block to get the use count but in this case we've
// inferred that it can't be null so we're reading the memory
// directly.  The norm is to check for a null control block and
// if null use zero as the count.
010F4  mov         edx,dword ptr [rbx+8]  
010F7  lea         rcx,[string "Use count before exit: %d\n"]  
010FE  call        printf (01010h)  
}// now we have to do the teardown, load -1 into edi and eax
01103  mov         edi,0FFFFFFFFh  
01108  mov         eax,edi  // interlocked exchange add (-1) to atomically downcount the strong count
// after this eax will hold what the count *was* before we changed it
0110A  lock xadd   dword ptr [rbx+8],eax  // if the strong count was not 1, we're done, there are still refs
// if it was 1, then it is now zero, hence time to destruct.
// in our case we will never take this branch because we know
// there is a 2nd shared pointer.  We'll always get a 2 in eax here.
0110F  cmp         eax,1  
01112  jne         sharedPtrFunction+0B1h (01131h)  // load the vtable pointer from the control block
01114  mov         rax,qword ptr [rbx]// recall the layout of the control block:
//   * vtable ptr (64 bits)
//   * strong count (32 bits)
//   * weak count (32 bits)// setup rcx as the "this" argument for the destructor
01117  mov         rcx,rbx  
// call the first thing in the vtable, the dtor
// in our case this is a no-op because int doesn't need destruction
0111A  call        qword ptr [rax]// Now we're going to downcount the weak count, all the strong
// pointers together share one weak count. We need to do this
// because we can't free the control block until both weak and
// strong references are gone. The last strong ref does one weak
// downcount, that means strong refs don't have to upcount both
// counts and only downcount twice rarely.
0111C  mov         eax,edi // -1 
0111E  lock xadd   dword ptr [rbx+0Ch],eax  
01123  cmp         eax,1  
01126  jne         sharedPtrFunction+0B1h (01131h)  // if the weak count was 1, it's now zero and it's time to destroy
01128  mov         rax,qword ptr [rbx]  
0112B  mov         rcx,rbx  
0112E  call        qword ptr [rax+8]  // control block destructor// we have two shared pointers this is the same code for ptr2
// this time we will take the cleanup path
01131  mov         eax,edi  
01133  lock xadd   dword ptr [rbx+8],eax  
01138  cmp         eax,1  // we will never take the branch in this code
0113B  jne         sharedPtrFunction+0D8h (01158h)  // we will for sure call the destructor here
0113D  mov         rax,qword ptr [rbx]  
01140  mov         rcx,rbx  
01143  call        qword ptr [rax]  // destroy the int// now the weak pointer... for sure we will do this
01145  lock xadd   dword ptr [rbx+0Ch],edi  
0114A  cmp         edi,1  // we never take this branch, this is the last weak ref
0114D  jne         sharedPtrFunction+0D8h (01158h)  // this code always runs, here we clean up the control block
0114F  mov         rax,qword ptr [rbx] // get vtable
01152  mov         rcx,rbx             // load "this"
01155  call        qword ptr [rax+8]   // destroy the control block// finally restore the saved rbx and fix the frame pointer
01158  mov         rbx,qword ptr [rsp+38h]  
0115D  add         rsp,20h  
01161  pop         rdi  
01162  ret  --- include\memory 
// destructor for the control block itself, this is a tail call
        delete this;// deleting a null control block pointer is a no-op
01360  test        rcx,rcx  
01363  je          01371h  // we get the vtable pointer from rcx which points to the object
// hence rcx is already set up to be the this pointer for the call
01365  mov         rax,qword ptr [rcx]  // rdx carries the directive to do the dealloc (it's a bool)
01368  mov         edx,1  // this is the virtual destructor for the control block, below as
// std::_Ref_count_obj2<int>::`scalar deleting destructor'
// tail call optimized
0136D  jmp         qword ptr [rax+10h]  
    }
// only use if the we were in the null case
01371  ret // normal destructor for the smart pointer storage type, 
// in this case its int so there is nothing to do
        _Destroy_in_place(_Storage._Value);
    }
01380  ret         0  
// this is the actual control block destructorstd::_Ref_count_obj2<int>::`scalar deleting destructor':
// save rbx, set up a frame (didn't use home storage, wonder why)
01390  push        rbx  
01392  sub         rsp,20h  // we get our vtable pointer directly
01396  lea         rax,[std::_Ref_count_obj2<int>::`vftable']  
0139D  mov         rbx,rcx  // we change the vtable type to this base class,
// note that when destructing vtables unwind to base types!
013A0  mov         qword ptr [rcx],rax  // Now we check to see if a delete was requested (in our case
// it will be) Note that test dl,1 is an AND of dl with 1 and
// sets the flags accordingly, it's not a compare.
// If dl == 1 as we expect then the result of the AND will be 1
// and the Z flag is not set, hence je is not taken.
// we will never take this branch in our sample program
013A3  test        dl,1  
013A6  je          013B2h  // delete the allocated storage (control block and data)
013A8  mov         edx,18h  
013AD  call        operator delete (0142Ch)  // we're going to return "this", it's in rbx, having been moved there
// from rcx, we need it in rax
013B2  mov         rax,rbx  // restore rbx and the frame pointers and we're outta here
013B5  add         rsp,20h  
013B9  pop         rbx  
013BA  ret

In the above we get a good look at a bunch of things. The normal construction pattern for a smart pointer, which is pretty economical, and the teardown, which requires up to two interlocked decrements. The teardown pattern seems to take between 45 and 50 bytes depending on which registers happen to hold the pointer in question. The code pattern is always the same,

// decrements are interlocked
// this is 45 to 50 bytes of code
if (strong_refs-- == 1) { 
    destroy_object(); 
    if (weak_refs-- == 1) {
        destroy_control_block();
    }
}

The unique pointer version of this code is really simple by comparison. Let’s have a look.

Recall the unique pointer version looked like this, and the disassembly follows.

void uniquePtrFunction() {
    unique_ptr<int> uptr(new int(20));
    printf("value of uptr: %d\n", *uptr);
}

void uniquePtrFunction() {
// preserve rbx  and sert up a frame
01170  push        rbx
01172  sub         rsp,20h
    unique_ptr<int> uptr(new int(20));
// allocate 4 bytes, for the int, that's it...
01176  mov         ecx,4  
0117B  call        operator new (013F0h)    printf("value of uptr: %d\n", *uptr);// the compiler has figured out that the int holds 20 so it loads
// it in a register directly
01180  mov         edx,14h  // we're setting format string for printf
01185  lea         rcx,[string "value of uptr: %d\n"]  // before we make the call, we save allocated memory pointer in rbx
// note rbx is preserved across calls (we had to save it too, above)
0118C  mov         rbx,rax  // and we store the value 20 in there
0118F  mov         dword ptr [rax],14h  // now we print, we have the args in rcx, rdx as usual
// we can't tail call becaue we have cleanup to do
01195  call        printf (01010h)  
}// cleanup is simple, we need to free 4 bytes 100% of the time
0119A  mov         edx,4  
0119F  mov         rcx,rbx  // we clean up the stack, restore rbx
011A2  add         rsp,20h  
011A6  pop         rbx  // and tail call to delete rcx (pointer), rdx (length) set up above
011A7  jmp         operator delete (0142Ch)=

The code above was basically optimal for this problem. I guess a truly psychic compiler might have figured out that it didn’t need to do the allocations at all and just printed the correct value but that’s probably asking too much.

Now let’s look at the old school version of all this, by an astonishing coincidence it’s identical except that it stores 5 in the int instead of 20. There is was no overhead in unique_ptr<>, it was the same as doing it manually (as expected). Let’s refresh our memory of the old school function:

void oldSchoolFunction() {
    int* ptr = new int(5);
    printf("value of ptr: %d\n", *ptr);
    delete ptr;
}

Digging in…

void oldSchoolFunction() {
// same frame setup
01300  push        rbx  
01302  sub         rsp,20h// same alloc
    int* ptr = new int(5);
01306  mov         ecx,4  
0130B  call        operator new (013F0h)  // same print (but using 5 not 20)
    printf("value of ptr: %d\n", *ptr);
01310  mov         edx,5  
01315  lea         rcx,[string "value of ptr: %d\n" (03440h)]  // same storing the allocated address in rbx
0131C  mov         rbx,rax  
0131F  mov         dword ptr [rax],5  
01325  call        printf (01010h)  
    delete ptr;
// same cleanup on exit, length in rdx, pointer in rcx
0132A  mov         edx,4  
0132F  mov         rcx,rbx  
}
01332  add         rsp,20h  
01336  pop         rbx  
    delete ptr;// same tail call optimization
01337  jmp         operator delete (0142Ch)

Can’t improve on perfection, right?

Now lets move on the the weak pointer function, it’s more interesting. There is an allocation of a shared pointer, then a weak pointer, then an upgrade to a strong reference, then the upgraded ref is destroyed then the weak reference is destroyed, and, finally, the strong reference is destroyed. It’s a lot of code…

void weakPtrFunction() {
    shared_ptr<int> sptr = make_shared<int>(30);
    weak_ptr<int> wptr = sptr;
    printf("Use count before lock: %d\n", sptr.use_count());
    if (auto sptr2 = wptr.lock()) {
        printf("value of wptr: %d\n", *sptr2);
        printf("Use count during lock: %d\n", sptr.use_count());
    }
    else {
        printf("not reached in this test\n");
    }
    printf("Use count after lock: %d\n", sptr.use_count());
}

Here we go…

void weakPtrFunction() {
// we're gonna set up the frame like we in the strong pointer sample
// and again we'll use our home storage for our own purposes.
// rbx and rsi will be stored in the home space, another slot will
// be used to store the control block we're going to allocate but
// for reasons I do not understand rdi didn't get stored in the home
// storage in the remaining 64-bit slot.  Like many other cases we
// push it old-school. Maybe that's smaller code, I dunno.// after the preamble the frame looks like this:// rsp      -> new "home storage" 0 our callees get this space
// rsp + 8  -> new "home storage" 1
// rsp + 10 -> new "home storage" 2
// rsp + 18 -> new "home storage" 3
// rsp + 20 -> saved rdi
// rsp + 28 -> the return address
// rsp + 30 -> "home storage" slot 0 control block ptr saved here
// rsp + 38 -> "home storage" slot 1 rbx saved here
// rsp + 40 -> "home storage" slot 2 rsi saved here
// rsp + 48 -> "home storage" slot 3 rdi could have gone here?// save rbx and rsi into home storage, push rdi, alloc home storage
// we don't need space for our locals because home storage has space
011B0  mov         qword ptr [rsp+10h],rbx  
011B5  mov         qword ptr [rsp+18h],rsi  
011BA  push        rdi  
011BB  sub         rsp,20h  // "home storage" is now at rsp+30,  previously it was at rsp+8// we want a zero in edi for math later
011BF  xor         edi,edi      shared_ptr<int> sptr = make_shared<int>(30);// I have no idea why we put rdi into the calculation...
// 0+18h is still 18h. Why it was done like this, who knows...
// Maybe the 18h constant is smaller in this encoding.  
// Anyhow alloc the control block like usual.
011C1  lea         ecx,[rdi+18h]  
011C4  call        operator new (013F0h)  // stash the allocated pointer in rbx
011C9  mov         rbx,rax  // save the control block on the stack for the strong pointer
// not that even though nominally there are two pointers in a shared_ptr
// one of them is not stored, it's going to go in rsi below.
011CC  mov         qword ptr [rsp+30h],rax  // zero out xmm0 so we can hammer in 16 nulls later
011D1  xorps       xmm0,xmm0 // save the int pointer in rsi, this is where our int is stored
// this is the second pointer of the shared_ptr
011D4  lea         rsi,[rbx+10h]  // clear the control block and set the strong and weak counts to 1
// then store the vtable pointer, this is standard control block
// construction.
011D8  movups      xmmword ptr [rax],xmm0  
011DB  mov         dword ptr [rax+8],1  
011E2  mov         dword ptr [rax+0Ch],1  
011E9  lea         rax,[std::_Ref_count_obj2<int>::`vftable' (03460h)]  
011F0  mov         qword ptr [rbx],rax  // store the value 30 in the integer
011F3  mov         dword ptr [rsi],1Eh      weak_ptr<int> wptr = sptr;// we have a new weak pointer to construct, increment the weak count
// rbx is the control block, the weak count is at offet +0Ch
011F9  lock inc    dword ptr [rbx+0Ch]      printf("Use count before lock: %d\n", sptr.use_count());// pull the strong use count out of the control block and print it
011FD  mov         edx,dword ptr [rbx+8]  
01200  lea         rcx,[string "Use count before lock: %d\n" (033A8h)]  
01207  call        printf (01010h)  // now we're trying to upgrade from weak to strong
    if (auto sptr2 = wptr.lock()) {// First we check the strong count, with no interlock, if it is
// zero then we cannot lock, the object is already destroyed.
// This will never happen in our code, it is the fast-out path for
// when Elvis has already left the building.0120C  mov         eax,dword ptr [rbx+8]  
0120F  test        eax,eax  
01211  je          weakPtrFunction+82h (01232h) // can't lock// burn space (for alignment reasons?)
01213  nop         dword ptr [rax]  
01217  nop         word ptr [rax+rax]  // now we try to upcount... this is a rendezvous of sorts with
// other threads that may be trying to downcount.
// We have the lock count that we saw in rax, we add 1 to it
// and store it in ecx, this what we want to store.
01220  lea         ecx,[rax+1]  // Now we try to switch ecx into the count but only if the count
// is still equal to eax. note that eax is an implicit argument
// in this opcode.
// Basically, with one ATOMIC instruction we do all of this magic!
//
//    if ([rbx+8] == eax) { 
//       eax preserved
//       [rbx+8] = ecx;   // exchange success
//       set zero flag;   // je will succeed
//    } 
//    else { 
//       eax = [rbx+8];   // no exchange case
//       clear zero flag; // je will not succeed
//    } 
//01223  lock cmpxchg dword ptr [rbx+8],ecx  [eax implied]
01228  je          weakPtrFunction+11Ch (012CCh) // do lock body!// If we didn't take the branch then eax is the value
// that was in [rbx+8]. If it is now zero, we can't lock,
// the object is gone. Otherwise, we try again upcounting from
// the new, updated, value.
0122E  test        eax,eax  
01230  jne         weakPtrFunction+70h (01220h)  // try again// don't take the lock path
    else {
        printf("not reached in this test\n");// this code never runs in our case, it means the weak pointer
// cannot be upgraded in our code we can always get the strong
// pointer and we never race because we only have one thread.
01232  lea         rcx,[string "not reached in this test\n"]  
01239  call        printf (01010h)  
    }// we're done with the lock, the temp strong pointer goes away here
0123E  mov         esi,0FFFFFFFFh  // now if we never actually took the lock and we're on the early
// out path we don't do the cleanup...  recall we put 0 in rdi
// near the top. Once we create the lock, rdi will be non-zero
// as we will see below. In fact rdi points to the control block
// of the temp shared_ptr.
01243  test        rdi,rdi 
// skip lock destruction if lock never acquired
01246  je          weakPtrFunction+0C1h (01271h)// ok now we're going to do the usual downcount operations
// esi has -1 in it so we're adding -1 like usual.
01248  mov         eax,esi  
0124A  lock xadd   dword ptr [rdi+8],eax  
0124F  cmp         eax,1  
01252  jne         weakPtrFunction+0C1h (01271h)// If needed, get the vtable pointer out rdi, which holds the
// control block and call the destructor.
01254  mov         rax,qword ptr [rdi]  
01257  mov         rcx,rdi  
0125A  call        qword ptr [rax]  // Like usual, we now now reduct the weak pointer count, the last
// strong pointer went away. We put the -1 back in eax, 
// do the subtract (add -1) and then test
0125C  mov         eax,esi  
0125E  lock xadd   dword ptr [rdi+0Ch],eax  
01263  cmp         eax,1  
01266  jne         weakPtrFunction+0C1h (01271h)// if that was the last weak pointer (it won't be) then we are going
// to do the control block cleanup
01268  mov         rax,qword ptr [rdi]  
0126B  mov         rcx,rdi  
0126E  call        qword ptr [rax+8] // In any case we're going to print the current strong count
// after the lock has been released. We load it from the control
// block still in rbx even after all this time and get the count
// out of it.
    printf("Use count after lock: %d\n", sptr.use_count());
01271  mov         edx,dword ptr [rbx+8]  
01274  lea         rcx,[string "Use count after lock: %d\n" (03420h)]  
0127B  call        printf (01010h)  
}// Finally it's all done, this is the normal cleanup, now we're
// gonna clean up our weak pointer and then the strong pointer. 
// These are the usual patterns.// Put -1 back in eax yet again, still here in esi
// this is the weak pointer count going down for weak pointer
// destruction. Weak count is at rbx+0Ch
01280  mov         eax,esi  
01282  lock xadd   dword ptr [rbx+0Ch],eax  
01287  cmp         eax,1  
0128A  jne         weakPtrFunction+0E5h (01295h)  // Recall that all the strong pointers share one weak count
// so if the weak count goes to zero it's ALL done. We tear down
// the control block right there. This won't happen in this
// particular code because we have a strong pointer outstanding.
0128C  mov         rax,qword ptr [rbx]  
0128F  mov         rcx,rbx  
01292  call        qword ptr [rax+8]  // Now the strong pointer goes away... first down count the strong
// count and then the weak count as before, -1 still in esi.
01295  mov         eax,esi  
01297  lock xadd   dword ptr [rbx+8],eax  
0129C  cmp         eax,1  
0129F  jne         weakPtrFunction+10Ch (012BCh)  // strong count went to zero, destruct the object (an int in this case)
// this code will run
012A1  mov         rax,qword ptr [rbx]  
012A4  mov         rcx,rbx  
012A7  call        qword ptr [rax]   // int destruction (no-op)// now try the weak count (again)
012A9  lock xadd   dword ptr [rbx+0Ch],esi  
012AE  cmp         esi,1  
012B1  jne         weakPtrFunction+10Ch (012BCh)  // if the weak count went to zero then we nuke the control block
// this always runs in our case
012B3  mov         rax,qword ptr [rbx]  
012B6  mov         rcx,rbx  
012B9  call        qword ptr [rax+8]  // restore rbx, and rsi, fix the frame, restore rdi and we're done
012BC  mov         rbx,qword ptr [rsp+38h]  
012C1  mov         rsi,qword ptr [rsp+40h]  
012C6  add         rsp,20h  
012CA  pop         rdi  
012CB  ret  // the "I got the lock" body was compiled out of line    if (auto sptr2 = wptr.lock()) {
// if we acquire the lock, we store the pointer in rdi,
// otherwise rdi stays null. rdi will serve as a flag that
// we did the upcounts.
012CC  mov         rdi,rbx  // Now here's something weird... rsi has in it a pointer to the
// actual storage for the int but we have this redundant looking
// test to make sure it's not null... if it's null then we pretend
// we didn't get the lock... only we did... sort of.  But in the
// null case we will run the code that says we didn't get the lock
// but we will in fact also run the code that downcounts the
// temporary shared pointer, which is correct.
// We have to do this because at this point we already did the
// upcounts. This is kind of weird: if we're here rsi can't be
// null... I have to assume the codegen is more general for other
// cases where the data pointer might be null.
012CF  test        rsi,rsi  
012D2  je          weakPtrFunction+82h (01232h)          printf("value of wptr: %d\n", *sptr2);// rsi points to the int, not the control block
// we set it up in rdx and the format string in rcx like usual
// and call printf
012D8  mov         edx,dword ptr [rsi]  
012DA  lea         rcx,[string "value of wptr: %d\n"]  
012E1  call        printf (01010h)          printf("Use count during lock: %d\n", sptr.use_count());// we're still holding the lock, we print use the use count, 
// this will be 2.
012E6  mov         edx,dword ptr [rbx+8]  
012E9  lea         rcx,[string "Use count during lock: %d\n"]  
012F0  call        printf (01010h)  
    }// now we go back to the main flow
// this is where the lock value well be torn down as we saw
012F5  jmp         weakPtrFunction+8Eh (0123Eh)

Phew, that one was a bit tricky! The weak to strong upgrade is one of the most expensive patterns because there are repeated interlocks and many tests to handle all the boundary cases. Not for the faint of heart.

We can take quick look at main, it’s dead simple. It’s basically just calling the functions we defined, in order, the only magic thing it has to do is create room for the home storage.

int main() {
// create home storage and 8 extra bytes? Why 8 extra? I dunno...
01340  sub         rsp,28h  
    sharedPtrFunction();
01344  call        sharedPtrFunction (01080h)  
    uniquePtrFunction();
01349  call        uniquePtrFunction (01170h)  
    weakPtrFunction();
0134E  call        weakPtrFunction (011B0h)  
    oldSchoolFunction();
01353  call        oldSchoolFunction (01300h)  
    return 0;
// get a zero in eax for our return
01358  xor         eax,eax  
0135A  add         rsp,28h  
0136E  ret  
}

At this point we’re done with the main program but I went into “overtime” a little and added some pretty simple sequences to see costs more directly. I’ll show the source for each one. I want to see the cost of the call and the cost of the function in question in most cases. Note that most of these disassemblies were harvested in different runs at this point so the addresses are not consistent between functions, but the are consistent within a single snip. If the code offsets look weird, that’s why.

The most important thing was I wanted to see what happens when you pass a const shared_ptr<int> & so I wrote this:

void printStuffRef(const shared_ptr<int> & p) {
    printf("value:%d, count:%d\n", *p, p.use_count());
}

A suitable pointer was created and stored in the p1 variable, then we made this call:

printStuffRef(p1);
0149B lea         rcx,[p1]  
014A0 call        printStuffRef

The call is super simple, literally getting the address of the p1 variable and passing it as the first argument. This is exactly what you would expect. Let's look at how printStuffRef develops.

void printStuffRef(const shared_ptr<int> & p) {
    printf("value:%d, count:%d\n", *p, p.use_count());
// rcx points to shared pointer, which points to the control block
// and the object, we get the control block pointer and store in rax
011D0  mov         rax,qword ptr [rcx+8]  
011D4  test        rax,rax// we're checking for a null control block, if it's null we'll use 0
// for the use_count() rather than looking in the control block
011D7  je          printStuffRef+1Eh (011EEh)  // r8 gets the count (r8d is the 32 bit version of r8)
011D9  mov         r8d,dword ptr [rax+8]  
// rax gets the int pointer, this will be used to load rdx for arg2
011DD  mov         rax,qword ptr [rcx]  
// get the format string
011E0  lea         rcx,[string "value:%d, count:%d\n"]  
// and here we get arg2, reading the int value
011E7  mov         edx,dword ptr [rax]  // tail call to printf having used the actual stored use_count
011E9  jmp         printf (01010h)  // alterate choice, the control block is nil and use zero for the count
// the code is otherwise the same.// r8d gets zero, all else is the same.
011EE  xor         r8d,r8d  
011F1  mov         rax,qword ptr [rcx]  
011F4  lea         rcx,[string "value:%d, count:%d\n"]  
011FB  mov         edx,dword ptr [rax]  
011FD  jmp         printf (01010h)

Because the use_count() call added so much junk due to its control block test, I made another example that doesn’t print the use count so I could more accurately see the const ref pattern.

void printValueOnlyRef(const shared_ptr<int>& p) {
    printf("value:%d\n", *p);
}

The call pattern is identical, get the address of p1 and then call.

printValueOnlyRef(p1);
014ED  lea         rcx,[p1]  
014F2  call        printValueOnlyRef (01290h)

void printValueOnlyRef(const shared_ptr<int>& p) {  
  printf("value:%d\n", *p);// normal access is much simpler, no funky control block checks 
// just get the int and go
// get the int pointer into rax (remember we have a ref)
00001290  mov         rax,qword ptr [rcx]
// format string in rcx
00001293  lea         rcx,[string "value:%d\n" (033A8h)]  
// get the int in rdx
0000129A  mov         edx,dword ptr [rax]
// tail call
0000129C  jmp         printf (01010h)

This code is really quite good. Any cost is going to be associate with possible non-locality. The first read into rax might miss because [rcx] could be uncached. Likewise, the second read might miss because [rax] could be uncached.

The next thing I wanted to was compare the cost of calling basically the same function by value. This is of course not what you’re supposed to do but we should look at how bad it is. It’s the same it’s just missing the &.

void printStuffVal(const shared_ptr<int> p) {
    printf("value:%d, count:%d\n", *p, p.use_count());
}

Now looking at the call site, already we see cost.

printStuffVal(p1);
// basically we're about to construct a temp shared pointer and pass
// it by ref. Now rsp+38 is p1's control word, if it's null, 
// we don't haveto do anything (it won't be null in our case).
// If it's not null we have to upcount the shared_ref count because
// of the temp.  If it's a null pointer we don't upcount it.  
// We don't/can't/wouldn't-want-to keep
// counts on nulls.
014A5  mov         rbx,qword ptr [rsp+38h]  
014AA  test        rbx,rbx  
014AD  je          main+58h (014B8h)
// ok normal case, upcount the strong count
014AF  lock inc    dword ptr [rbx+8] 
// reload rbx (seems redundant)
014B3  mov         rbx,qword ptr [rsp+38h] 
// copying the two pointers from p1 into xmm0
014B8  movaps      xmm0,xmmword ptr [p1]  
// get address of temp shared pointer, we need this for the call too
014BD  lea         rcx,[rsp+20h]    
// store the copyed pointers into the temp shared_ptr
// apparently rsp+offset is as good as rcx with no offset as target
014C2  movdqa      xmmword ptr [rsp+20h],xmm0  
014C8  call        printStuffVal (01210h)  // there is no post call cleanup
// the callee cleans up the temp object

That was just the call! Let’s look at the callee side. Recall:

void printStuffVal(const shared_ptr<int> p) {
    printf("value:%d, count:%d\n", *p, p.use_count());
}

void printStuffVal(const shared_ptr<int> p) {
//
// normal frame setup
01210  push        rbx  
01212  sub         rsp,20h
    printf("value:%d, count:%d\n", *p, p.use_count());
// The first part is logically the same as the ref case because
// as before we have a shared_ptr ref in rcx. The code will
// be the same including the control block guard as before.
// In this case we share the call to printf because there is
// no benefit in creating to tail calls.  The difference is
// after the call to printf.
01216  mov         rax,qword ptr [rcx+8]  
0121A  mov         rbx,rcx  
0121D  test        rax,rax  
01220  je          printStuffVal+18h (01228h)  
// get the use count, safe to use
01222  mov         r8d,dword ptr [rax+8]  
01226  jmp         printStuffVal+1Bh (0122Bh)  
// use zero for the count if there is no control block!
01228  xor         r8d,r8d  
0122B  mov         rax,qword ptr [rcx]  
0122E  lea         rcx,[string "value:%d, count:%d\n"]  
01235  mov         edx,dword ptr [rax]  
01237  call        printf (01010h)  
}
// Now we have to do cleanup!  First check for a control block
// A null control block doesn't need destruction, so we skip all
// this and just exit the function in that case.
0123C  mov         rbx,qword ptr [rbx+8]  
01240  test        rbx,rbx  
01243  je          printStuffVal+6Bh (0127Bh)  
// if control block is present, downcount as usual
01245  mov         qword ptr [rsp+30h],rdi  
0124A  mov         edi,0FFFFFFFFh  
0124F  mov         eax,edi  
01251  lock xadd   dword ptr [rbx+8],eax  
01256  cmp         eax,1  
01259  jne         printStuffVal+66h (01276h)  
// main destruction if count 1
0125B  mov         rax,qword ptr [rbx]  
0125E  mov         rcx,rbx  
01261  call        qword ptr [rax]  
// optional weak pointer decrement all strongs share one weak count
01263  lock xadd   dword ptr [rbx+0Ch],edi  
01268  cmp         edi,1  
0126B  jne         printStuffVal+66h (01276h)  
// optional control block destruction if weak count reached zero
0126D  mov         rax,qword ptr [rbx]  
01270  mov         rcx,rbx  
01273  call        qword ptr [rax+8]  
// restore rdi, rbx, and stack... we're done
01276  mov         rdi,qword ptr [rsp+30h]  
0127B  add         rsp,20h  
0127F  pop         rbx  
01280  ret

That is a LOT of extra code. Definitely use the const ref pattern over the value pattern. The destruction is the usual 50ish bytes and the setup was only a little less than the usual construction cost but it had an interlocked increment so in some sense in was more cost.

Next I wanted to see how effective the move pattern was if you used it to capture a returned shared pointer. I wrote a function that did just that.

shared_ptr<int> makeSharedPtr() {
   return make_shared_ptr<int>(20);
}

And then I called it, like so:

auto p1 = makeSharedPtr();

Here’s what you get for this code:

// blast nulls into p1 then make the call,
// note that the return storage goes in rcx just like it's arg1
//(which it is)
//
// 16 bytes of zeroes
01489  xorps       xmm0,xmm0  
0148C  lea         rcx,[p1]
// blast ho!
01491  movups      xmmword ptr [p1],xmm0  
01496  call        makeSharedPtr (01170h)

This is pretty good. The call side works perfectly, the control block and storage is allocated in the usual way, the only change from normal is that the storage in rcx is used to hold the data pointer and control block pointer, so it just goes to a different place on the stack.

std::shared_ptr<int> makeShared() {
// normal frame setup
011A0  push        rbx  
011A2  sub         rsp,30h
// stash incoming arg for storage
011A6  mov         rbx,rcx  
011A9  mov         qword ptr [rsp+28h],rcx  
    return std::make_shared<int>(5);
// same alloc as always
011AE  mov         ecx,18h  
011B3  call        operator new 
// same blast control block and fill with vtable and 1, 1 counts
011B8  xorps       xmm0,xmm0  
011BB  mov         qword ptr [rsp+28h],rax  
011C0  lea         rcx,[...::`vftable']  
011C7  movups      xmmword ptr [rax],xmm0  
011CA  mov         qword ptr [rax],rcx  
011CD  lea         rcx,[rax+10h]  
011D1  mov         dword ptr [rax+8],1  
011D8  mov         dword ptr [rax+0Ch],1 
// store payload 5 in main storage
011DF  mov         dword ptr [rcx],5  
// blast shared pointer out storage (redundant)
011E5  movups      xmmword ptr [rbx],xmm0  
// zero out the control block pointer (extra redundant)
// it's really zero now, double zero, like so zero.. omg zero that zero!
011E8  mov         qword ptr [rbx+8],0  
// store control block pointer, (we were only kidding about the zero)
011F0  mov         qword ptr [rbx+8],rax
// we'll return the same address we were given
011F4  mov         rax,rbx
// store the data pointer before we exit completing the storage  
011F7  mov         qword ptr [rbx],rcx  
 }
011FA  add         rsp,30h  
011FE  pop         rbx  
011FF  ret

Other than some redundant null stores it’s pretty perfect. So the move pattern is really working.

There are two more patterns I wanted to look at. The first is an anti-pattern, what if I try to make a shared pointer using a byref arg old school, like take the address of the pointer. Is this really bad? It turns out it is.

The call side actually looks pretty good:

makeSharedPtrByRef(&p1);
// this looks just like the const ref call, how bad could it be?
// cover your eyes...
01587  lea         rcx,[p1]  
0158C  call        makeSharedPtrByRef (011D0h)  
01591  nop

Here is the beast, never do this… we have to do nearly-general smart pointer management to get this right. The problem is, we can’t assume anything about the out arg in this context. It might be null, it might not. So we may or may not need to downcount before we store. The rest is basically the same.

void makeSharedPtrByRef(shared_ptr<int> *p) {
    *p = make_shared<int>(50);
}

void makeSharedPtrByRef(shared_ptr<int> *p) {
// stash rbx in home storage, construct a frame
011D0  mov         qword ptr [rsp+10h],rbx  
011D5  push        rdi  
011D6  sub         rsp,30h 
// save the target pointer in rbx, very similar to the move case
011DA  mov         rbx,rcx  
// no move semantics, it might have to downcount... total cost to initialize
// a trivial shared_ptr is 66 bytes including the alloc call
    *p = make_shared<int>(50);
011DD  mov         ecx,18h
011E2  call        operator new (016D0h)  
// vtable for the block in rcx
011E7 48 8D 0D AA 22 00 00 lea         rcx,[_Ref_count_obj2<int>::`vftable']  
// store the allocated object, the control block pointer
011EE  mov         qword ptr [rsp+20h],rax  
// clear out the control block storage, this parts is 44 bytes
011F3  xorps       xmm0,xmm0  
011F6  movups      xmmword ptr [rax],xmm0  
// rcx has the vtable pointer
011F9  mov         qword ptr [rax],rcx
// rcx now pointer to the int storage
011FC  lea         rcx,[rax+10h]
// store the 50 into the int storage
01200  mov         dword ptr [rcx],32h
// and now the strong
01206 C7 40 08 01 00 00 00 mov         dword ptr [rax+8],1
// and weak count
0120D C7 40 0C 01 00 00 00 mov         dword ptr [rax+0Ch],1
// now we have to store the pointers into the smart pointer
// first stash the existing control block into rdi
01214  mov         rdi,qword ptr [rbx+8]  
01218  mov         qword ptr [rbx],rcx  // stored data
0121B  mov         qword ptr [rbx+8],rax // control block
// now we have the OLD control block pointer in rdi
// if there was nothing there we don't have to do anything
0121F  test        rdi,rdi  
01222  je          (01250h)  
// otherwise we do the usual downcounts strong and optionally weak
// this is a 46 byte sequence including test above
01224  mov         ebx,0FFFFFFFFh  
01229  mov         eax,ebx  
0122B  lock xadd   dword ptr [rdi+8],eax  
01230  cmp         eax,1  
01233  jne         (01250h)  
// destroy object if strong count to zero
01235  mov         rax,qword ptr [rdi]  
01238  mov         rcx,rdi  
0123B  call        qword ptr [rax] 
0123D  lock xadd   dword ptr [rdi+0Ch],ebx  
01242  cmp         ebx,1  
01245  jne         (01250h)
// destroy control block if weak count went to zero
01247  mov         rax,qword ptr [rdi]  
0124A  mov         rcx,rdi  
0124D  call        qword ptr [rax+8]
}
// we're done... restore the registers and the frame
01250  mov         rbx,qword ptr [rsp+48h]  
01255  add         rsp,30h  
01259  pop         rdi  
0125A  ret

That was a lot of work, but we only got a taste of the general pointer assignment. This will be our final overtime sample. I made a global shared pointer pq and we assign to it from p1. Now in general when you do an assignment you have to:

upcount the right hand side if it’s not null
downcount the left hand side if it’s not null
finally store the right into the left

So that’s a lot of checks… let’s look at one:

// 45 bytes to do the upcount...  50 for the downcount?
// So a pointer assignment is ... 95 bytes... give or take
// register fluke? Plus 2 or 3 interlocked operations.
    pq = p1;
// start by checking if p1 is null, if null don't upcount it
05C6  mov         rbx,qword ptr [rsp+38h]  
05CB  test        rbx,rbx  
05CE  je          main+99h (05D9h)  
// bump the count strong count
05D0  lock inc    dword ptr [rbx+8]  
// save the old control block, for pq, we may need to downcount it
05D4  mov         rbx,qword ptr [rsp+38h]  
// slam the shared pointer values into the target pq
05D9  mov         rax,qword ptr [p1]  
05DE  mov         qword ptr [pq],rax  
05E5  mov         rdi,qword ptr [pq+8h]  
05EC  mov         qword ptr [pq+8h],rbx  
// the downcount pattern in this case is 50 bytes, it varies
// depending on what registers are available, it was 46 bytes above.
// this is the usual stuff, get -1 in esi and then use it to
// downcount
05F3  mov         esi,0FFFFFFFFh  
// we don't do any of this if the pointer was null, we don't
// downcount nulls
05F8  test        rdi,rdi  
05FB  je          main+0EBh (062Bh)  
// once we're here the downcount is happening...
05FD  mov         eax,esi  
05FF  lock xadd   dword ptr [rdi+8],eax 
0604  cmp         eax,1  
0607  jne         main+0E6h (0626h)  
0609  mov         rax,qword ptr [rdi]  
060C  mov         rcx,rdi  
060F  call        qword ptr [rax]  
0611  mov         eax,esi  
0613  lock xadd   dword ptr [rdi+0Ch],eax  
0618  cmp         eax,1  
061B  jne         main+0E6h (0626h)  
061D  mov         rax,qword ptr [rdi]  
0620  mov         rcx,rdi  
0623  call        qword ptr [rax+8]

Wow, so that’s a lot of code. 95 bytes maybe. And there were likely to be 2 interlocked operations and if we’re unlucky it’ll be three. Those operations, well they can be fast, maybe as low as 5ns if the relevant machine words are already in l1 cache and exclusive but if there’s a cache miss that could be maybe 100ns or so, maybe 250 instructions? Not cheap. We might get 3 of those… Of course if we take a page fault then it all goes totally sideways but just the cache misses can increase the cost from 100ish instructions to 500ish instructions.

Similarly, that weak pointer to strong pointer business is not cheap. We have to do repeated trial upgrades using interlocked compare exchange. That’s the classic LOAD and CAS (check and store) operation pattern. We could miss the cache on the initial load and the any number of times on the CAS loop. But hopefully not. Then we still have 1 or 2 interlocked decrements to clean up temporary strong pointer. This is another 100+ byte pattern. It looks small, but it isn’t.

Another thing this really gets you thinking about this this kind of pattern…

class baz {
   shared_ptr<foo> p1; //  strong possibility of false sharing...
   shared_ptr<bar> p2; 
   // p3, p4, ... I've seen cases with half a dozen of these...
};

When you see this business you have to ask yourself, do p1 and p2 really have totally independent lifetime in the context of a baz? Because probably they don't and if those were unique_ptr types whose lifetime is control by baz (which maybe gets its own shared_ptr) you'll get a lot more economy. Internal unique pointers give you automatic cleanup for basically no extra cost (compared to manual).

What really can kill you though is if there’s threading going on. Having contiguous pointers like that means likely the allocations are also contiguous, that sounds good until you get any contention. Then you’ll have multiple control blocks on the same cache line with different threads contending on them. That’s a nightmare of cache misses…

Remember if p1 and p2 are allocated together then their data is more likely to end up nearby on the heap (but maybe it still won’t).

You could save yourself a lot of aggrevation if p1 and p2 were unique pointers and only the outer object had ref counts.

And here we end the teardown.

The main conclusions are:

unique_ptr is pretty optimal
shared_ptr is about as good as it could be, there seems to be room for a few minor optimizations but nothing crazy, however “as good as it could be” is not “cheap” — the pattern to upcount a shared pointer is about 45 bytes and a downcount is about 50 bytes. Upcount has one locked operation, downcount has 2, only one of which usually runs except for the last downcount which does 2. Each locked operation can take the equivalent of several hundred instructions if it doesn’t already have exclusive ownership of the cache line that the control block is on
weak pointer has a single locked upcount for each create and a single downcount for each destruction; however upgrading from weak to strong requires a complex lock cmpxchg loop (this isn’t unexpected) and this can be very costly indeed if the shared pointer is in fact shared
false sharing can be a real issue, so putting shared_ptrs close to each other is usually a bad idea

Be careful out there, these things have sharp edges.

Postscript:

Redditor ReDucTor points out:

// for reasons I do not understand rdi didn’t get stored in the home
// storage in the remaining 64-bit slot. Like many other cases we
// push it old-school. Maybe that’s smaller code, I dunno.
…
// rsp + 48 -> “home storage” slot 3 rdi could have gone here?
The extra 8 bytes are for alignment, the MS ABI defines that the stack will always be 16-byte aligned so the 28h + 8h for return address on call instructions, this ensures when the call occurs its frame also starts on the 16-byte alignment.
This alignment requirement might also be why you see things like in the weakPtrFunction the rdi being pushed onto the stack instead of saved getting saved into a home slot, it already needed to have that extra alignment bit of alignment so why not push it on which is likely smaller encoding then doing a mov into a home storage.

std::pointer types — a tear down and discussion

Written by Rico Mariani

Responses (3)