std::function teardown and discussion

Apr 6, 2023

This is the third and final (for now) C++ teardown in which we look at the code generation for functors.

Note that I have forced functors to be created here, even in this simple example, so that we can look at the code. The below could of course have been written such that the lambda expressions did not materialize functors but the whole point was to look at the functors, so I didn’t let that happen.

I have two different test cases here, the modern functors approach and the old-school minimalist approach. A minimal functor is basically a function pointer and a void * of your choice that comes back to you. This isn't exactly equivalent because of lifetime management differences but it serves as a baseline of “the least you could possibly do and still kind of call it a functor.” The old-school version is dumb as rocks as you will see.

Also, technically functors are a more general thing than what I'm talking about here; more properly this is a discussion of the std::function functor. When I talk about functors below, I really mean std::function. It’s just less of a mouthful.

For context here is the whole test program including both cases.

#include <cstdio>      // printf
#include <iostream>
#include <functional>

__declspec(noinline)
void call_functor(std::function<void(int)> func) {
    func(1000);
}

__declspec(noinline)
void functors_test() {
    int a = 1;
    int b = 2;
    // I have to have two cases with differing args and side-effects
    // or it folds everything which defeats the purpose of the benchmark
    call_functor(
        [a](int x) 
        { 
            printf("f1 %d\n", a + x);  // can't set breakpoint here
        }
    );
    call_functor(
        [b](int x)
        {
            printf("f2 %d\n", b + x);
        }
    );
}

__declspec(noinline)
void call_func_ptr(void* data, void (*pfn)(void *, int))
{
    return pfn(data, 1000);
}

__declspec(noinline)
void function_ptrs_test()
{
    int a = 1;
    int b = 2;
    // I have to have two cases with differing args and side-effects
    // or it folds everything which defeats the purpose of the benchmark
    struct worker1 {
        int a;
        __declspec(noinline) static void go(void *data, int x)
        {
            // recover context old school
            auto w = (worker1*)data;
            printf("p1 %d\n", w->a + x);
        }
    };
    struct worker2 {
        int b;
        __declspec(noinline) static void go(void *data, int x)
        {
            auto w = (worker2*)data; // recover context old school
            printf("p2 %d\n", w->b + x);
        }
    };
    // this is the captured state
    worker1 w1 = { a };
    worker2 w2 = { b };
    // we do two calls with different arguments to prevent the compiler
    // from inlining invariant args into the body of the function.
    call_func_ptr(&w1, worker1::go);
    call_func_ptr(&w2, worker2::go);
}

int main()
{
    functors_test();
    function_ptrs_test();
}

Let’s begin with the modern pattern. We’ll start by looking at functors_test then call_functor but before we do, a quick refresher on the layout of the std::function functor will help.

If captured state <= 48 bytes it is stored as part of the functor:

  .-----------------------------------.
  | 00h | anonymous class vtable      | <--.
  |-----|-----------------------------|    |
  | 08h | captured vars, <= 48 bytes  |    |
  |-----|-----------------------------|    |
  | 38h | effective "this"            | ---' points to offset 0
  `-----------------------------------'

OR, if the captured vars don't fit, there is an alloc, thus:

  .-------------------------.      .------------------------------.
  | 00h | unused, 56 bytes  |  .-> | 00h | anonymous class vtable |
  |-----|-------------------|  |   |-----|------------------------|
  | 38h | effective "this"  | -'   | 08h | captured variables     |
  `-------------------------'      |     | size > 48 bytes        |
                                   `------------------------------'

In both cases the captured state is an anonymous class with a vtable. You can get the effective this pointer by something like:

// rcx unchanged for small case, or updated for big case
mov rcx, qword ptr [rcx+38h]  // assuming rcx points to the functor

OK that’s plenty of context, let’s have a look at the code we get, keeping in mind that we’ll be seeing the pointer dance above.

void functors_test() {
// stack layout after the frame is created:
// note both functors are stored in the same place!
// rsp + 00h : home storage 0
// rsp + 08h : home storage 1
// rsp + 10h : home storage 2
// rsp + 18h : home storage 3
// rsp + 20h : functor captured state vtable
// rsp + 28h : functor captured "a" or "b" value
// rsp + 58h : functor effective this [points to rsp+20]
// rsp + 60h : alignment qword
// rsp + 68h : return address

01120  sub   rsp,68h
     int a = 1;
     int b = 2;

// I have to have two cases with differing args and side-effects
// in the code below. I have to do this or the compiler inlines
// arguments into function bodies which I don't want.

     call_functor(
// loading the vtable pointer for the anonymous class of the
// functor we are creating from the lambda expression
01124  lea   rax,[std::...<lambda_1>...::`vftable']

// stash the captured state directly in the functor (a = 1)
0112B  mov   dword ptr [rsp+28h],1

// stash the vtable pointer in the functor
01133  mov   qword ptr [rsp+20h],rax

// prepare the functor as the first arg to call_functor
01138  lea   rcx,[rsp+20h]

// stash the functor "this" pointer into the functor storage,
// this is local storage for captured state inside the functor.
// if we had more captured state this would have been the
// result of a malloc
0113D  lea   rax,[rsp+20h]
01142  mov   qword ptr [rsp+58h],rax

// call the method to dispatch the functor
01147  call  call_functor
         [a](int x)
         {
// note that there is no assembly for this printf, as we will
// see it's been inlined into a virtual function of std::function
// this is a debugging catastrophe, you can't put a breakpoint
// on this line and it will never show up in any stack trace
             printf("f1 %d\n", a + x);
         }
     );

     call_functor(
// exactly the same business for the other functor
0114C  lea   rax,[std::...<lambda_2>...::`vftable']
01153  mov   dword ptr [rsp+28h],2
0115B  mov   qword ptr [rsp+20h],rax
01160  lea   rcx,[rsp+20h]
01165  lea   rax,[rsp+20h]
0116A  mov   qword ptr [rsp+58h],rax
0116F  call  call_functor
         [b](int x)
         {
             printf("f2 %d\n", b + x);
         }
     );
 }

// restore the stack and we're outta here
01174  add   rsp,68h
01178  ret

This code is 89 bytes, plus the two vtables, each of which consists of seven 8-byte slots, yielding 112 bytes of vtable. The total cost is 201 bytes.

Now let’s have a look at the function that actually uses a functor, what does that look like?

void call_functor(std::function<void(int)> func) {

// the frame looks like this:
//
// rsp + 00h : home storage 0 for our callees
// rsp + 08h : home storage 1 for our callees
// rsp + 10h : home storage 2 for our callees
// rsp + 18h : home storage 3 for our callees
// rsp + 20h : "func"
// rsp + 28h : temp storage (holds 1000)
// rsp + 30h : security swizzle
// rsp + 38h :
// rsp + 40h : pushed rbx
// rsp + 48h : return address
// rsp + 50h : home storage 0 for us, why isn't "func" here?
// rsp + 58h : home storage 1 for us, why isn't "1000" here?
// rsp + 60h : home storage 2 for us
// rsp + 68h : home storage 3 for us

01070  push  rbx
01072  sub   rsp,40h

// stash the security cookie on the stack
01076  mov   rax,qword ptr [__security_cookie]
0107D  xor   rax,rsp
01080  mov   qword ptr [rsp+30h],rax

// save our incoming arg, this is the pointer to the functor,
// keep in a local variable "func" and also in rbx
01085  mov   rbx,rcx
01088  mov   qword ptr [rsp+20h],rcx

     func(1000);
// stash the arg we will need later, 1000, on the stack
// in temp storage, functors pass all args as an array arg
0108D  mov   dword ptr [rsp+28h],3E8h

// use rcx, which is the pointer to the functor to get the
// offset to the "this" pointer. We have to do this because
// the storage for the capture state might be on the heap
// so [rcx+38h] will either point to the heap or back to
// itself, in our case this rcx+38 will in fact point to rcx
// because our captured state is small in this demo.
01095  mov   rcx,qword ptr [rcx+38h]

// if the this pointer is null something has gone very wrong...
01099  test  rcx,rcx
0109C  jne   call_functor+35h

// if the method is null, we fault right here
0109E  call  qword ptr [__imp_std::_Xbad_function_call]
010A4  int   3

// now we have what you could call a normal object pointer
// back in rcx we load the vtable pointer into rax
010A5  mov   rax,qword ptr [rcx]

// now we reload the address of our arg (1000) from where
// we stashed it into rdx, so we're set to make a two-arg call
010A8  lea   rdx,[rsp+28h]

// the args for the call are going to be the "this" pointer
// for the anonymous class that is the capture in rcx and
// the pointer to the other args in rdx
010AD  call  qword ptr [rax+10h]
010B0  nop
 }

// we're back from the call now we have to destroy the functor,
// we once again fetch the effective this pointer for the
// captured values
010B1  mov   rcx,qword ptr [rbx+38h]
010B5  test  rcx,rcx

// if the saved this is null (it really can't be at this point but ok)
// then we skip the cleanup
010B8  je    call_functor+5Eh

// given an ok this pointer, now we are going to call the
// deallocation code. we compute a bool which is true if the
// stored this is not the same as the functor pointer itself
// this tells us that there is storage to free.  Recall that rbx
// has the incoming functor pointer and rcx is the effective this
// so comparing them tells us if there is an allocation.
010BA  mov   rax,qword ptr [rcx]  // vtable
010BD  cmp   rcx,rbx  // rcx "this", rbx is functor
010C0  setne dl  // dl true if rcx != rbx

// call cleanup from the vtable
010C3  call  qword ptr [rax+20h]

// functor has been destroyed, clobber the "this" pointer in it
010C6  mov   qword ptr [rbx+38h],0

// recompute the security cookie
010CE  mov   rcx,qword ptr [rsp+30h]
010D3  xor   rcx,rsp
010D6  call  __security_check_cookie

// restore the stack and we're done
010DB  add   rsp,40h
010DF  pop   rbx
010E0  ret

So we see the code to call a functor is 113 bytes.

Let’s have a look at the actual lambda bodies now.

Here we see the problem we mentioned above, the body of our function has been inlined into _Do_call in the <functional> header and the debug information somehow isn't right, the line numbers do not cross link to the original source. This is a catastrophe for debugging as I mentioned above. However, it's not a fatal flaw in functor design, it's just a bug.

...\include\functional
_Rx _Do_call(_Types&&... _Args) override { // call wrapped function
return _Invoker_ret<_Rx>::_Call(_Callee, _STD forward<_Types>(_Args)...);

// we take the pointer to our args and stash it in rax
012B0  mov   rax,rdx

// here we recover our captured state, loading it into edx (32 bits)
012B3  mov   edx,dword ptr [rcx+8]

// we load the first arg for printf, the string
012B6  lea   rcx,[string "f1 %d\n"]

// we compute the required sum, adding the captured value edx and
// the argument, we have a pointer to the args in rax
// and of course we do 32 bit math because it's all ints, not int64
012BD  add   edx,dword ptr [rax]

// tail call to printf
012BF  jmp   printf

This code is 19 bytes. The second functor generates exactly the same code with the same inlining problem, so I won’t repeat it. That’s another 19 bytes.

Let’s look at some of the required helpers, we have to emit these so that the vtable can point to them.

This one is our deallocation path:

...\include\functional
this->~_Func_impl_no_alloc();
    if (_Dealloc) {
// entry point for our deallocator
// recall dl tells us if a free is needed
01230  test  dl,dl
// if no alloc, then skip
01232  je    0123E
    _Deallocate<alignof(_Func_impl_no_alloc)>
       (this, sizeof(_Func_impl_no_alloc));

// rcx came in with the this pointer, and the length is 16 bytes,
// do the free. This is never going to run in our case; it seems
// like we could know that we don't need this path at compile time
// but the template doesn't quite figure it out even though it knows
// the alloc size is 16 bytes hence too small to need an alloc.
// If we passed in the size rather than the bool we might be able
// to figure this branch out at compile time.
01234  mov   edx,10h
// tail call to delete
01239  jmp   operator delete // this never runs, our block is small
         }
     }
// small size, normal return
0123E  ret

This is a 14 byte helper.

Next we have a helper that computes the base address of the captured variables:

return _STD addressof(_Callee);
// offset past the vtable pointer and that's it
01240  lea   rax,[rcx+8]  
01244  ret

That’s a thin 5 bytes.

Next we have code for a non-inlined version of the destructor. This is never called… but it’s virtual so it has to be there… Note that it delegates to the same cleanup code as before, which is kind of like its base destructor. It uses the same trick of loading dl with a boolean and calling a dealloc helper. This is all here even though we only captured an int.

std::function<void __cdecl(int)>::~function<void __cdecl(int)>(void):
010F0  push  rbx
010F2  sub   rsp,20h

// stash the incoming this in rbx, a preserved register
010F6  mov   rbx,rcx

// get the effective this pointer
010F9  mov   rcx,qword ptr [rcx+38h]
010FD  test  rcx,rcx

// a null effective this indicates destruction has already happened
// or at least is not needed.  Skip everything.
01100  je    01116

// set up for the comparison and leave the result in dl like before
01102  mov   rax,qword ptr [rcx]
01105  cmp   rcx,rbx
01108  setne dl

// maybe delete the allocated block
0110B  call  qword ptr [rax+20h]

// null out the effective this pointer
0110E  mov   qword ptr [rbx+38h],0

// cleanup the frame
01116  add   rsp,20h
0111A  pop   rbx
0111B  ret

This is 44 bytes.

Finally, this one seems to be a copy constructor. There may be some others that I missed but let’s stop here.

return ::new (_Where) _Func_impl_no_alloc(_Callee);

// put the vtable into the target
012D0  lea   rax,[...:<lambda_1>...::`vftable']
012D7  mov   qword ptr [rdx],rax

// copy the captured int
012DA  mov   eax,dword ptr [rcx+8]
012DD  mov   dword ptr [rdx+8],eax

// return the target
012E0  mov   rax,rdx
012E3  ret

It’s 19 bytes.

Let’s total this up:

functors_test: 89 + 112 = 201
call_functor              113
lambda1                    19
lambda2                    19
conditional dealloc        14
this calc                   5
destructor                 44
copy constructor           19
                        -----
                          434
                        =====

That’s a total of 434 bytes to do two lambda calls. Now, I’m inclined to remove the cost of the lambda bodies because we’re looking at the functor overhead, not lambda-generated code. This is cheating a little, because there is a weird arg convention that affects the lambda codegen, but I think we could argue that it’s 434 - 19 - 19 = 396 bytes of functor stuff, plus any virtual function bodies I missed.

Let’s now look at the “old school” method. This isn’t completely fair but it is a good floor in terms of “what’s the least you could do”. So with that in mind, let’s have a look first at the function_ptrs_test disassembly.

void function_ptrs_test()
{
// frame after it's been set up
// rsp + 00h : home storage 0
// rsp + 08h : home storage 1
// rsp + 10h : home storage 2
// rsp + 18h : home storage 3
// rsp + 20h
// rsp + 28h : security cookie
// rsp + 30h
// rsp + 38h : return address

// set up the frame
01190  sub   rsp,38h

// store the security cookie
01194  mov   rax,qword ptr [__security_cookie]
0119B  xor   rax,rsp
0119E  mov   qword ptr [rsp+28h],rax

// this next section is a lot of ceremony but it makes no code!

     int a = 1;
     int b = 2;

     // I have to have two cases with differing args
     // and side-effects or it folds everything which defeats
     // the purpose of the benchmark

     struct worker1 {
        int a;

        __declspec(noinline) static void go(void *data, int x)
        {
            // recover context old school
            auto w = (worker1*)data;
            printf("p1 %d\n", w->a + x);
        }
     };

     struct worker2 {
         int b;

         __declspec(noinline) static void go(void *data, int x)
         {
             auto w = (worker2*)data; // recover context old school
             printf("p2 %d\n", w->b + x);
         }
     };

     // this is the captured state
     worker1 w1 = { a };
     worker2 w2 = { b };

     call_func_ptr(&w1, worker1::go);

// load the address of worker1::go into rdx, that's the 2nd arg
011A3  lea   rdx,[`function_ptrs_test'::`2'::worker1::go]

// prepare w1
011AA  mov   dword ptr [w1],1

// get the address of w1 this will be our void*
011B2  lea   rcx,[w1]

// pre load 2 (overlapped) for the next call
011B7  mov   dword ptr [w2],2

// dispatch first function pointer
011BF  call  call_func_ptr

     call_func_ptr(&w2, worker2::go);

// load worker2::go into rdx
011C4  lea   rdx,[`function_ptrs_test'::`2'::worker2::go]

// get the address of w2 for the second call
011CB  lea   rcx,[w2]

// dispatch the second call
011D0  call  call_func_ptr
 }

// test security cookie ok exit, note that no destructors were
// required and none were emitted
011D5  mov   rcx,qword ptr [rsp+28h]
011DA  xor   rcx,rsp
011DD  call  __security_check_cookie
// cleanup the frame and we're done
011E2  add   rsp,38h
011E6  ret

This is 86 bytes of code. No vtables. 28 bytes were security cookie management (same as the other).

Now let’s look at the pointer call. This will also be very simple:

void call_func_ptr(void* data, void (*pfn)(void *, int))
{
// move the function pointer we need to call into rax
01180  mov   rax,rdx

     return pfn(data, 1000);

// move the 1000 into edx, arg2
// arg1 is already good to go and in rcx, we do nothing
01183  mov   edx,3E8h

// tail call, args in rcx and rdx as usual
01188  jmp   rax

The old school functor caller is a thin 10 bytes.

Next, we look at both functions… note that they have good source info, and they will have good symbol names in stacks.

// recover context old school
auto w = (worker1*)data;
printf("p1 %d\n", w->a + x);

// do the addition w->a + x in one op
011F0  add   edx,dword ptr [rcx]

// get the format string
011F2  lea   rcx,[string "p1 %d\n"]

// tail call
011F9  jmp   printf

Both functions are the same, and they are 13 bytes each. The only difference in the generated code is which string literal they use.

That’s all there is…

Here’s the final tally:

function_ptrs_test         86
call_func_ptr              10
worker1::go                13
worker2::go                13
                        -----
                          122
                        =====

Now let’s compare the old school cost with modern C++. We exclude the bodies of the callbacks from both, even though the C++ one is a bit worse, because they can be arbitrarily big.

Then we get

Modern C++: 434 - 19 - 19 = 396 bytes.

Old school: 122 - 13 - 13 = 96 bytes.

That’s a difference of 300 bytes, or a factor of 4.125. This factor is basically the cost of the setup and teardown that comes with even a simple functor pattern.

For comparison, to get a 4.1x slowdown you need to compare these processors (single threaded). So, you’re downgrading your core to a 2010 processor. And I think a 4.1x growth in size resulting in a 4.1x slowdown in speed is actually generous, it’s probably a lot worse with all those non-local calls.

2023: Intel Core i9-13900KS           threadmark: 4794
2010: Intel Pentium G6951 @ 2.80GHz   threadmark: 1173

If we were to discount the fixed overhead for the security cookie (28 bytes each) it ends up as 368 vs. 68 or a factor of 5.4.

2008: Intel Pentium E2220 @ 2.40GHz   threadmark: 895

Yeah, that’s not nothing.

Written by Rico Mariani
