std::function teardown and discussion
This is the third and final (for now) C++ teardown in which we look at the code generation for functors.
Note that I have forced functors to be created here, even in this simple example, so that we can look at the code. The below could of course have been written such that the lambda expressions did not materialize functors but the whole point was to look at the functors, so I didn’t let that happen.
I have two different test cases here, the modern functors approach and the old-school minimalist approach. A minimal functor is basically a function pointer and a void *
of your choice that comes back to you. This isn't exactly equivalent because of lifetime management differences but it serves as a baseline of “the least you could possibly do and still kind of call it a functor.” The old-school version is dumb as rocks as you will see.
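To make the "function pointer and a void *" idea concrete, here is a minimal sketch. The names MinimalFunctor, invoke, AddState and add_and_record are made up for illustration and are not part of the test program; the caller owns the state, which is exactly the lifetime difference mentioned above.

```cpp
// Hypothetical illustration: the smallest thing you could still call a functor.
struct MinimalFunctor {
    void (*pfn)(void*, int);  // the code to run
    void* data;               // caller-owned state, handed back on every call
};

// Invoke the callback, passing its context pointer back to it.
inline void invoke(const MinimalFunctor& f, int x) { f.pfn(f.data, x); }

// Example "captured state" plus a slot to record the result.
struct AddState {
    int a;
    int last;
};

// The callee recovers its context by casting the void* back.
inline void add_and_record(void* data, int x) {
    auto* s = static_cast<AddState*>(data);
    s->last = s->a + x;
}
```

Nothing here allocates, copies, or destroys anything: if the state goes out of scope before the call happens, that is the caller's problem.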
Also, technically functors are a more general thing than what I'm talking about here; more properly this is a discussion of the std::function functor. When I talk about functors below, I really mean std::function; it’s just less of a mouthful.
For context here is the whole test program including both cases.
#include <cstdio> // printf
#include <iostream>
#include <functional>
__declspec(noinline)
void call_functor(std::function<void(int)> func) {
func(1000);
}
__declspec(noinline)
void functors_test() {
int a = 1;
int b = 2;
// I have to have two cases with differing args and side-effects
// or it folds everything which defeats the purpose of the benchmark
call_functor(
[a](int x)
{
printf("f1 %d\n", a + x); // can't set breakpoint here
}
);
call_functor(
[b](int x)
{
printf("f2 %d\n", b + x);
}
);
}
__declspec(noinline)
void call_func_ptr(void* data, void (*pfn)(void *, int))
{
return pfn(data, 1000);
}
__declspec(noinline)
void function_ptrs_test()
{
int a = 1;
int b = 2;
// I have to have two cases with differing args and side-effects
// or it folds everything which defeats the purpose of the benchmark
struct worker1 {
int a;
__declspec(noinline) static void go(void *data, int x)
{
// recover context old school
auto w = (worker1*)data;
printf("p1 %d\n", w->a + x);
}
};
struct worker2 {
int b;
__declspec(noinline) static void go(void *data, int x)
{
auto w = (worker2*)data; // recover context old school
printf("p2 %d\n", w->b + x);
}
};
// this is the captured state
worker1 w1 = { a };
worker2 w2 = { b };
// we do two calls with different arguments to prevent the compiler
// from inlining invariant args into the body of the function.
call_func_ptr(&w1, worker1::go);
call_func_ptr(&w2, worker2::go);
}
int main()
{
functors_test();
function_ptrs_test();
}
Let’s begin with the modern pattern. We’ll start by looking at functors_test
then call_functor
but before we do, a quick refresher on the layout of the std::function
functor will help.
If captured state <= 48 bytes it is stored as part of the functor

.-----------------------------------.
| 00h | anonymous class vtable      | <--.
|-----|-----------------------------|    |
| 08h | captured vars, <= 48 bytes  |    |
|-----|-----------------------------|    |
| 38h | effective "this"            | ---'  points to offset 0
`-----------------------------------'

OR, if the captured vars don't fit, there is an alloc, thus:

.-------------------------.      .------------------------------.
| 00h | unused, 56 bytes  |  .-> | 00h | anonymous class vtable |
|-----|-------------------|  |   |-----|------------------------|
| 38h | effective "this"  | -'   | 08h | captured variables     |
`-------------------------'      |     | size > 48 bytes        |
                                 `------------------------------'
In both cases the captured state is an anonymous class with a vtable. You can get the effective this pointer by something like:
// rcx unchanged for small case, or updated for big case
mov rcx, qword ptr [rcx+38h] // assuming rcx points to the functor
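The layout can be modelled with a toy class. This is only a sketch under assumptions: ToyFunction, CallableBase and CallableImpl are hypothetical names, the 56-byte buffer mirrors the diagram (vtable pointer plus 48 bytes of captures), and the real std::function implementation is considerably more involved. It also leans on the wrapped object's base sitting at offset 0, which holds for single inheritance on the mainstream ABIs.

```cpp
#include <cstddef>
#include <new>
#include <utility>

// Toy model only; not the real <functional> implementation.
struct CallableBase {
    virtual void call(int x) = 0;
    virtual ~CallableBase() = default;
};

template <class F>
struct CallableImpl final : CallableBase {
    F f;  // the captured state lives right after the vtable pointer
    explicit CallableImpl(F fn) : f(std::move(fn)) {}
    void call(int x) override { f(x); }
};

class ToyFunction {
    static constexpr std::size_t kInline = 56;  // assumed: vtable ptr + 48 bytes
    alignas(std::max_align_t) unsigned char buf_[kInline];
    CallableBase* self_ = nullptr;              // the effective "this"

public:
    template <class F>
    explicit ToyFunction(F fn) {
        using Impl = CallableImpl<F>;
        if constexpr (sizeof(Impl) <= kInline &&
                      alignof(Impl) <= alignof(std::max_align_t)) {
            // small case: construct in place, self_ points back into buf_
            self_ = ::new (static_cast<void*>(buf_)) Impl(std::move(fn));
        } else {
            // big case: buf_ goes unused, self_ points at a heap block
            self_ = new Impl(std::move(fn));
        }
    }
    ToyFunction(const ToyFunction&) = delete;
    ToyFunction& operator=(const ToyFunction&) = delete;

    ~ToyFunction() {
        if (!self_) return;
        if (is_inline()) self_->~CallableBase();  // destroy in place, no free
        else delete self_;                        // destroy and free
    }

    // Same pointer comparison the disassembly performs with cmp/setne.
    bool is_inline() const {
        return static_cast<const void*>(self_) == static_cast<const void*>(buf_);
    }

    void operator()(int x) { self_->call(x); }
};
```

A lambda capturing one int lands in the inline buffer, while one capturing a 64-byte array falls over to the heap path; is_inline() reports which one you got.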
OK that’s plenty of context, let’s have a look at the code we get, keeping in mind we’ll be seeing the pointer dance above.
void functors_test() {
// stack layout after the frame is created:
// note both functors are stored in the same place!
// rsp + 00h : home storage 0
// rsp + 08h : home storage 1
// rsp + 10h : home storage 2
// rsp + 18h : home storage 3
// rsp + 20h : functor captured state vtable
// rsp + 28h : functor captured "a" or "b" value
// rsp + 58h : functor effective this [points to rsp+20]
// rsp + 60h : alignment qword
// rsp + 68h : return address
01120 sub rsp,68h

int a = 1;
int b = 2;

// I have to have two cases with differing args and side-effects
// in the code below. I have to do this or the compiler inlines
// arguments into function bodies which I don't want.

call_functor(
// loading the vtable pointer for the anonymous class of the
// functor we are creating from the lambda expression
01124 lea rax,[std::...<lambda_1>...::`vftable']

// stash the captured state directly in the functor (a = 1)
0112B mov dword ptr [rsp+28h],1

// stash the vtable pointer in the functor
01133 mov qword ptr [rsp+20h],rax

// prepare the functor as the first arg to call_functor
01138 lea rcx,[rsp+20h]

// stash the functor "this" pointer into the functor storage,
// this is local storage for captured state inside the functor.
// if we had more captured state this would have been the
// result of a malloc
0113D lea rax,[rsp+20h]
01142 mov qword ptr [rsp+58h],rax

// call the method to dispatch the functor
01147 call call_functor
[a](int x)
{
// note that there is no assembly for this printf, as we will
// see it's been inlined into a virtual function of std::function
// this is a debugging catastrophe, you can't put a breakpoint
// on this line and it will never show up in any stack trace
printf("f1 %d\n", a + x);
}
);

call_functor(
// exactly the same business for the other functor
0114C lea rax,[std::...<lambda_2>...::`vftable']
01153 mov dword ptr [rsp+28h],2
0115B mov qword ptr [rsp+20h],rax
01160 lea rcx,[rsp+20h]
01165 lea rax,[rsp+20h]
0116A mov qword ptr [rsp+58h],rax
0116F call call_functor
[b](int x)
{
printf("f2 %d\n", b + x);
}
);
}
// restore the stack and we're outta here
01174 add rsp,68h
01178 ret
This code is 89 bytes, plus the two vtables, each consisting of seven 8-byte slots, yielding 112 bytes of vtable. The total cost is 201 bytes.
Now let’s have a look at the function that actually uses a functor, what does that look like?
void call_functor(std::function<void(int)> func) {
// the frame looks like this:
//
// rsp + 00h : home storage 0 for our callees
// rsp + 08h : home storage 1 for our callees
// rsp + 10h : home storage 2 for our callees
// rsp + 18h : home storage 3 for our callees
// rsp + 20h : "func"
// rsp + 28h : temp storage (holds 1000)
// rsp + 30h : security swizzle
// rsp + 38h :
// rsp + 40h : pushed rbx
// rsp + 48h : return address
// rsp + 50h : home storage 0 for us, why isn't "func" here?
// rsp + 58h : home storage 1 for us, why isn't "1000" here?
// rsp + 60h : home storage 2 for us
// rsp + 68h : home storage 3 for us

01070 push rbx
01072 sub rsp,40h

// stash the security cookie on the stack
01076 mov rax,qword ptr [__security_cookie]
0107D xor rax,rsp
01080 mov qword ptr [rsp+30h],rax

// save our incoming arg, this is the pointer to the functor,
// keep in a local variable "func" and also in rbx
01085 mov rbx,rcx
01088 mov qword ptr [rsp+20h],rcx

func(1000);
// stash the arg we will need later, 1000, on the stack
// in temp storage, functors pass all args as an array arg
0108D mov dword ptr [rsp+28h],3E8h

// use rcx, which is the pointer to the functor to get the
// offset to the "this" pointer. We have to do this because
// the storage for the capture state might be on the heap
// so [rcx+38h] will either point to the heap or back to
// itself, in our case this rcx+38 will in fact point to rcx
// because our captured state is small in this demo.
01095 mov rcx,qword ptr [rcx+38h]

// if the this pointer is null something has gone very wrong...
01099 test rcx,rcx
0109C jne call_functor+35h

// if the method is null, we fault right here
0109E call qword ptr [__imp_std::_Xbad_function_call]
010A4 int 3

// now we have what you could call a normal object pointer
// back in rcx we load the vtable pointer into rax
010A5 mov rax,qword ptr [rcx]

// now we reload the address of our arg (1000) from where
// we stashed it into rdx, so we're set to make a two-arg call
010A8 lea rdx,[rsp+28h]

// the args for the call are going to be the "this" pointer
// for the anonymous class that is the capture in rcx and
// the pointer to the other args in rdx
010AD call qword ptr [rax+10h]
010B0 nop
}

// we're back from the call now we have to destroy the functor,
// we once again fetch the effective this pointer for the
// captured values
010B1 mov rcx,qword ptr [rbx+38h]
010B5 test rcx,rcx

// if the saved this is null (it really can't be at this point but ok)
// then we skip the cleanup
010B8 je call_functor+5Eh

// given an ok this pointer, now we are going to call the
// deallocation code. We compute a bool which is true if the
// stored this is not the same as the functor pointer itself;
// this tells us that there is storage to free. Recall that rbx
// has the incoming functor pointer and rcx is the effective this
// so comparing them tells us if there is an allocation.
010BA mov rax,qword ptr [rcx] // vtable
010BD cmp rcx,rbx // rcx "this", rbx is functor
010C0 setne dl // dl true if rcx != rbx

// call cleanup from the vtable
010C3 call qword ptr [rax+20h]

// functor has been destroyed, clobber the "this" pointer in it
010C6 mov qword ptr [rbx+38h],0

// recompute the security cookie
010CE mov rcx,qword ptr [rsp+30h]
010D3 xor rcx,rsp
010D6 call __security_check_cookie

// restore the stack and we're done
010DB add rsp,40h
010DF pop rbx
010E0 ret
So we see the code to call a functor is 113 bytes.
Let’s have a look at the actual lambda bodies now.
Here we see the problem we mentioned above, the body of our function has been inlined into _Do_call
in the <functional>
header and the debug information somehow isn't right, the line numbers do not cross link to the original source. This is a catastrophe for debugging as I mentioned above. However, it's not a fatal flaw in functor design, it's just a bug.
...\include\functional
_Rx _Do_call(_Types&&... _Args) override { // call wrapped function
return _Invoker_ret<_Rx>::_Call(_Callee, _STD forward<_Types>(_Args)...);

// we take the pointer to our args and stash it in rax
012B0 mov rax,rdx

// here we recover our captured state, loading it into edx (32 bits)
012B3 mov edx,dword ptr [rcx+8]

// we load the first arg for printf, the string
012B6 lea rcx,[string "f1 %d\n"]

// we compute the required sum, adding the captured value edx and
// the argument, we have a pointer to the args in rax
// and of course we do 32 bit math because it's all ints, not int64
012BD add edx,dword ptr [rax]

// tail call to printf
012BF jmp printf
This code is 19 bytes. The second functor generates exactly the same code with the same inlining problem, so I won’t repeat it. That’s another 19 bytes.
Let’s look at some of the required helpers, we have to emit these so that the vtable can point to them.
This one is our deallocation path:
...\include\functional

this->~_Func_impl_no_alloc();
if (_Dealloc) {
// entry point for our deallocator
// recall dl tells us if a free is needed
01230 test dl,dl
// if no alloc, then skip
01232 je 0123E
_Deallocate<alignof(_Func_impl_no_alloc)>(this, sizeof(_Func_impl_no_alloc));

// rcx came in with the this pointer, and the length is 16 bytes,
// do the free. This is never going to run in our case; it seems
// like we could know that we don't need this path at compile time
// but the template doesn't quite figure it out even though it knows
// the alloc size is 16 bytes hence too small to need an alloc.
// If we passed in the size rather than the bool we might be able
// to figure this branch out at compile time.
01234 mov edx,10h
// tail call to delete
01239 jmp operator delete // this never runs, our block is small
}
}
// small size, normal return
0123E ret
This is a 14 byte helper.
Next we have a helper that computes the base address of the captured variables
return _STD addressof(_Callee);
// offset past the vtable pointer and that's it
01240 lea rax,[rcx+8]
01244 ret
That’s a thin 5 bytes.
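Next we have an out-of-line version of the std::function destructor. Nothing in the listings above calls it, but it gets emitted anyway. (The destructor of std::function is not itself virtual; the virtual pieces live in the wrapped implementation class that it dispatches to through the vtable.) Note that it delegates to the same cleanup code as before, which is kind of like its base destructor. It uses the same trick of loading dl with a boolean and calling a dealloc helper. This is all here even though we only captured an int.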
std::function<void __cdecl(int)>::~function<void __cdecl(int)>(void):
010F0 push rbx
010F2 sub rsp,20h

// stash the incoming this in rbx, a preserved register
010F6 mov rbx,rcx

// get the effective this pointer
010F9 mov rcx,qword ptr [rcx+38h]
010FD test rcx,rcx

// a null effective this indicates destruction has already happened
// or at least is not needed. Skip everything.
01100 je 01116

// set up for the comparison and leave the result in dl like before
01102 mov rax,qword ptr [rcx]
01105 cmp rcx,rbx
01108 setne dl

// maybe delete the allocated block
0110B call qword ptr [rax+20h]

// null out the effective this pointer
0110E mov qword ptr [rbx+38h],0

// cleanup the frame
01116 add rsp,20h
0111A pop rbx
0111B ret
This is 44 bytes.
Finally, this one seems to be a copy constructor. There may be some others that I missed but let’s stop here.
return ::new (_Where) _Func_impl_no_alloc(_Callee);

// put the vtable into the target
012D0 lea rax,[...:<lambda_1>...::`vftable']
012D7 mov qword ptr [rdx],rax

// copy the captured int
012DA mov eax,dword ptr [rcx+8]
012DD mov dword ptr [rdx+8],eax

// return the target
012E0 mov rax,rdx
012E3 ret
It’s 19 bytes.
Let’s total this up:
functors_test: 89 + 112 = 201
call_functor              113
lambda1                    19
lambda2                    19
conditional dealloc        14
this calc                   5
destructor                 44
copy constructor           19
                        -----
                          434
                        =====
That’s a total of 434 bytes to do two lambda calls. Now I’m inclined to remove the cost of the body of the lambdas because we’re looking at the functor overhead not lambda generated code. This is cheating a little because there is a weird arg convention that affects the lambda codegen but I think we could argue that it’s 396 bytes of functor stuff. And any virtual function bodies I missed.
Let’s now look at the “old school” method. This isn’t completely fair but it is a good floor in terms of “what’s the least you could do”. So with that in mind, let’s have a look first at the function_ptrs_test
disassembly.
void function_ptrs_test()
{
// frame after it's been set up
// rsp + 00h : home storage 0
// rsp + 08h : home storage 1
// rsp + 10h : home storage 2
// rsp + 18h : home storage 3
// rsp + 20h
// rsp + 28h : security cookie
// rsp + 30h
// rsp + 38h : return address

// set up the frame
01190 sub rsp,38h

// store the security cookie
01194 mov rax,qword ptr [__security_cookie]
0119B xor rax,rsp
0119E mov qword ptr [rsp+28h],rax

// this next section is a lot of ceremony but it makes no code!

int a = 1;
int b = 2;

// I have to have two cases with differing args
// and side-effects or it folds everything which defeats
// the purpose of the benchmark

struct worker1 {
int a;
__declspec(noinline) static void go(void *data, int x)
{
// recover context old school
auto w = (worker1*)data;
printf("p1 %d\n", w->a + x);
}
};

struct worker2 {
int b;
__declspec(noinline) static void go(void *data, int x)
{
auto w = (worker2*)data; // recover context old school
printf("p2 %d\n", w->b + x);
}
};

// this is the captured state
worker1 w1 = { a };
worker2 w2 = { b };

call_func_ptr(&w1, worker1::go);

// load the address of worker1::go into rdx, that's the 2nd arg
011A3 lea rdx,[`function_ptrs_test'::`2'::worker1::go]

// prepare w1
011AA mov dword ptr [w1],1

// get the address of w1, this will be our void*
011B2 lea rcx,[w1]

// pre load 2 (overlapped) for the next call
011B7 mov dword ptr [w2],2

// dispatch first function pointer
011BF call call_func_ptr

call_func_ptr(&w2, worker2::go);

// load worker2::go into rdx
011C4 lea rdx,[`function_ptrs_test'::`2'::worker2::go]

// get the address of w2 for the second call
011CB lea rcx,[w2]

// dispatch the second call
011D0 call call_func_ptr
}

// test security cookie ok exit, note that no destructors were
// required and none were emitted
011D5 mov rcx,qword ptr [rsp+28h]
011DA xor rcx,rsp
011DD call __security_check_cookie
// cleanup the frame and we're done
011E2 add rsp,38h
011E6 ret
This is 86 bytes of code. No vtables. 28 bytes were security cookie management (same as the other).
Now let’s look at the pointer call. This will also be very simple:
void call_func_ptr(void* data, void (*pfn)(void *, int))
{
// move the function pointer we need to call into rax
01180 mov rax,rdx

return pfn(data, 1000);

// move the 1000 into edx, arg2
// arg1 is already good to go and in rcx, we do nothing
01183 mov edx,3E8h

// tail call, args in rcx and rdx as usual
01188 jmp rax
The old school functor caller is a thin 10 bytes.
Next, we look at both functions… note that they have good source info, and they will have good symbol names in stacks.
// recover context old school
auto w = (worker1*)data;
printf("p1 %d\n", w->a + x);

// do the addition w->a + x in one op
011F0 add edx,dword ptr [rcx]

// get the format string
011F2 lea rcx,[string "p1 %d\n"]

// tail call
011F9 jmp printf
Both functions are the same, and they are 13 bytes each. The only difference in the generated code is which string literal they use.
That’s all there is…
Here’s the final tally:
function_ptrs_test   86
call_func_ptr        10
worker1::go          13
worker2::go          13
                  -----
                    122
                  =====
Now comparing the old school cost, vs. modern C++. We exclude the body of the functors from both, even though the C++ one is a bit worse… they can be arbitrarily big.
Then we get
Modern C++: 434–19–19 = 396 bytes.
Old school: 122–13–13 = 96 bytes.
That’s a difference of 300 bytes, or a factor of 4.125. This factor is basically the cost of the setup and teardown that comes with a simple functor pattern.
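If you want to check the arithmetic, it can be written out as compile-time constants. The byte counts are copied straight from the two tallies; the variable names are just for illustration.

```cpp
// Byte counts copied from the tallies; names are illustrative only.
constexpr int modern_total = 201 + 113 + 19 + 19 + 14 + 5 + 44 + 19;
constexpr int old_total    = 86 + 10 + 13 + 13;

// Drop the callback bodies from both sides to isolate the overhead.
constexpr int modern_overhead = modern_total - 19 - 19;
constexpr int old_overhead    = old_total - 13 - 13;

constexpr double ratio =
    static_cast<double>(modern_overhead) / old_overhead;

static_assert(modern_total == 434, "modern tally");
static_assert(old_total == 122, "old school tally");
static_assert(modern_overhead == 396, "modern overhead");
static_assert(old_overhead == 96, "old school overhead");
static_assert(modern_overhead - old_overhead == 300, "difference");
static_assert(ratio == 4.125, "396 / 96 is exactly 33 / 8");
```

Every assertion holds, so the sums and the 4.125 factor are consistent with the per-function numbers as listed.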
For comparison, to get a 4.1x slowdown you need to compare these processors (single threaded). So, you’re downgrading your core to a 2010 processor. And I think a 4.1x growth in size resulting in a 4.1x slowdown in speed is actually generous, it’s probably a lot worse with all those non-local calls.
2023: Intel Core i9-13900KS threadmark: 4794
2010: Intel Pentium G6951 @ 2.80GHz threadmark: 1173
If we were to discount the fixed overhead for the security cookie (28 bytes each) it ends up as 368 vs. 68 or a factor of 5.4.
2008: Intel Pentium E2220 @ 2.40GHz threadmark: 895
Yeah, that’s not nothing.