C# Lambdas: A Code Teardown
Continuing in my series of teardowns, I took some time to do an assembly language teardown of some C# code that uses lambdas in a very simple way. The disassembled code below is a release build for Windows, so it uses the Windows ABI.
The complete source code for the test case is as follows:
using System;
class MyProgram
{
    static void call_functor(Action<int> func)
    {
        func(1000);
    }
    static void functors_test()
    {
        int a = 1;
        var x = (int x) =>
        {
            Console.WriteLine("f1 {0}", a + x);
        };
        call_functor(x);
    }
    static void Main(string[] args)
    {
        functors_test();
    }
}
Now we know that .NET binaries include significant overhead for garbage collector information, for exception handling, and for class metadata. This is magnified for toy examples, so one should take the percentage of overhead here with a large grain of salt. However, the code quality and the heap behavior are pretty representative of what goes on in .NET more generally, and there are no array bounds checks here. A helper function is used, which hides the cost of the write-barrier code, but this is fair in that the cost to any given user is a call instead of a store, at least from a size perspective.
Analysis
Let’s dig into this code. We’ll look at ‘Main’ first. It doesn’t play a part in the comparison because in the Rust example the equivalent of functors_test was main (i.e. there was no helper). But still, it's easy and it's a place to start.
// this is the code for main, it's 15 bytes
static void Main(string[] args)
{
functors_test();
// this is the standard preamble for a standard static function
// note that it reserves the extra home storage space on the
// stack for callees per the Windows ABI
00000F7CF0 sub rsp,28h
00000F7CF4 call CLRStub[MethodDescPrestub]@0de3e8 (00DE3E8h)
// I dunno why we need this, maybe alignment for EH or something...
00000F7CF9 nop
// cleanup and return
00000F7CFA add rsp,28h
00000F7CFE ret
// assorted GC Info and other metadata stuff (33 bytes)
0xF7CFF to 0xF7D20 [stuff]
OK, that was simple enough: a little 15-byte baby function.
Now that we’ve got our feet wet, let’s look at our first “real” function. Here is functors_test.
// this function is 72 bytes
static void functors_test()
{
int a = 1;
// standard register storage and home storage reservation
00000F7D20 push rdi
00000F7D21 push rsi
00000F7D22 sub rsp,28h
// here we make our shared frame object for captured values
// and store the constant 1 in it, note that the locals
// were hoisted onto the heap because in .NET lambda capture
// is always by reference
00000F7D26 mov rcx,212AA0h
00000F7D30 call CORINFO_HELP_NEWSFAST
00000F7D35 mov rsi,rax
00000F7D38 mov dword ptr [rsi+8],1
// here we make our lambda object
00000F7D3F mov rcx,212D38h
00000F7D49 call CORINFO_HELP_NEWSFAST
00000F7D4E mov rdi,rax
// this is where the captured state will go in the lambda object
00000F7D51 lea rcx,[rdi+8]
// rsi had our shared frame object we have to stash it into
// the lambda that we just made
00000F7D55 mov rdx,rsi
00000F7D58 call CORINFO_HELP_ASSIGN_REF
// now we need to stash the target address of the stub
// that is the actual body of our lambda
00000F7D5D mov rcx,offset CLRStub[MethodDescPrestub]@0f7770
00000F7D67 mov qword ptr [rdi+18h],rcx
// we have a valid lambda at this point, we can use it!
call_functor(x);
Note that we needed two heap allocations here: one for the lambda object (remember every lambda is an anonymous class) and another for the captured locals. In .NET the captured locals are hoisted into a heap object that will be shared by all lambdas defined in the function. This is how capture by reference works in .NET.
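To make the two allocations concrete, here is a hand-written sketch of roughly what the compiler's lowering looks like. The names CapturedLocals, Body and ClosureSketch are hypothetical and chosen for illustration; the real generated type has a compiler-chosen name, and this is an approximation of the pattern, not a dump of the actual output.

```csharp
using System;

// Hypothetical stand-in for the compiler-generated closure class.
sealed class CapturedLocals
{
    public int a;                      // the hoisted local 'a'

    // The lambda body becomes an instance method on the closure object.
    public void Body(int x)
    {
        Console.WriteLine("f1 {0}", a + x);
    }
}

static class ClosureSketch
{
    static void call_functor(Action<int> func)
    {
        func(1000);
    }

    public static void functors_test()
    {
        var locals = new CapturedLocals();      // allocation 1: shared frame object
        locals.a = 1;
        var x = new Action<int>(locals.Body);   // allocation 2: the delegate
        call_functor(x);                        // prints "f1 1001"

        // Capture is by reference: the delegate sees later writes to 'a'.
        locals.a = 2;
        call_functor(x);                        // prints "f1 1002"
    }

    static void Main()
    {
        functors_test();
    }
}
```

Read this way, the stores to [rdi+8] and [rdi+18h] in the listing plausibly correspond to the delegate's target object and its method pointer.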
The actual call to call_functor was inlined, which I didn’t allow to happen in the native cases. This change gives C# a small advantage, maybe 10 bytes for the call and the saved postamble.
What follows next is the inlined version of call_functor. Note: in the native language test cases I was forced to make it not inline because otherwise the native optimizations basically make the entire lambda vanish. That doesn't happen in .NET. The .NET code is slightly more realistic.
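For anyone who wants the C# setup to match the native one more closely, the helper can be marked so the JIT does not inline it. This is only a sketch of that variation, not what was measured here; the class name NoInlineSketch is hypothetical, and the byte counts in this article would change if this version were built.

```csharp
using System;
using System.Runtime.CompilerServices;

static class NoInlineSketch
{
    // Asks the JIT not to inline this method, so the call site stays a real call.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void call_functor(Action<int> func)
    {
        func(1000);
    }

    static void Main()
    {
        int a = 1;
        Action<int> f = x => Console.WriteLine("f1 {0}", a + x);
        call_functor(f);   // prints "f1 1001"
    }
}
```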
// this is effectively the code for call_functor, it is 20 bytes
// we get the captured state and load it up into rcx
00000F7D6B mov rcx,qword ptr [rdi+8]
// the argument is 1000 (we're calling functor(1000)),
// this was inlined because it's always a constant here
// so the argument just flowed.
00000F7D6F mov edx,3E8h
// now fetch the target of the call from the lambda
00000F7D74 mov rax,qword ptr [rdi+18h]
// next clean up the stack
00000F7D78 add rsp,28h
00000F7D7C pop rsi
00000F7D7D pop rdi
// and finally, tail call the actual lambda body
00000F7D7E jmp rax
// 1529 bytes of overhead metadata etc. There is quite a lot in this chunk.
// This chunk probably includes overhead for other helpers that happened
// to land here in the text section. Still, it all counts.
00000F7D81 to 00000F837A
So we get a 20-byte function to invoke the lambda.
Last but not least, the code block associated with the lambda. Here we notice that we used Console.WriteLine, which takes its format arguments as object (it is effectively “varargs”), so the int argument had to be boxed.
// The lambda body is 59 bytes
Console.WriteLine("f1 {0}", a + x);
00000F8140 push rdi
00000F8141 push rsi
00000F8142 sub rsp,28h
// This is the "this" pointer for the lambda.
00000F8146 mov rsi,rcx
// This is "x" the incoming argument, stash it in edi.
00000F8149 mov edi,edx
// we make a new object of type 165FD0h
00000F814B mov rcx,165FD0h
00000F8155 call CORINFO_HELP_NEWSFAST
// Now add the 'a' variable field to 'x' arg and store it in edi.
00000F815A add edi,dword ptr [rsi+8]
// This is the computed int argument for Console.WriteLine,
// it arrives as a boxed integer.
00000F815D mov dword ptr [rax+8],edi
00000F8160 mov rdx,rax
// This is a string handle for a string literal which we fetch
// it will be the first argument, i.e., the format string.
00000F8163 mov rcx,26718005E78h
00000F816D mov rcx,qword ptr [rcx]
// Now we clean up the stack and make the call to WriteLine.
00000F8170 add rsp,28h
00000F8174 pop rsi
00000F8175 pop rdi
// Tail call optimized invocation of Console.WriteLine.
00000F8176 jmp CLRStub[MethodDescPrestub]@0f8048 (00F8048h)
// Overhead, GCInfo etc. for this function 18 bytes
00000F817B to 00000F818D
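The boxed integer in the listing above can be spelled out in source form. The sketch below makes the box explicit and shows that each boxing conversion yields a distinct heap object; the names BoxingSketch and Format are hypothetical, and string.Format is used here only so the result can be inspected as a string.

```csharp
using System;

static class BoxingSketch
{
    public static string Format(int a, int x)
    {
        // Explicit box: one heap allocation that holds the int.
        object boxed = a + x;
        return string.Format("f1 {0}", boxed);
    }

    static void Main()
    {
        object first = 1001;
        object second = 1001;

        // Two boxing conversions produce two different objects.
        Console.WriteLine(ReferenceEquals(first, second));   // prints "False"
        Console.WriteLine(Format(1, 1000));                  // prints "f1 1001"
    }
}
```

This is the third heap allocation in the test case, on top of the two made in functors_test.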
Comparison
So, keeping the costs above in mind, how do we fare?
Well, disregarding the metadata overheads, examining only raw code size we get 151 bytes C# vs. 70 bytes Rust. More than a factor of two. If you consider the code that runs for the helper functions and the marginal cost of the extra heap allocations it’s pretty easy to imagine that the true CPU overhead of C# vs. Rust will be more like a factor of 3. And the above is just simple idiomatic C#.
                        C#            Rust
functors_test:          72 bytes      47 bytes
call_functor:           20 bytes      9 bytes
lambda body:            59 bytes      14 bytes
total code:             151 bytes     70 bytes
extra inline savings:   ~10 bytes     0 bytes *
heap allocations:       3 allocs      0 allocs
additional overhead:    1580 bytes    0 bytes
code + overhead:        1731 bytes    70 bytes
* see above, Rust was not allowed to inline, C# got a small bonus.
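As a quick check on the arithmetic, the code totals and the overhead figure can be recomputed from the per-function numbers quoted in the listings. This is just a small sketch; the names SizeTotals and CodeBytes are made up for the purpose.

```csharp
using System;
using System.Globalization;

static class SizeTotals
{
    // Sum of the three code pieces compared in the table.
    public static int CodeBytes(int functorsTest, int callFunctor, int lambdaBody)
        => functorsTest + callFunctor + lambdaBody;

    static void Main()
    {
        int csharpCode = CodeBytes(72, 20, 59);
        int rustCode = CodeBytes(47, 9, 14);

        // Metadata chunks noted after Main, functors_test and the lambda body.
        int csharpOverhead = 33 + 1529 + 18;

        double ratio = (double)csharpCode / rustCode;

        Console.WriteLine("C# code: {0}", csharpCode);           // prints "C# code: 151"
        Console.WriteLine("Rust code: {0}", rustCode);           // prints "Rust code: 70"
        Console.WriteLine("C# overhead: {0}", csharpOverhead);   // prints "C# overhead: 1580"
        Console.WriteLine("ratio: {0}",
            ratio.ToString("F2", CultureInfo.InvariantCulture)); // prints "ratio: 2.16"
    }
}
```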
Of course, drawing broad conclusions from just one micro-benchmark is not really supportable. But we can get a sense of what code patterns typically look like. It’s fair to say that raw code size for C# will be quite a bit bigger for these kinds of patterns and that, generally, Rust will lower more favorably because of saved write barriers and more stack usage.
It’s not hard to imagine what typical object assignments would look like using the patterns in the above. But again, reaching too far is not recommended.