C# Lambdas: A Code Teardown

Aug 31, 2023

Continuing in my series of teardowns, I took some time to do an assembly language teardown of some C# code that uses lambdas in a very simple way. The disassembled code below is a release build for Windows, so it uses the Windows ABI.

The complete source code for the test case is as follows:

using System;
class MyProgram
{
    static void call_functor(Action<int> func)
    {
        func(1000);
    }

    static void functors_test()
    {
        int a = 1;
        var x = (int x) =>
        {
            Console.WriteLine("f1 {0}", a + x);
        };
        call_functor(x);
    }

    static void Main(string[] args)
    {
        functors_test();
    }
}

We know that .NET binaries include significant overhead for garbage collector information, exception handling, and class metadata. This is magnified for toy examples, so the percentage of overhead here should be taken with a large grain of salt. However, the code quality and the heap behavior are pretty representative of what goes on in .NET more generally, and there are no array bounds checks here. A helper function is used which hides the cost of the write-barrier code, but this is fair in that the cost to any given user is a call instead of a store, at least from a size perspective.

Analysis

Let’s dig into this code. We’ll look at ‘Main’ first. It doesn’t play a part in the comparison because in the Rust example the equivalent of functors_test was main (i.e. there was no helper). Still, it's easy and it's a place to start.

// this is the code for main, it's 15 bytes
    static void Main(string[] args)
    {
        functors_test();

// this is the standard preamble for a standard static function
// note that it reserves the extra home storage space on the
// stack for callees per the Windows ABI
00000F7CF0  sub         rsp,28h  
00000F7CF4  call        CLRStub[MethodDescPrestub]@0de3e8 (00DE3E8h)  

// I dunno why we need this, maybe alignment for EH or something...
00000F7CF9  nop  

// cleanup and return
00000F7CFA  add         rsp,28h  
00000F7CFE  ret

// assorted GC Info and other metadata stuff (33 bytes)
0xF7CFF 0xF7D20  [stuff]

OK that was simple enough, a little 15 byte baby function.

Now that we’ve got our feet wet let’s look at our first “real” function. Here is functors_test.

// this function is 72 bytes
    static void functors_test()
    {
        int a = 1;
// standard register storage and home storage reservation
00000F7D20  push        rdi  
00000F7D21  push        rsi  
00000F7D22  sub         rsp,28h  

// here we make our shared frame object for captured values
// and store the constant 1 in it, note that the locals
// were hoisted onto the heap because in .NET lambda capture
// is always by reference
00000F7D26  mov         rcx,212AA0h  
00000F7D30  call        CORINFO_HELP_NEWSFAST
00000F7D35  mov         rsi,rax  
00000F7D38  mov         dword ptr [rsi+8],1  

// here we make our lambda object
00000F7D3F  mov         rcx,212D38h  
00000F7D49  call        CORINFO_HELP_NEWSFAST
00000F7D4E  mov         rdi,rax  

// this is where the captured state will go in the lambda object
00000F7D51  lea         rcx,[rdi+8]  

// rsi had our shared frame object we have to stash it into
// the lambda that we just made
00000F7D55  mov         rdx,rsi  
00000F7D58  call        CORINFO_HELP_ASSIGN_REF 

// now we need to stash the target address of the stub
// that is the actual body of our lambda
00000F7D5D  mov         rcx,offset CLRStub[MethodDescPrestub]@0f7770
00000F7D67  mov         qword ptr [rdi+18h],rcx  

// we have a valid lambda at this point, we can use it!
        call_functor(x);

Note that we needed two heap allocations here: one for the lambda object (remember, every lambda is an anonymous class) and another for the captured locals. In .NET the captured locals are hoisted into a heap object that will be shared by all lambdas defined in the function. This is how capture by reference works in .NET.
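To make the two allocations concrete, here is a hand-written approximation of the shape the compiler produces for functors_test. The names Closure, Body and LoweredSketch are made up for illustration (the real compiler-generated names are not valid C# identifiers), and the method returns the delegate only so it can be inspected. Treat it as a sketch of the general pattern, not the exact lowering.

```csharp
using System;

// Hypothetical stand-in for the compiler-generated closure class.
sealed class Closure
{
    public int a;                   // the hoisted local 'a'

    public void Body(int x)         // the lambda body becomes an instance method
    {
        Console.WriteLine("f1 {0}", a + x);
    }
}

static class LoweredSketch
{
    static void call_functor(Action<int> func)
    {
        func(1000);
    }

    public static Action<int> functors_test()
    {
        var frame = new Closure();  // allocation 1: the shared frame object
        frame.a = 1;                // the store of the constant 1
        Action<int> x = frame.Body; // allocation 2: the delegate (target + method pointer)
        call_functor(x);
        return x;                   // returned only so the sketch can be inspected
    }

    static void Main()
    {
        Action<int> d = functors_test();
        // The delegate refers to the frame object rather than holding a copy of 'a'.
        Console.WriteLine(d.Target is Closure);
    }
}
```

The delegate's Target property corresponds to the reference stored with CORINFO_HELP_ASSIGN_REF in the listing, and the method pointer corresponds to the stub address stored right after it.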

The actual call to call_functor was inlined, which I didn’t allow to happen in the native cases. This gives C# a small advantage, maybe 10 bytes for the call and the saved post-amble.

What follows next is the inlined version of call_functor. Note: in the native language test cases I was forced to make it not inline because otherwise the native optimizations basically make the entire lambda vanish. That doesn't happen in .NET. The .NET code is slightly more realistic.

// this is effectively the code for call_functor, it is 20 bytes

// we get the captured state and load it up into rcx
00000F7D6B  mov         rcx,qword ptr [rdi+8] 

// the argument is 1000 (we're calling functor(1000)), 
// this was inlined because it's always a constant here
// so the argument just flowed.
00000F7D6F  mov         edx,3E8h  

// now fetch the target of the call from the lambda
00000F7D74  mov         rax,qword ptr [rdi+18h]  

// next clean up the stack
00000F7D78  add         rsp,28h  
00000F7D7C  pop         rsi  
00000F7D7D  pop         rdi  

// and finally, tail call the actual lambda body
00000F7D7E  jmp         rax  

// 1529 bytes of overhead metadata etc. There is quite a lot in this chunk.
// This chunk probably includes overhead for other helpers that happened
// to land here in the text section.  Still, it all counts.
00000F7D81 00000F837A

So we get a 20-byte function to invoke the lambda.
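For anyone who wants to put the C# version on the same footing as the native test cases, inlining of call_functor can be suppressed with the standard MethodImplOptions.NoInlining attribute. The sketch below shows how that might look. It was not used for the listing above, so the byte counts would differ, and the class name NoInlineSketch is only illustrative.

```csharp
using System;
using System.Runtime.CompilerServices;

static class NoInlineSketch
{
    // Asks the JIT not to inline this method into its callers,
    // so the call and its prologue/epilogue stay visible.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void call_functor(Action<int> func)
    {
        func(1000);
    }

    public static void functors_test()
    {
        int a = 1;
        Action<int> f = (int x) =>
        {
            Console.WriteLine("f1 {0}", a + x);
        };
        call_functor(f);
    }

    static void Main()
    {
        functors_test();
    }
}
```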

Last but not least, the code block associated with the lambda. Here we notice that we used Console.WriteLine, which is “varargs” in spirit: it takes its format arguments as object, and so the int argument had to be boxed.

// The lambda body is 59 bytes
            Console.WriteLine("f1 {0}", a + x);
00000F8140  push        rdi  
00000F8141  push        rsi  
00000F8142  sub         rsp,28h

// This is the "this" pointer for the lambda.
00000F8146  mov         rsi,rcx  

// This is "x" the incoming argument, stash it in edi.
00000F8149  mov         edi,edx  

// we make a new object of type 165FD0h  
00000F814B  mov         rcx,165FD0h  
00000F8155  call        CORINFO_HELP_NEWSFAST

// Now add the 'a' variable field to 'x' arg and store it in edi.
00000F815A  add         edi,dword ptr [rsi+8]  

// This is the computed int argument for Console.WriteLine,
// it arrives as a boxed integer.
00000F815D  mov         dword ptr [rax+8],edi  
00000F8160  mov         rdx,rax  

// This is a string handle for a string literal, which we fetch;
// it will be the first argument, i.e., the format string.
00000F8163  mov         rcx,26718005E78h  
00000F816D  mov         rcx,qword ptr [rcx]

// Now we clean up the stack and make the call to WriteLine.
00000F8170  add         rsp,28h  
00000F8174  pop         rsi  
00000F8175  pop         rdi  

// Tail call optimized invocation of Console.WriteLine.
00000F8176  jmp         CLRStub[MethodDescPrestub]@0f8048 (00F8048h)  

// Overhead, GCInfo etc. for this function 18 bytes
00000F817B  to 00000F818D
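The boxing in the lambda body can be made explicit in source. The sketch below spells out the int-to-object conversion that the compiler inserts implicitly. BoxingSketch and BoxSum are hypothetical names used only for this illustration.

```csharp
using System;

static class BoxingSketch
{
    public static object BoxSum(int a, int x)
    {
        // Converting an int to object allocates a box on the heap and
        // copies the value into it; this is the third allocation.
        object boxed = a + x;
        return boxed;
    }

    static void Main()
    {
        // Binds to Console.WriteLine(string, object), like the lambda body does.
        Console.WriteLine("f1 {0}", BoxSum(1, 1000));
    }
}
```

This lines up with the CORINFO_HELP_NEWSFAST call in the lambda body followed by the store of the computed sum into the new object.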

Comparison

So, keeping the costs above in mind, how do we fare?

Well, disregarding the metadata overheads and examining only raw code size, we get 151 bytes for C# vs. 70 bytes for Rust, more than a factor of two. If you consider the code that runs for the helper functions and the marginal cost of the extra heap allocations, it’s pretty easy to imagine that the true CPU overhead of C# vs. Rust will be more like a factor of 3. And the above is just simple idiomatic C#.

                         C#              Rust
functors_test:           72 bytes        47 bytes
call_functor:            20 bytes         9 bytes
lambda body:             59 bytes        14 bytes
total code:             151 bytes        70 bytes
extra inline savings:   ~10 bytes         0 bytes *
heap allocations:         3 allocs        0 allocs
additional overhead    1580 bytes         0 bytes
code + overhead        1731 bytes        70 bytes

* see above, Rust was not allowed to inline, C# got a small bonus.

Of course, drawing broad conclusions from just one micro-benchmark is not really supportable. But we can get a sense of what code patterns typically look like. It’s fair to say that raw code size for C# will be quite a bit bigger for these kinds of patterns and, generally, Rust will lower more favorably because of saved write-barriers and more stack usage.

It’s not hard to imagine what typical object assignments would look like using the patterns in the above. But again, reaching too far is not recommended.

Written by Rico Mariani