Performance Improvements in .NET 8
This is a summary of the excellent and lengthy document by Stephen Toub
In the interest of making it easier to find the original source material, I have (nearly) reproduced the original outline with links to the original document, and I will add some summary notes in each section. Keep in mind that this is somewhat of an opinion piece at this point, because naturally I'm going to talk about the things that I think are the most exciting through the lens of what I'm working on at the moment. YMMV, and I don't want people to think their work is unimportant just because it doesn't happen to align with what I happen to think is important at this particular moment.
Also, importantly, in the interest of space I am not citing the benchmark data from the original document. Most of these gains are quantified there, so click through to the base document for more information.
Note: There is always some chance I've misunderstood something in some part of the original document. If in doubt, click through. The original is of course going to be authoritative.
JIT
JIT investments are enormous in .NET 8, and broadly you can think of them as doing the kinds of things you would expect a high-quality JIT for a dynamic language (e.g., JavaScript) to do. Aggressive de-virtualization allows huge simplifications in codegen on the hot branches. By itself this technology could justify moving to .NET 8 for any server workload. It's less interesting for code that will only ever run once.
Tiering and Dynamic PGO
This is where the big magic is happening. Long-running methods get On-Stack Replacement (yes, swapping the code out while it's still running). A second recompilation can assume static readonly fields have been initialized, so they are now constants to the JIT. Dynamic Profile Guided Optimization is on by default: the first optimized version generated includes instrumentation, and the method is regenerated after sampling (reservoir style) to make inferences about the types that appear commonly in execution. Type-based de-virtualization can (and does) result in inlining where possible, drastically reducing the cost of interfaces and delegates. The slick tricks used to count calls with low interlock overhead are worth a paper by themselves, but the results here can be staggering, especially because general-purpose methods you didn't write can benefit from optimizations based on how they are being used in your application. This has the possibility of being better than you can do with static PGO and a great perf lab, regardless of how good your ahead-of-time compilation tech is. Note that often only a small fraction of your code needs this level of optimization, as the most common case is still that methods run zero or one times. The JIT also uses static analysis to guess which paths are likely to be hot; even the old NGEN used to do this (e.g., paths that throw are cold). Note also that de-virtualization checks can be hoisted out of loops, and with loop cloning you can get a fast-path loop and a normal-path loop.
Vectorization
This has been an ongoing theme in .NET, going as far back as .NET Core 3.0. .NET 8 has thousands of intrinsics for operating on Vector128<T>, Vector256<T>, and Vector512<T>. These are hardware accelerated where possible, but still portable. These operations are especially of interest in the computation of complex hashes. Note that Arm64 has different accelerations than x64, but the same vector types are available on both.
Branching
Unnecessary branches at best contaminate the branch prediction cache and at worst are predicted poorly. Removing them is always better. .NET 8 includes several new facilities for removing unneeded branches. Use of standard helper functions with argument guards often resulted in several levels of checks as each callee checks for safety because it can’t assume the caller has checked. When inlined, this results in redundant branches and dead code. Many such cases are removed now.
Additionally, some checks are folded; e.g., if (x >= 0 && y >= 0) can be safely converted into if ((x | y) >= 0).
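To see why that fold is safe, here is a minimal C# sketch of the identity (the helper name is mine, for illustration only):

```csharp
// Both tests pass only if both sign bits are clear; OR-ing the values
// preserves a set sign bit from either operand, so one comparison
// replaces two.
static bool BothNonNegative(int x, int y) => (x | y) >= 0;
```

The JIT applies this kind of transformation automatically; you don't need to write the OR form yourself.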
Finally, in many cases branches are eliminated entirely using conditional-move instructions and the like, folding both paths into a predicated operation. The cmov and csel patterns are (dare I say it) universally more economical than branching. In .NET 8, various if patterns are morphed into conditional instructions. E.g., max can be done with no branches: the JIT generates a compare and a conditional move-if-greater even if you write it with the usual ?:.
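For example, nothing special is required in the source; written the ordinary way, the JIT is free to emit the predicated form:

```csharp
// Written with the usual ?:, but the JIT can emit a compare plus a
// conditional move (cmov on x64, csel on Arm64) instead of a branch.
static int Max(int x, int y) => x > y ? x : y;
```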
Bounds Checking
Of classic importance. Some new tricks here include eliding bounds checks after a mod (%) operation, common in hash tables, as well as elision in from-the-end subscript cases like x[^1]: if the array is already known to have at least one element, that check can be elided too, and that's new.
Constant Folding
Not so much constant folding as constant propagation improvements (propagation means that in x = 3 + 4; y = x + 5;, y is computable). In the face of improved de-virtualization and inlining, it becomes interesting to flow constants from the call site, even string literals, and then flow them onward, even through if chains or switch statements. With per-call-site de-virtualization, many such opportunities have been opened up.
Non-GC Heap
This is kind of a return to my old baby, Frozen Objects. .NET 8 has a heap segment that never goes away, and it can put string literals into it. We did this years ago for NGEN and had to abandon it because of ASLR, but now this trick is back in the JIT, so it can emit constant addresses for string literals instead of loading a handle. Also, the GC doesn't have to walk the frozen segment, so that's a side benefit. There are some similar constant objects, such as RuntimeType instances and even Array.Empty<T>.
Another place this can be used is for static value types that are free of GC references. And, such objects can also have the write barrier removed when their address is stored for more gains.
Zeroing
If your function needs a lot of local space, zeroing it out can be expensive. .NET 8 can use vector instructions in an optimized memset to do the zeroing, in addition to the loop method previously used. It uses some tricksy methods to use fewer instructions; e.g., if you're zeroing 224 bytes 128 bytes at a time, you only have 96 bytes left to zero after the first write. It's cheaper to overlap the second write, so that 32 bytes are zeroed twice and you only need two vector writes.
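The arithmetic of the overlap trick can be sketched in C# using spans (the JIT does this with wide vector stores; the helper here is mine, for illustration only):

```csharp
using System;

// Sketch of the overlap trick: clear 224 bytes with two 128-byte
// stores, letting the middle 32 bytes (96..127) be zeroed twice.
static void ZeroOverlapped(Span<byte> buf) // assumes buf.Length == 224
{
    buf.Slice(0, 128).Clear();            // bytes 0..127
    buf.Slice(buf.Length - 128).Clear();  // bytes 96..223
}
```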
Value Types
Since .NET 7, value types could be split into their component fields as equivalent locals. This was significantly generalized in .NET 8, meaning that if you, for instance, copy a struct and then do some computations on some of the fields, you can get locals for just the fields you copied and compute using those locals, possibly never copying the struct at all. This is highly valuable for a big struct, or a struct with reference types in it. As it turns out, simple enumerator structs benefit from this optimization.
Casting
Some improvements when casting to, or checking the type of, sealed types and arrays of sealed types: you don't need a helper call because you can check for an exact match.
Peephole Optimizations
There's quite a list of new peeps. My favorite is multiplying by a number that's close to a power of 2 using mov, shift, and add instead of mul. There's a long list of these.
Native AOT
ASP.NET can be compiled with Native AOT (the successor to what used to be called .NET Native). You can build a JIT-free standalone application this way. Hello World dropped from 13M to 1.5M over a long series of improvements. And of course adding framework overhead to something as small as Hello World is dumb, but this gives you a sense of what the floor is now.
VM
Many improvements here, including optimizations that help delegate dispatch (anything that uses a MethodDesc, really), improvements in the allocator for executable sections, plus some changes to improve metadata lookup that are good for startup time.
GC
Server GC can have a dynamic heap count, allowing it to increase or decrease the number of dedicated threads for working on a heap (it used to be 1:1 with cores). It can dynamically adapt to application size and adjust heap overhead and parallelism.
Mono
Mono can target other runtimes, like WASM, with AOT or JIT. In .NET 8 Mono introduces a hybrid JIT/interpreted mode for this (the "jiterpreter"). Blazor WebAssembly projects benefit significantly from this. Note that WASM can run in lots of places, not just on the web (e.g., Node.js). Mono also added vectorization support for Vector128<T> and various supporting functions.
Separately, Mono wants to use native support for internationalization, which is usually present in the host (in, say, JavaScript), rather than shipping its own copy of the ICU libraries. This is an opt-in feature for now.
Threading
Mostly incremental work in this area. This was a focus area of .NET 6 and 7.
ThreadStatic
Thread-local storage is most commonly done by applying [ThreadStatic] to a static variable. Previously this required a helper call on every access. In .NET 8 the access can be inlined in many cases, resulting in tighter code. This is especially important for (e.g.) thread-local integers.
ThreadPool
Native AOT projects on Windows have the option of using the Portable Thread Pool or the Windows Thread Pool wrapper. The latter can be quite helpful if there is already Thread Pool activity in other parts of the application.
Tasks
A variety of improvements, both for cases where the task completes synchronously and where it doesn't. Task and Task<TResult> both try to give back cached Task objects; Task<bool> can always return a cached object for true or false. .NET 8 includes assorted cases where a cached value can be used, such as default values. Other commonly used value types that are often zeroed or mostly zeroed can get the same treatment as primitives; these smallish types are bitwise indistinguishable from a stored primitive value. This helps task result types like TimeSpan, DateTime, Guid, and others.
There are many improvements in this area that help with scheduling and overhead. However, my favorite new feature is the use of the System.TimeProvider abstract class in the new code. This lets you swap in a fake time source so you can test situations like one-hour timeouts without having to fake the whole task system.
Parallel
In .NET 8 we get Parallel.ForAsync. This saves you from having to create temporary objects like an Enumerable.Range just to iterate using ForEachAsync.
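A quick sketch of the new API (ProcessAsync is a hypothetical per-item work method, stubbed here for illustration):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Iterate 0..99 in parallel without materializing an Enumerable.Range.
await Parallel.ForAsync(0, 100, async (i, cancellationToken) =>
{
    await ProcessAsync(i, cancellationToken);
});

// Hypothetical work item, stubbed for the sketch.
static ValueTask ProcessAsync(int i, CancellationToken ct) => ValueTask.CompletedTask;
```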
Exceptions
.NET 8 adds many new "throw exception helpers". ThrowIfNull is the classic example, but there are many choices now (e.g., ThrowIfGreaterThan and friends). These help by creating an inlinable helper to do the check and shared code to do the throw.
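A sketch of the pattern (the method and its limits are mine, for illustration; the helpers are real .NET 8 APIs):

```csharp
using System;

static void SetTimeout(object state, int milliseconds)
{
    ArgumentNullException.ThrowIfNull(state);
    ArgumentOutOfRangeException.ThrowIfNegative(milliseconds);
    ArgumentOutOfRangeException.ThrowIfGreaterThan(milliseconds, 30_000);
    // The cheap checks inline here; the actual throws live in shared,
    // rarely-executed helper code.
}
```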
Reflection
Several improvements to get back space, but the most interesting is that MethodBase.Invoke, the dispatch worker of reflection, improves the codegen of emitted method calls. Further, repeated invocation can be accelerated with MethodInvoker and ConstructorInvoker, which save the lookup results for the MethodDesc rather than computing them on the fly every time. Super helpful for repeated invocations. Kind of like holding on to a parsed regex rather than re-parsing, only lower level.
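A small sketch of the caching pattern (the target method is arbitrary):

```csharp
using System.Reflection;

MethodInfo mi = typeof(string).GetMethod(
    nameof(string.Substring), new[] { typeof(int) })!;

// Create once; the resolved dispatch information is cached inside
// the invoker, so repeated calls skip the per-call lookup work.
MethodInvoker invoker = MethodInvoker.Create(mi);
object? result = invoker.Invoke("hello world", 6); // "world"
```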
Primitives
Enums
Enum was changed so that its underlying implementation stores an array of the specified enum type (almost always int) rather than the worst-case ulong (2x as big).
Enum's ToString and IsDefined were improved for dense enums with values 0..n, allowing a simple array lookup for value names rather than a hash table.
Numbers
Many number formatting improvements, such as writing numbers two digits at a time to do less division. Also, precomputed formats for common numbers like 0–299 that can be shared (enough for every successful HTTP code). Numbers can format directly into UTF8 spans, which turns out to be a big deal, because this can give you allocation-free result construction for many domains. TryFormat is much more general when it comes to targeting spans as output, which is invaluable.
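A minimal sketch of UTF8 number formatting via the .NET 8 IUtf8SpanFormattable support on Int32:

```csharp
using System;

// Format an int straight into a UTF-8 buffer: no string, no allocation.
Span<byte> utf8 = stackalloc byte[16];
if (404.TryFormat(utf8, out int bytesWritten))
{
    // utf8[..bytesWritten] now holds the ASCII bytes '4', '0', '4'.
}
```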
DateTime
DateTime gets dedicated routines for the most popular formats rather than one routine for all. This is a 5x improvement in some cases, so it's not nothing. Symmetrically, parsing improvements allow date scanning to proceed more quickly, including a slick "treat the month letters as if they were digits of a number" trick to map month strings to a number quickly. In this same area there were improvements in TimeZoneInfo caching for Mac and Linux.
Guid
UTF8 formatting improvements mentioned above also went into GUID.
Random
Less wasted work: an expensive modulo operation (%) is replaced with cheaper multiplication and shift, plus rejection if out of range. It turns out division still sucks in 2023.
Strings, Arrays, and Spans
UTF8
IUtf8SpanFormattable is on a lot of types: numbers of course, but also IPAddress and IPNetwork. This plus Format allows formatting or string interpolation ($"message {var:fmt}") directly into a Span. As the span could be stack-backed or backed by a byte slice, this means formatting into the most common form on the internet with no allocations. Huzzah.
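A sketch of interpolating straight into UTF-8, assuming the .NET 8 System.Text.Unicode.Utf8.TryWrite interpolated-string support:

```csharp
using System;
using System.Text.Unicode;

// Interpolate directly into a UTF-8 span; no intermediate string.
Span<byte> buffer = stackalloc byte[64];
int status = 200;
if (Utf8.TryWrite(buffer, $"HTTP status: {status}", out int written))
{
    // buffer[..written] holds the UTF-8 bytes of "HTTP status: 200".
}
```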
ASCII
Similar to the above, ASCII spans can be targeted.
Base64
Several improvements, including better handling of whitespace and vectorized base64 encoding.
Hex
Several improvements including vectorization of formatting.
String Formatting
CompositeFormat lets you "precompile" your format string even if it isn't known at compile time. You can then use that object instead of a format string to get faster formatting with less parsing overhead. These can appear pretty much anywhere a format string could appear. Note that "normal" interpolated strings get this sort of treatment automatically because their format is known at compile time.
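A small sketch (in real code you would cache the parsed format in a static readonly field; the format string here is arbitrary):

```csharp
using System.Globalization;
using System.Text;

// Parse once -- the format string could come from a resource file at
// run time -- then reuse; per-call parsing overhead is gone.
CompositeFormat greeting =
    CompositeFormat.Parse("Hello, {0}! You have {1} messages.");

string s = string.Format(CultureInfo.InvariantCulture, greeting, "Ada", 3);
// s is "Hello, Ada! You have 3 messages."
```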
Spans
In addition to lots more things targeting spans, spans also get vectorized Count and Replace methods, very common operations. This helps out other classes like StringBuilder. Also, MemoryExtensions.Split gives an allocation-free method of splitting a string into a fixed number of spans. This happens all the time... There are many vectorization helpers in .NET 8; even String.IndexOf gets love. And MemoryMarshal is there to help march through these kinds of data types with economy.
Span in all the places is a big theme in this release.
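A sketch of the allocation-free split, assuming the .NET 8 MemoryExtensions.Split overload that writes Range results into a caller-supplied buffer:

```csharp
using System;

// Allocation-free split: the field ranges land in a stack buffer.
ReadOnlySpan<char> csv = "10,20,30";
Span<Range> ranges = stackalloc Range[4];
int count = csv.Split(ranges, ',');
for (int i = 0; i < count; i++)
{
    ReadOnlySpan<char> field = csv[ranges[i]]; // "10", then "20", "30"
}
```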
SearchValues
SearchValues gives you something kind of like a compiled regex: you pay once to study a possibly large set of values, and get back a vectorized, high-speed searcher for any of those values in an input span. This can be much faster than a plain IndexOfAny, and the work of studying the search values is done only once. This comes up surprisingly often; e.g., JSON had its own IndexOfQuoteOrAnyControlOrBackSlash. It's also used by Regex (more later).
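A minimal sketch (the character set is arbitrary; in real code the SearchValues instance would live in a static readonly field so the study cost is paid once):

```csharp
using System;
using System.Buffers;

// Create once; the set is "studied" up front so searches vectorize.
SearchValues<char> needsEscaping = SearchValues.Create("\"\\\n\r\t");

ReadOnlySpan<char> text = "plain text";
bool mustEscape = text.IndexOfAny(needsEscaping) >= 0; // false here
```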
Regex
The compiled regex matching can use SearchValues to find valid starting characters more quickly. E.g., if you need something that looks like a zip code, GeneratedRegex(@"[0-9]{5}"), then you'd like to start on a digit. This can become int indexOfPos = span.Slice(i).IndexOfAnyInRange('0', '9'), which gets all the vectorization goodness. A similar approach can be used if the possible last characters of the regex are known.
There are quite a few other new tricks in this area.
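For reference, a sketch of the source-generated form of the zip-code pattern mentioned above (class name is mine):

```csharp
using System.Text.RegularExpressions;

// The matcher is emitted at build time; its leading-character scan can
// use the vectorized helpers described above.
public static partial class Zip
{
    [GeneratedRegex("^[0-9]{5}$")]
    public static partial Regex Pattern();
}

// Usage: Zip.Pattern().IsMatch("90210") is true.
```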
Hashing
Some new non-cryptographic hashes were added, including several from the popular XXH family. The new hashes XxHash3 and XxHash128 can be vectorized and so are of great interest. Some of the older CRC algorithms (Crc32 and Crc64) were also vectorized, based on an old Intel paper. This can result in huge gains compared to .NET 7.
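A quick sketch, assuming the project references the System.IO.Hashing NuGet package where these types live:

```csharp
using System.IO.Hashing;
using System.Text;

byte[] data = Encoding.UTF8.GetBytes("hello");
ulong h64 = XxHash3.HashToUInt64(data);  // 64-bit non-cryptographic hash
byte[] h128 = XxHash128.Hash(data);      // 16-byte hash
```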
Initialization
static ReadOnlySpan<T> initialization was generalized (and the work pushed into Mono as well), allowing more types to be recognized as immutable blobs of data. These do not have to go on the heap, as there is no way to get back anything like an array object from the span; instead of making an object, the compiler can store the bytes directly in the binary. Much cheaper. This was widely and transparently generalized to more types in .NET 8. The upside is that constant arrays are cheaper, and they crop up all the time.
Also of note, stackalloc into a span, as in Span<byte> buffer = stackalloc byte[8];, is a safe, heap-free allocation. The C# [InlineArray] attribute further generalizes this, letting you get an array of any value-only struct on the stack directly as a Span with no unsafe code needed.
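A minimal sketch of an inline array (the struct name and length are mine):

```csharp
using System;
using System.Runtime.CompilerServices;

Buffer8 buf = default;   // eight ints, no heap allocation
Span<int> span = buf;    // language-provided conversion to a span
span[0] = 42;

// The compiler replicates _element0's storage eight times, giving a
// fixed-size, stack-friendly array with no unsafe code.
[InlineArray(8)]
public struct Buffer8
{
    private int _element0;
}
```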
Collections
General
As discussed above, Empty
on all collections is special cased to avoid allocations. Likewise, if you ask for an enumerator on empty collections you get a shared empty enumerator.
List
List<T> significantly improves AddRange by falling back to Add rather than InsertRange, thereby preserving inlining and avoiding extra checks. Lists also get a SetCount facility (via CollectionsMarshal.SetCount) to increase their length without setting values, allowing the values to then be written by vectorized span writers. Span<char> span = CollectionsMarshal.AsSpan(list); can get you a span over a list, and span.Fill can initialize it.
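Putting those pieces together, a sketch of the grow-then-fill pattern:

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Grow the list without writing defaults, then fill its span in bulk.
var list = new List<int>();
CollectionsMarshal.SetCount(list, 100);
Span<int> span = CollectionsMarshal.AsSpan(list);
span.Fill(7); // vectorized; the list now holds one hundred 7s
```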
LINQ
LINQ can use some of the above features to create its output objects. Likewise, the internal SelectToList and RangeSelectToList paths can write directly into spans, and RepeatToList can use the fill pattern above. Enumerable.Range(1, 100).ToList() benefits from writing directly to a span and from vectorization. There are other vectorizations as well, such as Enumerable.Sum.
In .NET 8, the Order and OrderDescending operators use a stable sort, which makes them useful in series.
Dictionary
LINQ gained ToDictionary overloads that are delegate-free, so collection.ToDictionary() is there for data already in key/value pairs. Guard clauses on dictionaries have been improved with TryAdd, avoiding a double lookup.
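A quick sketch of both points:

```csharp
using System.Collections.Generic;
using System.Linq;

// Delegate-free: the source already consists of key/value pairs.
var pairs = new[] { KeyValuePair.Create("a", 1), KeyValuePair.Create("b", 2) };
Dictionary<string, int> map = pairs.ToDictionary();

// One lookup instead of a ContainsKey check plus an indexer set.
map.TryAdd("c", 3);
```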
Frozen Collections
These are useful for making collections that never change once loaded (not to be confused with immutable collections which logically create a new collection as a result of attempting to mutate). There are several internal formats, and the classes can pick an implementation based on the supplied data — which can’t change. This can give significant improvements in lookup speed and density.
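A minimal sketch of building one (the data is arbitrary; construction cost is paid once, typically at startup):

```csharp
using System.Collections.Frozen;
using System.Collections.Generic;

// The factory inspects the (unchanging) keys and picks an internal
// layout optimized for them, trading build time for faster reads.
FrozenDictionary<string, int> statusCodes = new Dictionary<string, int>
{
    ["ok"] = 200,
    ["notfound"] = 404,
}.ToFrozenDictionary();

int code = statusCodes["ok"]; // 200
```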
Immutable Collections
System.Runtime.InteropServices.ImmutableCollectionsMarshal provides something non-brittle for efficiently extracting data from an immutable collection. Construction of these collections can also proceed from a read-only span, allowing creation with no allocation overhead for the arguments.
Again, Span in all the things is a common theme in .NET 8.
BitArray
Adds vectorized HasAllSet and HasAnySet.
Collection Expressions
Lots of new construction syntax that is easier to use and potentially less code because of clear intent. E.g., List<int> list = [1, 2, 3]; can be optimized better than list = new List<int>() followed by Add(1); Add(2); Add(3);. The compiler is free to use spans and constant spans, as mentioned above, to insert the items in bulk.
File I/O
There are many changes in this area, everything from async handling to improvements in File.Copy.
Networking
Networking Primitives
Improved performance of IPAddress storage. An IPv6 address is the perfect size for a single 128-bit vector copy, and moving the address from an array to a span allows vector copies. Likewise, endianness can be fixed up with vector instructions. As mentioned, IPAddress formatting supports Span targets for big memory savings.
Sockets
Buffers passed to the socket methods get less aggressive pinning; the GCHandle only needs to be held during the call. Various improvements in the UDP stack reduced the allocations required on most calls. Plus, span-friendly send and receive functions were added to avoid having to allocate byte arrays just to send information you already have in perfectly good buffers. These APIs also work on a SocketAddress directly, which removes the overhead of converting from EndPoint that was in the old API.
TLS
SslStream gets reduced allocations, some of which looked pretty big, plus several other smaller things.
HTTP
Lots of changes in HttpClient, including making use of some of the lower-level things discussed above for better parsing of HTTP. There are a dozen or so changes in this area, each of which gives some benefit, but really the biggest benefits will come from payload creation and parsing using the new Span facilities.
JSON
JSON serializers end up in a lot of critical places. The output was tuned to be of high quality in Native AOT builds, where it is often found. Like many other cases, the serializer can be invoked to write JSON directly into a span. Plus, the generated source code benefits from the constant-list optimizations mentioned previously.
Cryptography
.NET 8 switches RSA ephemeral operations over to bcrypt.dll rather than ncrypt.dll, avoiding an RPC to lsass.exe. Previously ncrypt.dll was used for both persisted and ephemeral operations because it can do both. The results are significant. The cases that could not be changed still save some RPC calls by caching invariants like key length. Some improvements to AsnReader and AsnWriter were added so that they know about the most common object identifiers (OIDs), which makes them faster. Interestingly, some of the new List patterns helped with this.
Logging
In many cases (e.g., ASP.NET) loggers are cached so that LoggerFactory.CreateLogger can return an existing logger. However, there was much contention on that cache. This was changed to use ConcurrentDictionary<TKey,TValue>, whose reads are lock-free. The same change could be made in several other parts of the framework. Logging also benefited from the CompositeFormat work above, which reduced allocations in many paths.
Configuration
There is a new source generator for configuration in .NET 8 which avoids expensive reflection costs replacing them with custom binding based on examination of the shapes at build time. This can result in drastic reductions because reflection brings in heaps of otherwise cold metadata.
It accomplishes this with some new magic that allows it to replace the general Bind call at build time -- which would be a treatise in itself, but that's C# interceptors. The net of all of this is that a ton of dynamic information can be pre-computed, leaving nothing to guess at run time.
Conclusion
I’ve editorialized enough already. Your conclusions are going to vary based on your needs but hopefully this summary of the summary will be helpful to you.