Performance Improvements in .NET 8

Rico Mariani
14 min read · Sep 25, 2023


This is a summary of the excellent and lengthy document by Stephen Toub.

In the interest of making it easier to find the original source material I have (nearly) reproduced the original outline with links to the original document, and I will add some summary notes in each section. Keep in mind that this is somewhat of an opinion piece because naturally I'm going to talk about the things I think are the most exciting through the lens of what I'm working on at the moment. YMMV, and I don't want people to think their work is unimportant just because it doesn't happen to align with what I think is important at this particular moment.

Also, importantly, in the interest of space I am not reproducing the benchmarks from the original. Most of these gains are quantified there, so click through to the base document for more information.

Note: There is always some chance I've misunderstood something in some part of the original document. If in doubt, click through; the original is of course authoritative.

JIT

JIT investments are enormous in .NET 8, and broadly you can think of them as doing the kinds of things you would expect a high-quality dynamic-language JIT (e.g., for JavaScript) to do. Aggressive de-virtualization allows huge simplifications in codegen on the hot branches. By itself this technology could justify moving to .NET 8 for any server workload. It's less interesting for code that will only ever run once.

Tiering and Dynamic PGO

This is where the big magic is happening. Long-running methods get On Stack Replacement (yes, swapping the code out while it's still running). A later recompilation can assume a static readonly field has been initialized, so it's now a constant to the JIT. Dynamic Profile Guided Optimization is on by default: the first optimized version generated includes instrumentation, and the method is regenerated after sampling (reservoir style) to make inferences about the types that appear most commonly in the execution. Type-based de-virtualization can (and does) result in inlining when possible, drastically reducing the costs of interfaces and delegates. Just the slick tricks used to count calls with low interlocked overhead are worth a paper of their own, but the results here can be staggering, especially because general-purpose methods you didn't write can benefit from optimizations based on how they are being used in your application. This has the possibility of being better than you can do with static PGO and a great perf lab, regardless of how good your ahead-of-time compilation tech is. Note that often only a small fraction of your code needs this level of optimization, as the most common case is still that a method runs zero or one times. The JIT also uses static analysis to guess which paths are likely to be hot; even the old NGEN used to do this (e.g., paths that throw are cold). Note also that de-virtualization checks can be hoisted out of loops, and with loop cloning you can get a fast-path loop and a normal-path loop.
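
Here's a minimal sketch (my own types and names, not from the original) of the kind of code dynamic PGO helps: a hot interface call that profiling can de-virtualize and inline, plus a static readonly field that a later tier can treat as a constant.

```csharp
using System;

public interface IShape { double Area(); }

public sealed class Circle : IShape
{
    public double Radius;
    public double Area() => Math.PI * Radius * Radius;
}

public static class PgoDemo
{
    // After the class initializer has run, a later recompilation can treat this
    // value as a JIT-time constant and fold the branch below away.
    static readonly bool UseFastPath = Environment.ProcessorCount > 1;

    public static double SumAreas(IShape[] shapes)
    {
        double total = 0;
        foreach (IShape s in shapes)
        {
            // If profiling shows the shapes are almost always Circle, dynamic PGO
            // can emit a guarded type check for Circle and inline Circle.Area on
            // that path, falling back to the interface call only for other types.
            total += s.Area();
        }
        return UseFastPath ? total : Math.Round(total);
    }
}
```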

Vectorization

This has been an ongoing theme in .NET, even as far back as .NET Core 3.0. .NET 8 has thousands of intrinsics for operating on Vector128<T>, Vector256<T>, and Vector512<T>. These are hardware accelerated where possible, but still portable. These operations are especially of interest in the computation of complex hashes. Note that ARM64 has different hardware accelerations than x64, but the vector operations are accelerated on ARM as well.
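
As a rough illustration of the portable vector APIs (my own example, not from the original), here's a Vector128-based sum with a scalar tail; the same shape extends to Vector256/Vector512 where the hardware supports wider registers.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class VectorDemo
{
    // A minimal portable sum using Vector128<int>.
    public static int Sum(ReadOnlySpan<int> values)
    {
        int i = 0;
        int total = 0;
        if (Vector128.IsHardwareAccelerated && values.Length >= Vector128<int>.Count)
        {
            var acc = Vector128<int>.Zero;
            ref int start = ref MemoryMarshal.GetReference(values);
            for (; i <= values.Length - Vector128<int>.Count; i += Vector128<int>.Count)
            {
                acc += Vector128.LoadUnsafe(ref start, (nuint)i);
            }
            total = Vector128.Sum(acc);
        }
        for (; i < values.Length; i++) total += values[i]; // scalar tail
        return total;
    }
}
```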

Branching

Unnecessary branches at best contaminate the branch prediction cache and at worst are predicted poorly. Removing them is always better. .NET 8 includes several new facilities for removing unneeded branches. Use of standard helper functions with argument guards often resulted in several levels of checks as each callee checks for safety because it can’t assume the caller has checked. When inlined, this results in redundant branches and dead code. Many such cases are removed now.

Additionally, some cases are folded, e.g., if (x >= 0 && y >= 0) can be safely converted into if ((x|y) >= 0).

Finally, in many cases branches are eliminated entirely using conditional move instructions and the like, folding both paths into a predicated operation. The cmov and csel patterns are (dare I say it) universally more economical than branching. In .NET 8 various if patterns are morphed into conditional instructions; e.g., max can be done with no branches: even if you write it with the usual ?:, it generates a compare and a conditional-move-if-greater.
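
A tiny sketch of the shapes involved (illustrative only): the first method's two sign checks can be folded into one test, and the second typically compiles to a compare plus conditional move rather than a branch.

```csharp
public static class BranchDemo
{
    // The JIT can fold the two sign checks into a single test on (x | y),
    // since the OR is negative exactly when either operand is negative.
    public static bool BothNonNegative(int x, int y) => x >= 0 && y >= 0;

    // Written with the usual ?:, but on x64/ARM64 this typically becomes a
    // compare plus a conditional move/select rather than a branch.
    public static int Max(int x, int y) => x > y ? x : y;
}
```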

Bounds Checking

Of classic importance, some new tricks here include eliding bounds checks after a mod (%) operation, which is common in hash tables, as well as elision in reverse-subscript cases like x[^1]: if the array is already known to have at least one element, that check can be elided too, and that's new.
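
An illustrative sketch (hypothetical names) of the hash-table pattern where the mod makes the index provably in range:

```csharp
public static class BoundsDemo
{
    // Because hash % buckets.Length is provably within [0, buckets.Length) for a
    // non-empty array, the JIT can elide the bounds check on the element access.
    public static int GetBucket(int[] buckets, uint hash)
        => buckets[hash % (uint)buckets.Length];

    // Once the array is known to have at least one element (checked here),
    // reading the last element can also skip a redundant bounds check.
    public static int Last(int[] values)
        => values.Length >= 1 ? values[values.Length - 1] : 0;
}
```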

Constant Folding

Not so much constant folding as constant propagation improvements (propagation means that in x = 3 + 4; y = x + 5; y is computable at compile time). In the face of improved de-virtualization and inlining, it becomes interesting to flow constants from the call site, even string literals, and then propagate them through if chains or switch statements. With per-call-site de-virtualization many such opportunities have opened up.

Non-GC Heap

This is kind of a return to my old baby Frozen Objects. .NET 8 has a heap segment that never goes away, and it can put string literals into it. We did this years ago for NGEN and had to abandon it because of ASLR, but now this trick is back: the JIT can emit constant addresses for string literals instead of loading a handle. Also, the GC doesn't have to walk the frozen segment, so that's a side benefit. There are some similar constant objects such as RuntimeType and even Array.Empty<T>.

Another place this can be used is for statics of value types that are free of GC references. Such objects can also have the write barrier removed when their address is stored, for more gains.

Zeroing

If your function needs a lot of local space, zeroing it out can be expensive. .NET 8 can use vector instructions in an optimized memset in addition to the loop method previously used. It uses some tricksy methods to get fewer instructions; e.g., if you're zeroing 224 bytes 128 bytes at a time, you only have 96 bytes left after the first write. It's cheaper to overlap the second write, so that 32 bytes are zeroed twice and you only need two vector writes.

Value Types

Since .NET 7, value types could be split into their component fields as equivalent locals. This was significantly generalized in .NET 8, meaning that if you, for instance, copy a struct and then do some computations on some of the fields, you can get locals for just the fields you used and compute with those locals, possibly never copying the struct at all. This is highly valuable for a big struct, or a struct with reference types in it. As it turns out, simple enumerator structs benefit from this optimization.
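
A small sketch (my own struct, purely illustrative) of the pattern this helps:

```csharp
public struct BigStruct
{
    public long A, B, C, D;
    public string Name;
}

public static class PromotionDemo
{
    // The local copy is only used for two of its fields; the generalized struct
    // promotion in .NET 8 can keep just A and B as locals/registers and may never
    // materialize the full copy at all.
    public static long SumAB(in BigStruct source)
    {
        BigStruct copy = source;
        return copy.A + copy.B;
    }
}
```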

Casting

There are some improvements when casting to, or checking the type of, sealed types and sealed array types. You don't need a helper call because you can check for an exact match.

Peephole Optimizations

There's quite a long list of new peephole optimizations. My favorite is multiplying by a number that's close to a power of 2 using mov, shift, and add instead of mul.

Native AOT

ASP.NET can be compiled with Native AOT (this used to be called .NET Native?). You can build a JIT free standalone application this way. Hello World dropped from 13M to 1.5M. This is a long series of improvements. And of course adding framework overhead to something as small as Hello World is dumb, but this gives you a sense of what the floor is now.

VM

Many improvements here including optimizations that help delegate dispatch (anything that uses the MethodDesc really). Improvements in the allocator for executable sections. Plus some changes to improve metadata lookup that are good for startup time.

GC

Server GC can now have a dynamic heap count, allowing it to increase or decrease the number of heaps and the dedicated threads working on them (it used to be 1:1 with cores). It can dynamically adapt to application load, adjusting the trade-off between heap overhead and parallelism.

Mono

Mono can target other runtimes, like WASM, with AOT or a JIT. In .NET 8 Mono introduces a hybrid JIT/interpreted mode for this (the "jiterpreter"). Blazor WebAssembly projects benefit significantly from it. Note that WASM can run in lots of places, not just on the web (e.g., Node.js). Mono also added support for vectorization with Vector128<T> and various supporting functions.

Separately, Mono wants to use native support for internationalization, which is usually present in the host (say, in JavaScript), rather than shipping its own copy of the ICU libraries. This is an opt-in feature for now.

Threading

Mostly incremental work in this area. This was a focus area of .NET 6 and 7.

ThreadStatic

Thread-local storage is most commonly done by applying [ThreadStatic] to a static variable. In .NET this has required a helper call on access. In .NET 8 the access can be inlined in many cases, resulting in tighter code. This is especially important for (e.g.) thread-local integers.
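
A minimal sketch of the pattern (illustrative names):

```csharp
using System;

public static class ThreadStaticDemo
{
    // Each thread sees its own copy; in .NET 8 the TLS access below can often be
    // inlined instead of going through a runtime helper call.
    [ThreadStatic]
    private static int t_operationCount;

    public static void RecordOperation() => t_operationCount++;

    public static int OperationsOnThisThread => t_operationCount;
}
```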

ThreadPool

Native AOT projects on Windows have the option of using the Portable Thread Pool or the Windows Thread Pool wrapper. The latter can be quite helpful if there is already Thread Pool activity in other parts of the application.

Tasks

There is a variety of improvements, both for cases where the task completes synchronously and where it doesn't. Task and Task<TResult> both try to give back cached Task objects; Task<bool> can always return a cached object for true or false. .NET 8 adds assorted cases where a cached value can be used, such as default values. Other commonly used value types that are often zeroed or mostly zeroed can get the same treatment as primitives, since these smallish types are bitwise indistinguishable from a stored primitive value. This helps Task<TimeSpan>, Task<DateTime>, Task<Guid>, and others.

There are many improvements in this area that help with scheduling and overhead. However, my favorite new feature is the use of the new System.TimeProvider abstract class in this code. It lets you swap in a fake time source so you can test situations like one-hour timeouts without having to fake the whole task system.
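
A sketch of what that looks like, assuming the .NET 8 Task.Delay overloads that accept a TimeProvider and, for tests, a fake clock such as FakeTimeProvider from the Microsoft.Extensions.TimeProvider.Testing package (the method itself is made up):

```csharp
using System;
using System.Threading.Tasks;

public static class TimeoutDemo
{
    // Production code passes TimeProvider.System; a test swaps in a fake clock
    // and advances it by an hour instantly instead of actually waiting.
    public static async Task<bool> CompletesWithinAnHourAsync(Task work, TimeProvider time)
    {
        Task timeout = Task.Delay(TimeSpan.FromHours(1), time);
        return await Task.WhenAny(work, timeout) == work;
    }
}
```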

Parallel

In .NET 8 we get Parallel.ForAsync. This saves you from having to create temporary objects like an Enumerable.Range just to iterate using ForEachAsync.
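
A small illustration (the URL and names are hypothetical):

```csharp
using System.Net.Http;
using System.Threading.Tasks;

public static class ForAsyncDemo
{
    // Iterates 0..99 in parallel without allocating an Enumerable.Range just to
    // feed ForEachAsync.
    public static Task DownloadPagesAsync(HttpClient client)
        => Parallel.ForAsync(0, 100, async (i, cancellationToken) =>
        {
            // Hypothetical endpoint used purely for illustration.
            await client.GetStringAsync($"https://example.com/page/{i}", cancellationToken);
        });
}
```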

Exceptions

.NET 8 adds many new "throw helpers". ThrowIfNull is the classic example, but there are many more choices now (e.g., ThrowIfGreaterThan and friends). These help by providing an inlineable helper to do the check and shared, out-of-line code to do the throw.
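
For example (the surrounding method is made up; the guard helpers are the ones provided by the framework):

```csharp
using System;

public static class GuardDemo
{
    public static void SetRetryCount(object options, int retries)
    {
        // Each check is a small inlineable static helper; the actual throw lives
        // in a shared, non-inlined path so the happy path stays compact.
        ArgumentNullException.ThrowIfNull(options);
        ArgumentOutOfRangeException.ThrowIfNegative(retries);
        ArgumentOutOfRangeException.ThrowIfGreaterThan(retries, 10);
        // ... use options and retries ...
    }
}
```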

Reflection

There are several improvements that win back space, but the most interesting is that MethodBase.Invoke, the dispatch worker of reflection, gets improved codegen for its emitted method calls. Further, repeated invocation can be accelerated with MethodInvoker and ConstructorInvoker, which save the lookup results (the MethodDesc work) rather than computing them on the fly every time. Super helpful for repeated invocations; kind of like holding on to a parsed regex rather than re-parsing, only lower level.
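
A sketch of the repeated-invocation pattern (names are mine):

```csharp
using System;
using System.Reflection;

public static class InvokerDemo
{
    // When the same method will be invoked many times, creating a MethodInvoker
    // once caches the dispatch setup instead of redoing it on every Invoke.
    public static void InvokeRepeatedly(object target, string methodName, object arg, int times)
    {
        MethodInfo method = target.GetType().GetMethod(methodName)!;
        MethodInvoker invoker = MethodInvoker.Create(method);
        for (int i = 0; i < times; i++)
        {
            invoker.Invoke(target, arg);
        }
    }
}
```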

Primitives

Enums

Enum was changed so that its underlying implementation stores an array of the enum's underlying type (almost always int) rather than the worst-case ulong (twice as big).

Enum's ToString and IsDefined were improved for dense enums with values 0..n, allowing a simple array lookup for value names rather than a hash table.

Numbers

There are many number-formatting improvements, such as writing numbers two digits at a time to do less division. There are also precomputed formats for common numbers like 0–299 that can be shared (enough for every successful HTTP code). Numbers can format directly into UTF-8 spans, which turns out to be a big deal because it gives you allocation-free result construction in many domains. TryFormat is much more general when it comes to targeting spans as output, which is invaluable.
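
A tiny illustration of UTF-8 number formatting (the method is hypothetical):

```csharp
using System;

public static class Utf8NumberDemo
{
    // Formats an int directly as UTF-8 bytes, with no intermediate string and no
    // allocation; int implements IUtf8SpanFormattable in .NET 8.
    public static int WriteStatusCode(Span<byte> destination, int statusCode)
    {
        return statusCode.TryFormat(destination, out int bytesWritten)
            ? bytesWritten
            : -1; // destination too small
    }
}
```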

DateTime

DateTime gets dedicated routines for the most popular formats rather than one routine for all. This is a 5x improvement in some cases, so it's not nothing. Symmetrically, parsing improvements allow date scanning to proceed more quickly, including a slick "treat the month characters as if they were a number" trick to map month strings to a number quickly. In the same area there were improvements in TimeZoneInfo caching for Mac and Linux.

Guid

UTF8 formatting improvements mentioned above also went into GUID.

Random

There is less wasted work: an expensive modulo operation (%) is replaced with cheaper multiplication and shift, plus rejection if out of range. It turns out division still sucks in 2023.

Strings, Arrays, and Spans

UTF8

IUtf8SpanFormattable is on a lot of types: numbers of course, but also IPAddress and IPNetwork. This, plus the corresponding format and string interpolation support ($"message {var:fmt}"), allows formatting directly into a Span<byte>. As the span could be stack-backed or backed by a byte slice, this means formatting into the most common form on the internet with no allocations. Huzzah.
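
A sketch of interpolating straight into UTF-8, using Utf8.TryWrite from System.Text.Unicode (the surrounding method and message are made up):

```csharp
using System;
using System.Net;
using System.Text.Unicode;

public static class Utf8InterpolationDemo
{
    // Interpolates straight into a UTF-8 buffer (stack or pooled), so building
    // wire-format text doesn't allocate an intermediate string.
    public static bool TryWriteGreeting(Span<byte> destination, IPAddress peer, int port, out int written)
        => Utf8.TryWrite(destination, $"Connected to {peer}:{port}", out written);
}
```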

ASCII

Similar to the above, ASCII spans can be targeted.

Base64

Several improvements, including better handling of whitespace and vectorized base64 encoding.

Hex

Several improvements including vectorization of formatting.

String Formatting

CompositeFormat lets you "precompile" your format string even if it isn't known at compile time. You can then use that object instead of a format string to get faster formatting with less parsing overhead. These can appear pretty much anywhere a format string could appear. Note that “normal” interpolated strings get this sort of treatment automatically because their format is known at compile time.
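
For example (illustrative format string and names):

```csharp
using System;
using System.Globalization;
using System.Text;

public static class CompositeFormatDemo
{
    // The format string is parsed once up front (it could just as well come from
    // a resource file at runtime) and then reused without re-parsing on each call.
    private static readonly CompositeFormat s_greeting =
        CompositeFormat.Parse("Hello {0}, you have {1} new messages.");

    public static string Render(string name, int count)
        => string.Format(CultureInfo.InvariantCulture, s_greeting, name, count);
}
```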

Spans

In addition to lots more things targeting spans, Span also gets vectorized Count and Replace methods. Very common operations. This helps out other classes like StringBuilder. Also MemoryExtensions.Split gives an allocation-free method of splitting a string into a fixed number of spans. This happens all the time... There are many vectorization helpers in .NET 8 even String.IndexOf gets love. And MemoryMarshal is there to help march through these kinds of data types with economy.
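
A sketch of the allocation-free split (the parsing helper is hypothetical):

```csharp
using System;

public static class SpanSplitDemo
{
    // Splits "key=value" style input into ranges without allocating substrings
    // or an array of parts.
    public static bool TryParse(ReadOnlySpan<char> line, out ReadOnlySpan<char> key, out ReadOnlySpan<char> value)
    {
        Span<Range> parts = stackalloc Range[3];
        int count = line.Split(parts, '=');
        if (count == 2)
        {
            key = line[parts[0]];
            value = line[parts[1]];
            return true;
        }
        key = value = default;
        return false;
    }
}
```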

Span in all the places is a big theme in this release.

SearchValues

SearchValues gives you something kind of like a compiled regex: you create, once, a vectorized, high-speed (bloom-filter-style) search for any of a possibly large set of values, and then apply it to input spans. This can be much faster than a plain IndexOfAny, and the work to study the search values is done only once. This stuff comes up surprisingly often; e.g., JSON had its own IndexOfQuoteOrAnyControlOrBackSlash. It's also used by Regex (more later).
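
A small example of the pattern (the character set here is just illustrative):

```csharp
using System;
using System.Buffers;

public static class SearchValuesDemo
{
    // The set is analyzed once at startup; IndexOfAny then uses a vectorized
    // membership test instead of checking each candidate character in turn.
    private static readonly SearchValues<char> s_needsEscaping =
        SearchValues.Create("\"\\\b\f\n\r\t");

    public static int FirstCharToEscape(ReadOnlySpan<char> text)
        => text.IndexOfAny(s_needsEscaping);
}
```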

Regex

The compiled regex matching can use SearchValues to find valid starting characters more quickly. E.g., if you need something that looks like a zip code, GeneratedRegex(@"[0-9]{5}"), then you'd like to start on a digit. This can become int indexOfPos = span.Slice(i).IndexOfAnyInRange('0', '9'), which gets all the vectorization goodness. A similar approach can be used if the possible last characters of the regex are known.

There are quite a few other new tricks in this area.

Hashing

Some new non-cryptographic hashes were added including several from the popular XXH family. The new hashes XxHash3 and XxHash128 can be vectorized and so are of great interest. Some of the older CRC algorithms (Crc32 and Crc64) were also vectorized based on an old Intel paper. This can result in huge gains compared to .NET 7.

Initialization

static ReadOnlySpan<T> data was generalized (and the work was pushed into Mono as well), allowing more types to be recognized as immutable blobs of data. These do not have to go on the heap, as there is no way to get back anything like an array object from the span, so instead of making an object the compiler can store the bytes directly in the binary. Much cheaper. This was widely generalized to more types, transparently, in .NET 8. The upshot is that constant arrays are cheaper, and they crop up all the time.

Also of note, stackalloc into a span, Span<byte> buffer = stackalloc byte[8];, is a safe heap-free allocation. The C# [InlineArray] attribute further generalizes this, allowing you to get a fixed-size array of any value-only struct on the stack, usable directly as a Span, with no unsafe code needed.
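
A sketch of an inline array used as a span (the type and names are mine):

```csharp
using System;
using System.Runtime.CompilerServices;

// Declares a fixed-size buffer of 8 ints with value semantics; no heap allocation
// and no unsafe code.
[InlineArray(8)]
public struct EightInts
{
    private int _element0;
}

public static class InlineArrayDemo
{
    public static int SumOfSquares()
    {
        EightInts buffer = default;
        Span<int> span = buffer;          // inline arrays convert to spans in C# 12
        for (int i = 0; i < span.Length; i++) span[i] = i * i;

        int total = 0;
        foreach (int value in span) total += value;
        return total;
    }
}
```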

Collections

General

As discussed above, Empty on all collections is special cased to avoid allocations. Likewise, if you ask for an enumerator on empty collections you get a shared empty enumerator.

List

List significantly improves AddRange to fall back to Add rather than InsertRange, thereby preserving inlining and avoiding extra checks. List also gets a SetCount facility (via CollectionsMarshal) to increase its length without setting values, allowing the values to then be written by vectorized span writers. CollectionsMarshal.AsSpan(list) can get you a span over a list's contents, and span.Fill can initialize it.
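
Putting those pieces together (an illustrative helper; I'm assuming the count-setting API is CollectionsMarshal.SetCount):

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

public static class ListSpanDemo
{
    // Grows the list's count without writing each element through Add, then
    // bulk-fills the backing storage through a span.
    public static List<int> CreateFilled(int count, int value)
    {
        var list = new List<int>();
        CollectionsMarshal.SetCount(list, count);
        Span<int> span = CollectionsMarshal.AsSpan(list);
        span.Fill(value);
        return list;
    }
}
```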

LINQ

LINQ can use some of the above features to create its output objects. Likewise, the specialized SelectToList and RangeSelectToList paths can write directly into spans, and RepeatToList can use the fill pattern above. Enumerable.Range(1, 100).ToList() benefits from writing directly to a span and from vectorization. There are other vectorizations too, such as Enumerable.Sum.

In .NET 8, the Order and OrderDescending operators use a stable sort, which makes them useful in series.

Dictionary

LINQ gained ToDictionary overloads. These are delegate-free, so collection.ToDictionary() is there for data that is already in key/value pairs. Guarded dictionary inserts have also been improved with TryAdd, avoiding a double lookup.

Frozen Collections

These are useful for making collections that never change once loaded (not to be confused with immutable collections which logically create a new collection as a result of attempting to mutate). There are several internal formats, and the classes can pick an implementation based on the supplied data — which can’t change. This can give significant improvements in lookup speed and density.
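
A minimal example (the contents are made up):

```csharp
using System.Collections.Frozen;
using System.Collections.Generic;

public static class FrozenDemo
{
    // Construction pays extra cost to pick an internal layout suited to these
    // exact keys; lookups afterwards are faster than a regular Dictionary.
    private static readonly FrozenDictionary<string, int> s_httpStatusCodes =
        new Dictionary<string, int>
        {
            ["OK"] = 200,
            ["NotFound"] = 404,
            ["InternalServerError"] = 500,
        }.ToFrozenDictionary();

    public static bool TryGetCode(string name, out int code)
        => s_httpStatusCodes.TryGetValue(name, out code);
}
```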

Immutable Collections

System.Runtime.InteropServices.ImmutableCollectionsMarshal provides something non-brittle for efficiently extracting data from an immutable collection. Construction of these can also proceed from a read-only span, allowing creation with no allocation overhead for the arguments.

Again, Span in all the things is a common theme in .NET 8.

BitArray

Adds vectorized HasAllSet and HasAnySet.

Collection Expressions

There are lots of new construction forms that are easier to use and potentially less code because the intent is clear; e.g., List<int> list = [1, 2, 3]; can be optimized better than new List<int>() followed by Add(1), Add(2), Add(3). The compiler is free to use spans and constant spans, as mentioned above, to insert the items in bulk.
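
A couple of illustrative forms (my own examples):

```csharp
using System;
using System.Collections.Generic;

public static class CollectionExpressionDemo
{
    // C# 12 collection expressions: the compiler sees the full contents up front,
    // so it can size the collection exactly and copy the elements in bulk.
    public static readonly int[] SmallPrimes = [2, 3, 5, 7, 11];

    public static List<int> Combine(List<int> extras)
    {
        // The spread element copies 'extras' in one operation rather than item by item.
        List<int> result = [0, 1, .. extras];
        return result;
    }
}
```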

File I/O

There are many changes in this area, everything from async handling to improvements in File.Copy.

Networking

Networking Primitives

Improved performance of IPAddress storage. An IPv6 address is the perfect size for a single 128-bit vector copy, and moving the address from an array to a span allows vector copies. Likewise, endianness can be fixed up with vector instructions. As mentioned, IPAddress formatting supports span targets, for big memory savings.

Sockets

Buffers passed to the socket methods get less aggressive pinning; a GCHandle only needs to be held during the call. Various improvements in the UDP stack reduced the number of allocations required on most calls. Span-friendly send and receive functions were added to avoid having to allocate byte arrays just to send data you already have in perfectly good buffers. Plus these APIs work on a SocketAddress directly, which removes the overhead of converting from EndPoint that was in the old API.

TLS

SslStream gets reduced allocations, some of which looked pretty big. Plus several other smaller things.

HTTP

Lots of changes in HttpClient including making use of some of the lower level things discussed above for better parsing of HTTP. There are a dozen or so changes in this area each of which gives some benefit but really the biggest benefits will come from payload creation and parsing using the new Span things.

JSON

JSON serializers end up in a lot of critical places. The output was tuned to be of high quality in Native AOT builds, where it is often used. Like many other cases, a serializer can be invoked to write JSON directly into a span. Plus the generated source code benefits from the constant-list optimizations mentioned previously.

Cryptography

.NET 8 switches RSA ephemeral operations over to bcrypt.dll rather than ncrypt.dll, avoiding an RPC to lsass.exe. Previously ncrypt.dll was used for both persisted and ephemeral operations because it can do both. The results are significant. The cases that could not be changed still save some RPC calls by caching invariants like key length. Some improvements to AsnReader and AsnWriter were added so that they know about the most common object identifiers (OIDs), which makes them faster. Interestingly, some of the new List patterns helped with this.

Logging

In many cases (ASP.NET) loggers are cached so that LoggerFactory.CreateLogger can return an existing logger. However, there was a lot of contention on the cache. This was changed to use ConcurrentDictionary<TKey,TValue>, whose reads are lock-free. There were several other places in the framework where this could be done. Logging also benefited from CompositeFormat (above), which reduced allocations in many paths.

Configuration

There is a new source generator for configuration in .NET 8 which avoids expensive reflection, replacing it with custom binding code based on examining the shapes of the types at build time. This can result in drastic reductions because reflection brings in heaps of otherwise cold metadata.

It accomplishes this with some new magic that allows it to replace the general Bind call at build time (C# interceptors, which would be a treatise in themselves). The net of all of this is that a ton of dynamic information can be precomputed, leaving nothing to guess at run time.

Conclusion

I’ve editorialized enough already. Your conclusions are going to vary based on your needs but hopefully this summary of the summary will be helpful to you.


Rico Mariani

I’m an Architect at Microsoft; I specialize in software performance engineering and programming tools.