Performance Improvements in .NET 9

Rico Mariani
Sep 21, 2024


Last year I wrote a summary of Stephen Toub’s Performance Improvements in .NET 8. That was so popular we’ve decided to do it again. Like before, this is going to be personally biased, at least somewhat, towards things I happen to care the most about — but that can’t be helped. Also like before, I have cross-linked to Stephen’s full article which has omg way more details on each item. The structure of this summary exactly parallels Stephen’s new article: Performance Improvements in .NET 9.

[Note: there are many direct quotes from the article, and I have removed internal quotations for clarity as a matter of course. Use the title links to find the original quotes in full.]

JIT

Nobody should be surprised that this is going to be the area I am most excited about. The opportunity for code quality improvement usually is the biggest attack vector for performance and this year doesn’t disappoint. Many of the themes from .NET 8 get additional love, including the multi-stage code generation features.

Fully half of the content is JIT oriented, with some cases more mainstream than others, but a recurring theme here is, again, dynamic compilation. Both in terms of reacting to actual workloads — code generation for what is hot — but also by targeting fancy processor features when they are available. It can be very challenging to do the latter, so you often get lowest-common-denominator code generation. We don’t want any of that in 2024.

PGO

“PGO is a feature that enables the JIT to profile code and use what it learns from that profiling to help it generate more efficient code based on the exact usage patterns of the application” — this is a trick well known in the JavaScript world where everything is dynamic, and this approach is the only way to get decent code quality, but it works for typed languages too. The general idea is that having found important code and then observed the code for a while, the JIT can do a better job given another try. It’s important to not do this universally because in many cases the effort of trying to make better code takes longer than just running the dumb code would have.

PGO now gains the ability to track type info, so (T)obj and obj is T both have a chance to be optimized.
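
To make that concrete, here is a hedged sketch of the kind of source that can now benefit (my example, not Stephen’s). An interface type test normally calls a runtime helper; with type data, PGO can guard on the single concrete type it observed and take a fast path:

    using System.Collections;

    static int CountOf(object o)
    {
        // 'is' against an interface normally calls a cast helper. With
        // type-profile PGO, if 'o' is nearly always (say) int[], the JIT
        // can emit a cheap exact-type check for int[] first and only fall
        // back to the general helper when the guess misses.
        if (o is ICollection c)
            return c.Count;
        return 1;
    }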

Additionally, common integer values may be observed, resulting in inline code for the most common case. Sounds not that exciting, but forwarding those constants to helper methods like Memmove, CopyTo and SequenceEqual means that the best code for those methods can be generated.

Using one feature to light up other features is a common theme in .NET 9.

Tier 0

Tier 0 code is the initial, simplest, code generation. The JIT is most concerned with getting some code out there, because in many cases the code only runs once and thinking a long time equals “I could have been done already”. However, there are some optimizations that are so good you just gotta do them. To wit: ArgumentNullException.ThrowIfNull is used all over the place for argument validation. In some cases there would be boxing just for asking the question “is it null?” — bizarre since a boxed value can’t be null. The JIT special-cases ThrowIfNull to avoid this now. Boxing bad. Less boxing good. Similar cases avoided boxing in CreateSpan and many async/await paths.
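
For instance, a generic helper like this used to box its argument when T was a value type (a sketch of the pattern, not the library’s actual code):

    static void Process<T>(T value)
    {
        // When T is a value type this call boxed 'value' just to ask
        // "is it null?" even though a freshly boxed struct can never be
        // null; the JIT now special-cases the call and skips the box.
        ArgumentNullException.ThrowIfNull(value);
        // ... use value ...
    }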

Loops

Loops uh loop so they are important. Downcounting loops create better assembly, so rewriting upcount loops to downcount can be a win even if you need to burn another register. But even better if the loop index variable could go either way. Or if a strength reduction can give you a pointer that marches through an array with no costly shift/add for your trouble. More registers sounds bad, but escape analysis can tell you that eax maybe can be used and it’s always scratch. And an extra inc might easily pay for itself — so the JIT learns some new tricks here. If you’re wondering what a strength reduction looks like, maybe it converts an address computation like rax+rsi*4+10 into a pointer that we simply increment as we go along. We call it strength reduction because it reduces a multiply to an add. If we do that we don’t need the upcounting index (in rsi in this example). Better loop, less math. Is this a trick question?
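
If the assembly-speak is too dense, here is a minimal sketch of the kind of loop this targets (my example):

    static int Sum(int[] a)
    {
        int sum = 0;
        // Naively each iteration computes base + i*4 to address a[i].
        // Strength reduction replaces that multiply with a pointer that
        // advances by 4 each iteration; the JIT may also count the loop
        // down if 'i' isn't otherwise needed.
        for (int i = 0; i < a.Length; i++)
            sum += a[i];
        return sum;
    }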

Bounds Checks

Bounds checks ruin your day because (1) it’s more code, (2) the check includes a branch so there are more branches you have to predict, and (3) those extra branches you didn’t need reduce the chance of correctly predicting branches that are unavoidable. So less is better. The JIT is able to handle more cases, like indexing within a span. Important because the span often already has stop conditions on its length, and those are enough to validate the index; this pattern happens all over the place. Add to this some cases where you read the array/span from the end, and you get some real goodness.
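
A typical shape where one length test pays for all the indexing (my example):

    static bool StartsWithHttp(ReadOnlySpan<char> s)
    {
        // The single Length check proves indices 0..3 are all in range,
        // so the JIT can elide the bounds check on every access below.
        return s.Length >= 4 &&
               s[0] == 'h' && s[1] == 't' && s[2] == 't' && s[3] == 'p';
    }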

Some other items in this area: a successful bounds check can guarantee a positive index value which allows you to optimize subsequent / and % operations. Dividing by a possibly negative value needs some extra checks that can be elided.

Arm64

Stephen really condenses this bit. I’ll have to condense even more:

  • better barriers using stlur instead of dmb
  • better switches using bit test for some patterns instead of jump table
  • better conditionals using csel/csinc avoiding branches yay!
  • better multiplies by using the combined multiply and add/sub combos
  • better loads to read several locations using ldp instead of ldr

It all adds up!

ARM SVE

SVE is not quite like other SIMD instructions. “SVE, or Scalable Vector Extensions is an ISA from Arm that’s a bit different. The instructions in SVE don’t operate on a fixed size. Rather, the specification allows for them to operate on sizes from 128 bits up to 2048 bits, and the specific hardware can choose which size to use.” If that sounds complicated to you, you are not alone. A lot of what happens here is instruction selection under the hood, or on already vectorized paths in standard helper methods. Much of the SVE support is tied to internal use of Vector<T> and even so “designing and enabling the SVE support is a monstrous, multi-year effort, and while the support is functional and folks are encouraged to take it for a spin, it’s not yet baked enough for us to be 100% confident the shape won’t need to evolve” so it’s marked [Experimental] meaning you can get breaking changes in the future.

AVX10.1

Doing just one new instruction set would be boring. So “.NET 9 now also supports AVX10.1 (AVX10 version 1). AVX10.1 provides everything AVX512 provides, all of the base support, the updated encodings, support for embedded broadcasts, masking, and so on, but it only requires 256-bit support in the hardware.” I have to say I’m puzzled by the fact that “at the time of this writing, there aren’t actually any chips on the market that support AVX10.1, but they’re expected in the foreseeable future.” How did they test it?

AVX512

At last we get to the stalwart and available AVX512, which gets “broad support” in .NET 9. Stephen goes into elementary use cases where even very simple operations like bulk zeroing can benefit from vector instructions under the hood. That’s my favorite use of vectorization: important stuff that happens all the time just gets magically faster. The article reports that “vmovdqu32 (move unaligned packed doubleword integer values) can be used to zero twice as much at a time (64 bytes) as vmovdqa (move aligned packed integer values).” I like zeroing twice as much at a time because it shows up in lots of places.

There are more opportunities to use this instruction set; vpternlog (Bitwise Ternary Logic) lets you do complex ternaries with no branches, like csel on steroids; how about doing a ? (b ^ c) : (b & c) in one instruction on all the parts of the vector? This stuff comes up very often anywhere dynamic evaluation is happening (choose one boolean or another based on a value).

This kind of logic generalizes to other kinds of “masking.” In “a ? (b + c) : (b - c). Here, a would be considered the mask: anywhere it’s true, the value of b + c is used, and anywhere it’s false, the value of b - c is used.” That too turns into vpternlogd. This business can be spread across a vector so you bam compute all the b+c and then bam compute all the b-c and then bam bam ternlogd select out the ones you need.
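
In managed code this is the territory of ConditionalSelect; a small sketch using the cross-platform vector APIs (my example):

    using System.Runtime.Intrinsics;

    static Vector128<int> Select(Vector128<int> a, Vector128<int> b, Vector128<int> c)
    {
        // mask lanes are all-ones where a > 0, all-zeros elsewhere
        Vector128<int> mask = Vector128.GreaterThan(a, Vector128<int>.Zero);
        // bam: both b + c and b - c are computed for all lanes, then the
        // mask picks per-lane results with no branches at all
        return Vector128.ConditionalSelect(mask, b + c, b - c);
    }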

Vectorization

My favorite vectorization is implied, like when the JIT can turn a series of reads or stores into one vector operation.

Stephen also covers:

  • Comparisons: improved handling of vector comparisons.
  • Conditional selects: improved generated code for ConditionalSelects when the condition is a set of constants.
  • Better const: certain operations are enabled if a non-const argument becomes a constant as part of other optimizations (like inlining).
  • Unblocking other optimizations: various tweaks that enable other optimizations to do a better job.

Branching

Many of the JIT’s branch removal tricks that apply to bounds checks are actually generalizable. The ability of the JIT “to reason about the relationship between two ranges and whether one is implied by the other” is significantly improved, resulting in many more branch removal possibilities. In many cases 100% known branches also result in dead code removal.

Similarly, dead checks for null can be removed where pointers are known to be not null by inlining or control flow. This removes null handling that is provably not needed, or else null guards the runtime has to generate. Either way it’s fewer branches and fewer bytes overall.

Last but not least, dense checks for small values can often be converted to use the bt instruction which can test for many values simultaneously. This can also drastically cut down the number of branches in an alternation.

Write Barriers

In order to correctly garbage collect without scanning all the generations, the runtime must keep track of parts of the heap that might be holding on to new Gen0 objects. It uses a table of bits to do this, and updating this table when an object reference is written is known as a write barrier. This is different from the write barriers you may be familiar with that ensure writes have been retired to main memory in a certain order — they are both barriers of a sort.

The JIT can generate several versions of the barrier depending on the situation. The main issue is that sometimes the store might be to the heap or it might be into, say, a struct that is on the stack. If we don’t know that it’s a heap store we need an additional check in the helper, and there’s a version that does this. But that’s more branching…

In .NET 9 we can correctly generate the unchecked helper in more cases.

Adding to this, the write barrier actually marks a small region of memory as potentially having a pointer to Gen0. New cleverness exploits this to use one write barrier for multiple adjacent writes.

Finally, ref struct types cannot possibly be on the heap. When writing into such a struct the write barrier can be elided entirely. No code is the best code.

Object Stack Allocation

I was really starting to salivate when I read “In .NET 9, object stack allocation starts to happen.” but then… “before you get too excited, it’s limited in scope right now, but in the future it’s likely to expand out further.”

But we wants it NOW, precious!

Seriously, this direction is fantastic because even though temporary object allocation is fast, stack allocation is even faster, and temporary stack lifetime doesn’t “age” other objects by churning the heap. Remember that in the world of the GC, time is measured by effective allocation rate, and non-heap allocations do not count. This is huge.

Some of the patterns that are recognized at this point include:

  • temporary object creation and value extraction e.g., return (new Foo(bar)).baz;
  • temporary object creation due to conversion like if (o is IDisposable disp) disp.Dispose();
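
A hypothetical example of the first pattern (types and names are mine):

    class Foo
    {
        public int Baz;
        public Foo(int bar) => Baz = bar * 2;
    }

    // 'new Foo(bar)' never escapes GetBaz, so the JIT can allocate it on
    // the stack: no heap allocation, no GC pressure, nothing to collect.
    static int GetBaz(int bar) => new Foo(bar).Baz;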

I’m excited about the future of object stack allocation.

Inlining

This makes my head hurt and I know a lot about the VM so “you are not expected to understand this”.

“Generic methods with coreclr and Native AOT work in one of two ways. For value types, every time a generic is used with a different value type, an entire copy of the generic method is made and specialized for that parameter type; it’s as if you wrote a dedicated version of that generic code that wasn’t generic and was instead customized specifically for that type. For reference types, there’s only one copy of the code that’s then shared across all reference types, and it’s parameterized at run-time based on the actual type being used. When you access such a shared generic, at run-time it ends up looking up in a dictionary the information about the generic argument and using the discovered information to inform the rest of the method. Historically, this has not been conducive to inlining.”

The above is basically a trade-off between code bloat for all those reference types and this kind of “late bound type information via a dictionary” mechanism. Here’s the thing, sometimes what looks like a need to access that dynamic object info ends up being unnecessary, maybe it’s dead code for instance. In those cases, all the reasons to not inline the shared code go away.

In .NET 9 a bunch of those cases were fixed. Hence more inlining of shared generic methods.

GC

The major change in the Garbage Collector this time concerns “DATAS, or Dynamically Adapting To Application Sizes.”

“DATAS … dynamically scales how much memory is being consumed by server GC, such that in times of less load, less memory is being used … DATAS is now enabled by default for server GC”

Now this is actually super valuable: in many cases space is speed, because less memory means better processor efficiency. Memory reduction can enable you to run more workloads on the same server or on less expensive servers. Remember folks, latency and throughput don’t happen in a vacuum. In many contexts, cycles equal cash.

GC “pause times” were also targeted in the Linux builds: the GC’s parallel “vxsort”, used to sort objects by address, is no longer limited to Windows.

VM

The “VM” is basically all the code that manages the orchestration of .NET assemblies, loading classes, creating method tables, interop, exception handling and more. These are the building blocks of everything else. In this version we find many optimizations in these areas.

  • method tables: lazy allocation of some method table info for space savings
  • improvements in method table construction
  • interop improvements in various key methods switching them to the QCALL mechanism rather than the classic FCALL. This included improvements in Marshal, Interlocked, GC, Reflection, Delegate and ValueType
  • exception handling gets a nice boost as the “new” exception handling model (ported from AOT in .NET 8) is enabled by default. In some benchmarks this implementation is 3.5–4x faster.

Mono

“Mono” is used when “the target application requires a small runtime: by default, it’s the runtime that’s used when building mobile apps for Android and iOS today, as well as the runtime used for Blazor WASM apps.”

In .NET 9, the Mono flavor gets quite a bit of love:

  • Saving/restoring of profile data. Mono can now use previous executions of code to help train the current execution. It can generate better WASM on-the-fly using these hints. When running in a browser, it stores that information in the browser’s cache for subsequent runs to get better code more quickly.
  • The Static Single Assignment (SSA) form is now used to simplify data flow analysis and thus better optimize the code.
  • The various mono backends need to also handle vector operations efficiently to get any value out of them. These primitives happen in “IndexOfAny, hex encoding and decoding, Base64 encoding and decoding, Guid, and more.”
  • More intrinsics were added, like various AdvSimd.Load* and AdvSimd.Store*, as well as improvements in Span<T>.Clear/Fill plus various Unsafe methods, such as BitCast.
  • Complex variance checks on arrays were simplified when the array’s element type is sealed: sealed types make many comparisons easier/faster.

Native AOT

When building with Native AOT you get a standalone binary which contains “all of the assembly code for the whole app, inclusive of the code for any core library functionality accessed.” That means it’s very important to do good “shaking” so that you get only what you actually need not what it looks like you might need but actually don’t need. Otherwise, you get the kitchen sink.

You can make improvements in this space either by improving the quality of the dependency analysis so it “sees” more clearly, or, often more easily, by restructuring the dependencies so that it is obvious what is used and what isn’t.

The way this fails typically goes something like this:

  • the code uses some kind of dynamic dispatch like a string to pick an implementation and the compiler can’t know which one
  • the code has many generic variants that are mostly the same but different enough not to fold
  • highly polymorphic code like ToString implementations might be reachable if the right type is around

This kind of thing happened in System.Security.Cryptography, in ASP.NET due to Select<T> variants, in ArrayPool<T> in Write<TState>, in AppDomain.ToString, in Microsoft.Extensions.DependencyInjection and in System.Private.CoreLib for Environment.Version and RuntimeInformation.FrameworkDescription.

Additionally, Native AOT now deduplicates different methods with the same code; unused interface methods could be trimmed away; now the compiler can fully remove the actual interface types; and static constructors only need to be kept if a static field was accessed.

Threading

.NET 9 adds a System.Threading.Lock type you can use instead of lock(object). The compiler is enlightened to use this class when you are on the right runtime, and the generated code is better.
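
Usage looks like this (the Lock type is real; the surrounding code is my sketch):

    using System.Threading;

    class Counter
    {
        private readonly Lock _gate = new();
        private int _count;

        public void Increment()
        {
            // When the compiler sees the lock target is a
            // System.Threading.Lock it lowers this statement to the Lock's
            // scoped enter/exit pattern instead of Monitor.Enter/Exit.
            lock (_gate)
            {
                _count++;
            }
        }
    }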

Good old Interlocked gets the ability to operate over types smaller than int. This is good for space savings of course but also can help Parallel.ForAsync<T>. Also, Exchange and CompareExchange have had their class constraint removed. This means use of Exchange<T> and CompareExchange<T> will work for reference types, primitive types, or enum types.

.NET 9 also “intrinsifies the Interlocked.And and Interlocked.Or methods for additional platforms; previously they were specially handled on Arm, but now they’re also specially handled on x86/64.”

For Task completion, WhenAll gets improvements and WhenEach is introduced, allowing you to iterate over tasks as they complete.

ECMA 335 defined the official memory model for .NET, but real implementations, including coreclr, generally had stronger guarantees. The memory model .NET actually provides has now been documented. This allowed some redundant uses of volatile in the code to be safely removed.

“Marking fields or operations as volatile can come with an expense, depending on the circumstance and the target platform. For example, it can restrict the C# compiler and the JIT compiler from performing certain optimizations.”

.NET 8 could inline the fast path of thread-local state (TLS). In .NET 9, Native AOT gets this improvement as well.

On Linux, implementing GC synchronization comes down to membarrier with MEMBARRIER_CMD_PRIVATE_EXPEDITED, requiring an earlier MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED. The registration is a lot cheaper if there is only one thread when it happens — now that is usually the case. The startup savings of about 10ms is consequential.

Reflection

Code introspection with Reflection is both powerful and costly, so any improvements are very helpful. Memory economy is the first line of defense, converting internals to ReadOnlySpan<T> where possible and avoiding costly String.Split operations is good practice everywhere.

.NET 9 adds the Delegate.EnumerateInvocationList<TDelegate> method, which returns an allocation-free enumerable for iterating through the delegates.
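
A quick sketch of how you’d use it (my example):

    using System;

    class Demo
    {
        static void Main()
        {
            Action handlers = () => Console.WriteLine("first");
            handlers += () => Console.WriteLine("second");

            // No Delegate[] allocation the way GetInvocationList() makes;
            // the enumerator walks the multicast list in place.
            foreach (Action handler in Delegate.EnumerateInvocationList(handlers))
                handler();
        }
    }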

ActivatorUtilities.CreateInstance gets improvements in .NET 9 which directly benefit dependency injection solutions, heavy reflection customers.

.NET 9 gives “a speedup for field access via a FieldInfo by employing an internal FieldAccessor that’s cached onto the FieldInfo object.” Fields are important.

.NET 9 extends [UnsafeAccessor] to generics as well, thereby avoiding reflection costs to access private members in, say, mocking code.

Several new members became intrinsics in .NET 9, such as typeof(T).IsPrimitive. Importantly, key intrinsics like this enable other optimizations, like more inlining, that can cause buckets of code to collapse. Good building blocks are essential for this though. Inlining plus dead code elimination showed nearly 50x improvement in some benchmarks.

Numerics

Primitive Types

Primitives are everywhere and in .NET 9 “a multitude of PRs have gone into reducing overheads of various operations on these core types.”

  • DateTime.TryParse gets some boxing savings by not computing error info it doesn’t need in the TryParse case
  • Span types make great lookup tables; this pattern now appears in DateTime and TimeSpan
  • various Math.Round and MathF.Round overloads got some love
  • Math.SinCos and MathF.SinCos use RuntimeHelpers.IsKnownConstant to sometimes get constant output overall.
  • Signed division can be replaced with unsigned division if the JIT can prove that both the numerator and denominator are non-negative. This enables more >> optimizations.
  • Nullable<T> gets several improvements to make it cheaper, especially with generics.
  • .NET 9 also optimizes castclass for Nullable<T>, sounds not so important but it comes up in string interpolation. Also the JIT now might know that the cast target is for sure null (because of previous inlining) so all kinds of stuff can be skipped
  • In .NET 9 the JIT can “inline the fast path of the unboxing helper that’s used when extracting a Nullable<T> from an object.”

Big Integer

Stephen says, “A bunch of nice changes have landed for .NET 9.”

  • BigInteger‘s byte-based constructors were vectorized using operations from MemoryMarshal and BinaryPrimitives.
  • added support for the "b" format specifier for formatting and parsing BigInteger as binary, which vectorizes better
  • the threshold for choosing the really “big integer parsing algorithm” drops from 20,000 digits to 1233, where it still works better
  • leading and trailing zeros are handled better when parsing powers of 10
  • BigInteger.Equals learns to use MemoryExtensions.SequenceEqual instead of walking the arrays
  • BigInteger.IsPowerOfTwo learns to use ContainsAnyExcept, to find if all elements after a certain point are 0
  • BigInteger.Multiply, gets an improvement that helps when the first value is much larger than the second
  • BigInteger formatting removed various temporary buffer allocations and got some calculation improvements

Tensor Primitives

“A large stable of numerical operations is now exposed on every numerical type as well as on a set of generic interfaces those types implement. But sometimes you want to perform the same operation on a set of values rather than on an individual value.” This is where TensorPrimitives comes into the picture. It provides “a plethora of numerical APIs, but for spans of them rather than for individual values.”

.NET 8 had 40 such methods but now the list is over 200 in size. So yes, we have a plethora. “And they’re exposed using generics, such that they can work with many more data types than just float”. Stephen also reports, “progress has been made towards also exposing the same set of operations on these vector types.”
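
A tiny sketch of the shape of these APIs (my example, using float for simplicity):

    using System.Numerics.Tensors;

    float[] x = { 1f, 2f, 3f, 4f };
    float[] y = { 10f, 20f, 30f, 40f };
    float[] dest = new float[4];

    // One call does the elementwise add across the whole spans, using
    // SIMD under the hood; the .NET 9 surface is generic over T.
    TensorPrimitives.Add(x, y, dest);   // dest = { 11, 22, 33, 44 }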

Strings, Arrays and Spans

These are the work horses of just about every subsystem in .NET so they are used implicitly nearly ubiquitously. As such, improvements in these areas disproportionately benefit real world cases.

Index Of

In .NET 8 “SearchValues<T> enables optimizing searches, by pre-computing an algorithm to use when searching for a specific set of values (or for anything other than those specific values) and storing that information for later repeated use. Internally, .NET 8 included upwards of 15 different implementations that might be chosen based on the nature of the supplied data. The type was so good at what it did that it was used in over 60 places as part of the .NET 8 release. In .NET 9, it’s used even more.” For instance, Regex.Escape uses it to find characters that need escaping.

The utility comes from the fact that the values and the stored object holding them let you choose an algorithm once and re-use it, even if it needs to make lookup tables or whatever. Many of these algorithms work very well on the given character ranges and many of them can match multiple characters with SIMD instructions.

.NET 9 adds SearchValues<string> support allowing you to search for one of many strings in a Span<char>. “Until .NET 9 there have not been any built-in methods for doing multi-string search, so this new support both adds such support and adds it in a way that is highly efficient.”
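
Usage follows the same create-once-then-reuse pattern as the .NET 8 flavors; here is a hedged sketch (names are mine):

    using System;
    using System.Buffers;

    static class LogScanner
    {
        // Created once; SearchValues picks and caches its algorithm up front.
        private static readonly SearchValues<string> s_terms =
            SearchValues.Create(["error", "warn", "fatal"], StringComparison.OrdinalIgnoreCase);

        public static bool HasTerm(ReadOnlySpan<char> text) =>
            text.IndexOfAny(s_terms) >= 0;
    }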

How do we get speed out of this? Well, different implementations for different kinds of lists. No items in the list, one item in the list, search for the least common letter in each matching word, and more. The Rabin-Karp rolling hash is an option you just get for asking.

This area is rich with possibilities, the mixture of algorithm and instruction set offers many opportunities to get the best value out of your hardware without having your brain explode. I for one do not want to have to know that “older instruction sets don’t have VPERMB, which is exposed as Avx512Vbmi.PermuteVar64x8”. My head hurts thinking about it.

Regex

The compiled flavor of the regular expression engine has the advantage that it can be enlightened for important patterns. It might take a long time to run because it could backtrack, but backtracking is often very limited and the compiled tricks can be very good. In .NET 9, SearchValues<string> appears as a tool in matching. At least a half dozen other generated-code improvements went into this engine.

The non-backtracking engine uses finite automata, and these techniques have been well understood for decades. But there is still room for improvement:

  • It can use deterministic finite automata (DFA) or non-deterministic finite automata (NFA) rules; you fall back to NFA when the size of the automaton gets too large, and this limit is now 125,000 nodes
  • In many patterns many characters are equivalent, these are called minterms e.g. in [a-z]* all characters are either of the class [a-z] or not. So all the rules can be in terms of those two classes, this can get you a lot of compression. “All but the most niche patterns have fewer than 256 minterms” so one byte storage is enough for that stuff
  • Timeout checks in the non-backtracking engine are largely redundant but were adding overhead, not so much anymore
  • Inner loop specialized for the most common patterns, skipping a few tests; this loop is super hot and only a few instructions in the first place!

Last but not least, Regex Split overloads can support span inputs and return an enumerator of Range objects pointing to the original data. No allocs are the best allocs.
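
A sketch of what the span-based splitting looks like (my example):

    using System;
    using System.Text.RegularExpressions;

    ReadOnlySpan<char> input = "alpha, beta,  gamma";

    // Each Range indexes back into 'input'; no strings are allocated
    // unless you choose to materialize a segment yourself.
    foreach (Range segment in Regex.EnumerateSplits(input, @",\s*"))
    {
        Console.WriteLine(input[segment].ToString());
    }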

Encoding

“.NET now has a fully-featured Base64Url type that is also very efficient. It actually shares almost all of its implementation with the same functionality on Base64 and Convert, using generic tricks to substitute the different alphabets in an optimized manner.”
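
A small sketch of the new type in action (my example):

    using System;
    using System.Buffers.Text;

    byte[] payload = { 0xFA, 0xCE, 0xB0, 0x0C };

    // URL-safe alphabet ('-' and '_' instead of '+' and '/'), sharing
    // the same optimized core as the regular Base64 paths.
    string encoded = Base64Url.EncodeToString(payload);
    byte[] decoded = Base64Url.DecodeFromChars(encoded);
    Console.WriteLine(encoded);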

Combine that with the “AVX512-optimized Base64 encoding/decoding implementation”, and you get “optimized Base64 encoding and decoding for Arm64.”

Still hungry for more? “The new Convert.ToHexStringLower methods may be used to go directly to the lower-case version;” and “TryToHexString and TryToHexStringLower …format directly into a provided destination span rather than allocating anything. For parsing, “overloads of FromHexString … write into a destination span rather than allocating a new byte[] each time.”

Span, Span and more Span

I got used to Span working on Midori, and I don’t know how much of an influence Midori was on the .NET Framework, but man was I happy to see Span land in a place where many could use it. If only we also had SparseBuffer but I digress.

Continuing the years-long pattern of #spaninalltheplaces, “[T]he C# params keyword [may] be used with more than just array parameters, but rather any collection type that’s usable with collection expressions… that includes span.”

Add a params ReadOnlySpan<T> overload and callers will use a stack allocated span to access your code. Using this technique, “~40 new overloads for methods that didn’t previously accept spans and now do, and added params to over 20 existing overloads that were already taking spans.” This takes advantage of the fact that ReadOnlySpan<T> can benefit from stackalloc. Not only that, Span initialization can often take advantage of good instructions like cpblk.
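
Here is what that looks like for your own APIs (my sketch):

    using System;

    static class Util
    {
        // C# 13 'params collections': callers write Util.Sum(1, 2, 3) and
        // the compiler hands this method a stack-allocated span, no int[].
        public static int Sum(params ReadOnlySpan<int> values)
        {
            int total = 0;
            foreach (int v in values)
                total += v;
            return total;
        }
    }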

Span in more places? How about patterns for nint and nuint, decimal, and string? Ok you can have no-allocation initialization for those too.

MemoryExtensions.Split and MemoryExtensions.SplitAny likewise get overloads for ReadOnlySpan<T> which means no allocation to call them and no allocation to process their SpanSplitEnumerator<T>. Same for StringBuilder.Replace.

Maybe #spaninalltheplaces will really happen soon.

Collections

LINQ

The Linq to Objects implementation seems to be composed of IEnumerable<T> chains in some tree shape. Actually, the situation is somewhat more complicated. To avoid lots of redundant interface calls there are more complicated interior nodes that know something about their source. Later nodes can combine with these to get their results at less cost. See “Deep Dive on LINQ” and “An even DEEPER Dive into LINQ” for a more in-depth exploration of how exactly this works. Suffice to say this got complicated over the years.

In .NET 9 this all gets consolidated into a base class Iterator<TSource>. It simplifies the code base and gives us some benefits:

  1. Virtual dispatch is generally a bit cheaper than interface dispatch.
  2. Type tests can be consolidated.
  3. All the virtual methods are implemented by any iterator deriving from the base class.

With this in place, lots of things can get better. For instance Any can use TryGetFirst which is always there to avoid tons of temporary allocations. The old way was, well, messier, with many combos. I don’t like combos.

There are many more modest improvements throughout Linq: various checks for empty cases, better handling of Enumerable.Chunk, improvements in the predicated version of Any, All, Count, First, and Single. Perhaps the most ambitious change significantly simplified ToArray and ToList. This is a very good thing because “ToArray in particular is used so ubiquitously that over the years, many folks have attempted to optimize it. In doing so, however, it’s gotten too complex for its own good.” The ToArray improvements used a new helper called SegmentedArrayBuilder which was then re-used in ToList with some adaptation.

Some more ToList improvements happen on the result of a Distinct or Union by enabling HashSet<T>‘s CopyTo implementation to be used. Add to this some special cases for 0 and 1 length outputs used with Distinct, Union and OrderBy; all of these do not require the general algorithms and are pretty common.

One of my favorites: OrderBy followed by First or Last can be greatly simplified, as there is no need to do the full sort at all. ThenBy can mess this up, so the code has to handle all those cases. Still, it’s common enough to be worth the effort.

There are others here, the list is quite extensive.

Core Collections

Oh joy and happiness! You no longer have to materialize a string to do a key lookup in a dictionary. If the key is elsewhere, like part of a ReadOnlySpan<char> or ReadOnlySpan<byte> you can use IAlternateEqualityComparer<TAlternate, T> to provide alternate comparisons and expose methods using TAlternateKey.

“[Now] Dictionary<TKey, TValue>, ConcurrentDictionary<TKey, TValue>, FrozenDictionary<TKey, TValue>, HashSet<T>, and FrozenSet<T> all do exactly that.” This is great because, “I [only] need to materialize the string for each ReadOnlySpan<char> in order to store it in the dictionary”.
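
Here is a sketch of the span-keyed pattern via GetAlternateLookup (my example, also leaning on the span-based Split discussed earlier):

    using System;
    using System.Collections.Generic;

    var counts = new Dictionary<string, int>();
    var bySpan = counts.GetAlternateLookup<ReadOnlySpan<char>>();

    ReadOnlySpan<char> line = "apple pear apple";
    foreach (Range word in line.Split(' '))
    {
        ReadOnlySpan<char> key = line[word];
        // Query and update by span; a string key is materialized only
        // when a key is actually inserted for the first time.
        bySpan[key] = bySpan.TryGetValue(key, out int n) ? n + 1 : 1;
    }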

Once upon a time I wrote a streaming file processor with lots of counts and sums in dictionaries and I had to do all manual collections to avoid billions of temporary string allocations. No one ever has to do this again. All the plumbing has been done for you already.

For fun, we can also take this example one step further. .NET 6 introduced CollectionsMarshal.GetValueRefOrAddDefault. Also, “as part of this alternate key effort, a new overload of GetValueRefOrAddDefault was added that works with it, such that the same operation can be performed with a TAlternateKey.” Lovely.

And! The comparer implementation for string/ReadOnlySpan<char> was extended to apply to EqualityComparer<string>.Default. This means that “if you don’t supply a comparer at all, these collection types will still support ReadOnlySpan<char> lookups.” And this all works for HashSet, too.

Thank you!

A few other things:

  • TrimExcess(int capacity) was added to HashSet<T>, Queue<T>, and Stack<T>, enabling more fine-grained control over memory
  • IsSubsetOf, IsProperSubsetOf, and SetEquals were improved
  • Dictionary<T, T> was replaced with the cheaper HashSet<T> in a few places

Some of the less used collections also got improvements:

  • In PriorityQueue<TElement, TPriority> bulk inserting into an empty queue with EnqueueRange(IEnumerable<TElement>, TPriority) needs no heapification, just copy and go
  • In BitArray the many vectorized ops add Vector512 support
  • List<T> loses an unnecessary copy in Insert
  • In FrozenDictionary<TKey, TValue> and FrozenSet<T> for strings, the bitmap of string lengths in the table was extended to 64 bits. This gives an accurate early-out.

Compression

Traditionally .NET used whatever zlib happened to be on the machine in question. Windows has none so intel/zlib was used “but the intel/zlib repository was archived and is not actively being maintained by Intel.”

“To simplify things, to improve consistency and performance across more platforms, and to move to an actively supported and evolving implementation, this changes … .NET 9 now includes the zlib functionality built-in across Windows, Linux, and macOS, based on the newer zlib-ng/zlib-ng.”

There isn’t much to say here. Naturally this is all new code so it will perform differently; Stephen includes benchmarks in the relevant section.

Cryptography

There are several general and impactful performance improvements in this area.

  • Random number generation: we can use a single call to NextBytes for random numbers that are ≤ 256 and a power of 2. Get your bytes and mask if needed. This comes up a lot and saves many calls to the system RandomNumberGenerator.
  • System.Security.Cryptography uses NativeMemory.Alloc and NativeMemory.Free instead of more costly Pinned Objects.
  • interop paths for CngKey properties like ExportPolicy, IsMachineKey, and KeyUsage work with an int on the stack, passing a pointer to it to the OS, avoiding the need to allocate.
  • the crypto libraries often use a CloneByteArray helper. This helper was cloning empty arrays for no good reason.
  • PublicKey uses AsnEncodedData; it was sometimes handed ownership of temporary instances, yet it still cloned them.
  • AddEntry now takes ReadOnlySpan<string> (spans in all the places!) instead of string[]. Such call sites use stack space to store the strings passed to AddEntry instead of heap.
  • The OidCollection didn’t have a way to specify the size of the collection to avoid growth overhead even though many of the callers did know the exact required size. Likewise, CborWriter lacked the ability to presize.
  • CborWriter was increasing its size by a fixed amount which led to O(N²) growth cost. It’s now on a doubling plan like other collections.
  • A few properties on CngKey (Algorithm, AlgorithmGroup, and Provider) were memoized because the answer is always the same. “These are particularly expensive because the OS implementation of these functions needs to make a remote procedure call to another Windows process to access the relevant data.”

Networking

.NET 9 includes some improvements in steady state HTTP performance but also in TLS connection establishment.

  • reduced allocations associated with the TLS handshake
  • avoiding some unnecessary SafeHandle allocation
  • clients using client certificates can benefit from TLS resumption
  • the HTTP connection pool synchronization mechanism uses an opportunistic layer of lockless synchronization in a ConcurrentStack<T>

With a connection established the runtime:

  • uses vectorized helpers Ascii.FromUtf16 to write the request head
  • avoids extra async state machines needed only in rare logging cases
  • removes allocations by computing and caching some bytes that need to be written out on every request
  • special cases the most common media types used in JsonContent and StringContent
  • special-cases TryAddWithoutValidation for multiple values provided by IList<string>
  • avoids large char[] allocation in the parsing of Alt-Svc headers by using ArrayPool

The WebUtility and HttpUtility types both got more efficient:

  • HtmlEncode begins using the faster helper SearchValues<char>
  • UrlEncode similarly gets wins using SearchValues<char>
  • UrlEncode now uses string.Create and does its work in-place
  • UrlEncodeToBytes and UrlDecodeToBytes use stack space for smaller inputs, and use SearchValues<byte> to optimize the search for invalid bytes
  • UrlPathEncode, uses ArrayPool for memory
  • JavaScriptStringEncode leverages SearchValues (SearchValues for all!)
  • ParseQueryString uses stackalloc for smaller input lengths, and uses string.Substring with span slicing

Other memory savings:

  • Uri gained TryEscapeDataString and TryUnescapeDataString, which write to spans; used in FormUrlEncodedContent for space and speed gains

Improvements elsewhere helped out WebSockets with TryValidateUtf8 costs dropping significantly.

JSON

System.Text.Json got several improvements in .NET 9.

.NET 9 adds “JsonSerializer.SerializeAsync overloads that target PipeWriter in addition to the existing overloads that target Stream. That way, whether you have a Stream or a PipeWriter, JsonSerializer will natively work with either.” ASP.NET uses System.IO.Pipelines internally so no more shims.
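
A minimal sketch (the overload is real; the types are mine):

    using System.IO.Pipelines;
    using System.Text.Json;
    using System.Threading.Tasks;

    static class JsonPipe
    {
        public static async Task WriteAsync(PipeWriter writer, Order order)
        {
            // Serializes straight into the pipe's buffers; no Stream shim.
            await JsonSerializer.SerializeAsync(writer, order);
        }
    }

    record Order(int Id, decimal Total);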

JsonSerializer gets an allocation-free parsing solution for enums with the GetAlternateLookup support mentioned earlier.

System.Text.Json gets many other improvements:

  • JsonProperty.WriteTo used writer.WritePropertyName(Name), but now it can write the UTF8 bytes directly
  • Base64EncodeAndWrite can now encode a source ReadOnlySpan<byte> directly into its destination Span<byte>
  • JsonNode.GetPath avoids List<string> allocs by extracting its path segments in reverse order and “building the resulting path in stack space or an array rented from the ArrayPool.”
  • JsonNode.ToString and JsonNode.ToJsonString use existing caches of PooledByteBufferWriter and Utf8JsonWriter, avoiding allocs
  • JsonObject uses the count of items in its input Enumerable, if available, to pre-size its dictionary
  • JsonValue.CreateFromElement accesses JsonElement.ValueKind repeatedly to determine how to process the data; it was “tweaked” to only access the property once
  • JsonElement.GetRawText can use JsonMarshal.GetRawUtf8Value to return a span over the original data, no allocs
  • Utf8JsonReader and JsonSerializer now support multiple top-level JSON objects from an input. No more pre-parsing to avoid errors due to multiple objects, and no more work-around code

Diagnostics

“System.Diagnostics.Metrics.Meter Counter and UpDownCounter are often used for hot-path tracking of metrics like number of active or queued requests. In production environments, these instruments are frequently bombarded from multiple threads concurrently.”

This all needs to be thread-safe and highly scalable. In .NET 9 various lock-free patterns were used to accelerate these.

  • Interlocked.CompareExchange to do the addition of doubles
  • Use more than one counter to accumulate the total for less contention
  • Cache align the doubles so that there is no false sharing of cache lines
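
The first of those is the classic compare-exchange retry loop; a sketch of the idea (not the actual library code):

    using System.Threading;

    static class MetricTotal
    {
        private static double s_total;

        public static void Add(double value)
        {
            // Lock-free add: retry until no other thread raced in between
            // our read and our compare-exchange.
            double seen = Volatile.Read(ref s_total);
            while (true)
            {
                double original = Interlocked.CompareExchange(ref s_total, seen + value, seen);
                if (original == seen)
                    return;
                seen = original;   // lost the race; retry with the fresh value
            }
        }
    }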

More about false sharing in “Let’s Talk Parallel Programming” in the Deep .NET series.

Measurement gets a new constructor that takes a TagList, avoiding the overhead associated with adding custom tags.

TagList models a list of key/value pairs, but it has built-in fields for a small number of tags without allocating an array. Now an [InlineArray] is used, which enables access via spans in all the cases.

Peanut Butter

Stephen’s “peanut butter” list is really all over the place. I’ll just briefly mention some of the topics here.

  • StreamWriter.Null becomes an instance of NullStreamWriter
  • NonCryptographicHashAlgorithm changes Append to allocate a small temporary Stream object that wraps this NonCryptographicHashAlgorithm instance for throughput benefits
  • virtual removed from “a smattering” of internal members that didn’t need it
  • replaced a bunch of Span<T> instances with ReadOnlySpan<T> to reduce overhead associated with the covariance check
  • hundreds of fields were upgraded to readonly or const when possible
  • In MemoryCache only one thread can trigger compaction at a time
  • BinaryReader needs extra allocations to read text; these are now pay-for-play
  • ArrayBufferWriter.Clear sets the written count to 0 and also clears the underlying buffer. The new ResetWrittenCount only clears the written count.
  • The File class methods like File.WriteAllText get Span based overloads. #spansinalltheplaces
  • MailAddressCollection uses a builder instead of string concat
  • The config source generator gets some changes to avoid unnecessary display class allocations for some lambdas
  • StreamOnSqlBytes properly overrides Read/Write avoiding base class cost (this has happened often).
  • NumberFormatInfo uses singletons to initialize its NumberGroupSizes, CurrencyGroupSizes, and PercentGroupSizes instead of new arrays
  • ColorTranslator avoids many P/Invokes by getting all color components in one call (Windows only issue)

Acknowledgements

I have to thank Stephen for the amazing source material. This summary is roughly 5% the size of the full write up and I’m sure I have introduced some inaccuracies in the summarization but hopefully not too many. If there are any errors here, I’m sure they are my own.

If you’re looking for more clarity, or the supporting benchmark details, click through the section headers to get the full story. This article is a good gateway to the full material.
