Performance Improvements in .NET 9
Last year I wrote a summary of Stephen Toub’s Performance Improvements in .NET 8. That was so popular we’ve decided to do it again. Like before, this is going to be personally biased, at least somewhat, towards things I happen to care the most about, but that can’t be helped. Also like before, I have cross-linked to Stephen’s full article, which has omg way more details on each item. The structure of this summary exactly parallels Stephen’s new article: Performance Improvements in .NET 9.
[Note: there are many direct quotes from the article, and I have removed internal quotations for clarity as a matter of course. Use the title links to find the original quotes in full.]
JIT
Nobody should be surprised that this is going to be the area I am most excited about. The opportunity for code quality improvement usually is the biggest attack vector for performance and this year doesn’t disappoint. Many of the themes from .NET 8 get additional love, including the multi-stage code generation features.
Fully half of the content is JIT oriented, with some cases more mainstream than others, but a recurring theme here is, again, dynamic compilation: both reacting to actual workloads (generating code for what is hot) and targeting fancy processor features when they are available. It can be very challenging to do the latter, so you often get lowest-common-denominator code generation. We don’t want any of that in 2024.
“PGO is a feature that enables the JIT to profile code and use what it learns from that profiling to help it generate more efficient code based on the exact usage patterns of the application” — this is a trick well known in the JavaScript world, where everything is dynamic and this approach is the only way to get decent code quality, but it works for typed languages too. The general idea is that having found important code and then observed it for a while, the JIT can do a better job given another try. It’s important to not do this universally because in many cases the effort of trying to make better code takes longer than just running the dumb code would have.
PGO now gains the ability to track type info, so (T)obj and obj is T both have a chance to be optimized.
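To make that concrete, here is the shape of code where type profiling pays off; a sketch with hypothetical types, not anything from the article:

```csharp
using System;

public interface IShape { double ComputeArea(); }

public sealed class Circle : IShape
{
    public double Radius;
    public double ComputeArea() => Math.PI * Radius * Radius;
}

public static class Geometry
{
    public static double Area(IShape shape)
    {
        // If PGO observes that 'shape' is almost always a Circle, the JIT can
        // effectively rewrite this interface call into a guarded fast path:
        //   if (shape is Circle c) { inlined Circle.ComputeArea } else { dispatch }
        // The new type tracking means '(T)obj' and 'obj is T' in your own code
        // get the same kind of profile-driven treatment.
        return shape.ComputeArea();
    }
}
```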
Additionally, common integer values may be observed, resulting in inline code for the most common case. Sounds not that exciting, but forwarding those constants to helper methods like MemMove, CopyTo and SequenceEqual means that the best code for those methods can be generated.
Using one feature to light up other features is a common theme in .NET 9.
Tier 0 code is the initial, simplest code generation. The JIT is most concerned with getting some code out there, because in many cases the code only runs once and thinking a long time equals “I could have been done already”. However, there are some optimizations that are so good you just gotta do them. To wit: ArgumentNullException.ThrowIfNull is used all over the place for argument validation. In some cases there would be boxing just for asking the question “is it null?”, which is bizarre since a boxed value can’t be null. The JIT now special-cases ThrowIfNull to avoid this. Boxing bad. Less boxing good. Similar cases avoided boxing in CreateSpan and many async/await paths.
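Here is the sort of shape where that boxing used to sneak in; a sketch with an unconstrained generic:

```csharp
using System;

public static class Cache<T>
{
    public static void Add(T item)
    {
        // ThrowIfNull takes an object, so with an unconstrained T this call
        // used to box value-type arguments just to ask "is it null?", even
        // though a boxed value can never be null. The JIT now special-cases
        // ThrowIfNull so the box never happens.
        ArgumentNullException.ThrowIfNull(item);
        // ... store the item ...
    }
}
```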
Loops uh loop, so they are important. Downcounting loops create better assembly, so rewriting upcounting loops to downcount can be a win, even if you need to burn another register. But it’s even better if the loop index variable could go either way, or if a strength reduction can give you a pointer that marches through an array with no costly shift/add for your trouble. More registers sounds bad, but escape analysis can tell you that maybe eax can be used, and it’s always scratch. And an extra inc might easily pay for itself, so the JIT learns some new tricks here. If you’re wondering what a strength reduction looks like: maybe it converts an address computation like rax+rsi*4+10 into a single pointer that is simply incremented as we go along. We call it strength reduction because it reduces a multiply to an add, and if we do that we don’t need the upcounting index (rsi in this example). Better loop, less math. Is this a trick question?
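A sketch of the kind of loop this applies to; the transformation is the JIT's, the C# stays the same:

```csharp
static class LoopDemo
{
    // Each iteration conceptually computes an address like rax + rsi*4 + offset.
    // With strength reduction the JIT can instead keep one pointer it bumps by 4
    // per iteration (the multiply becomes an add), and since 'i' is then unused
    // it can also consider counting down for a cheaper loop exit test.
    static int Sum(int[] values)
    {
        int sum = 0;
        for (int i = 0; i < values.Length; i++)
            sum += values[i];
        return sum;
    }
}
```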
Bounds checks ruin your day because (1) it’s more code, (2) the check includes a branch, so there are more branches to predict, and (3) those extra branches you didn’t need reduce the chance of correctly predicting the branches that are unavoidable. So fewer is better. The JIT is able to handle more cases, like indexing within a span. This is important because the span itself often has stop conditions on its length, those are enough to validate the index, and that happens all over the place. Add to this some cases where you read the array/span from the end, and you get some real goodness.
Some other items in this area: a successful bounds check can guarantee a positive index value, which allows you to optimize subsequent / and % operations. Dividing by a possibly negative value needs some extra checks, and those can be elided.
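For example, in a loop like this (my sketch), both effects show up:

```csharp
using System;

static class SpanDemo
{
    static int SumEvenIndexes(ReadOnlySpan<int> span)
    {
        int sum = 0;
        for (int i = 0; i < span.Length; i++)
        {
            // The loop condition already proves 0 <= i < span.Length, so the
            // JIT can elide the bounds check on span[i]. And because i is
            // provably non-negative, 'i % 2' compiles to the cheap unsigned
            // form with no sign fix-ups.
            if (i % 2 == 0)
                sum += span[i];
        }
        return sum;
    }
}
```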
Stephen really condenses this bit. I’ll have to condense even more:
- better barriers using stlur instead of dmb
- better switches using a bit test for some patterns instead of a jump table
- better conditionals using csel/csinc, avoiding branches (yay!)
- better multiplies by using the combined multiply and add/sub combos
- better loads using ldp to read several locations instead of ldr
It all adds up!
SVE is not quite like other SIMD instructions. “SVE, or Scalable Vector Extensions is an ISA from Arm that’s a bit different. The instructions in SVE don’t operate on a fixed size. Rather, the specification allows for them to operate on sizes from 128 bits up to 2048 bits, and the specific hardware can choose which size to use.” If that sounds complicated to you, you are not alone. A lot of what happens here is instruction selection under the hood, or on already vectorized paths in standard helper methods. Much of the SVE support is tied to internal use of Vector<T>, and even so “designing and enabling the SVE support is a monstrous, multi-year effort, and while the support is functional and folks are encouraged to take it for a spin, it’s not yet baked enough for us to be 100% confident the shape won’t need to evolve”, so it’s marked [Experimental], meaning you can get breaking changes in the future.
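Vector<T> is the natural vehicle for this because its width is whatever the hardware offers; the same code can light up on wider units. A minimal sketch:

```csharp
using System;
using System.Numerics;

static class VectorSum
{
    // Vector<int>.Count is not a compile-time constant: it reflects the
    // hardware (SSE, AVX2, or potentially an SVE width on capable Arm chips),
    // so this one method scales with the machine it runs on.
    static int Sum(ReadOnlySpan<int> values)
    {
        var acc = Vector<int>.Zero;
        int i = 0;
        for (; i <= values.Length - Vector<int>.Count; i += Vector<int>.Count)
            acc += new Vector<int>(values.Slice(i, Vector<int>.Count));
        int sum = Vector.Sum(acc);       // horizontal add of the accumulator
        for (; i < values.Length; i++)   // scalar tail
            sum += values[i];
        return sum;
    }
}
```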
Doing just one new instruction set would be boring. So “.NET 9 now also supports AVX10.1 (AVX10 version 1). AVX10.1 provides everything AVX512 provides, all of the base support, the updated encodings, support for embedded broadcasts, masking, and so on, but it only requires 256-bit support in the hardware.” I have to say I’m puzzled by the fact that “at the time of this writing, there aren’t actually any chips on the market that support AVX10.1, but they’re expected in the foreseeable future.” How did they test it?
At last we get to the stalwart and available AVX512, which gets “broad support” in .NET 9. Stephen goes into elementary use cases where even very simple operations like bulk zeroing can benefit from vector instructions under the hood. My favorite use of vectorization is under the hood, where important stuff that happens all the time just gets magically faster. The article reports that “vmovdqu32 (move unaligned packed doubleword integer values) can be used to zero twice as much at a time (64 bytes) as vmovdqa (move aligned packed integer values).” I like zeroing twice as much at a time because it shows up in lots of places.
There are more opportunities to use this instruction set. vpternlog (Bitwise Ternary Logic) lets you do complex ternaries with no branches, like csel on steroids; how about doing a ? (b ^ c) : (b & c) in one instruction on all the parts of the vector? This stuff comes up very often anywhere dynamic evaluation is happening (choose one boolean or another based on a value).
This kind of logic generalizes to other kinds of “masking.” In a ? (b + c) : (b - c), “a would be considered the mask: anywhere it’s true, the value of b + c is used, and anywhere it’s false, the value of b - c is used.” That too turns into vpternlogd. This business can be spread across a vector, so you bam compute all the b+c, then bam compute all the b-c, and then bam bam ternlogd selects out the ones you need.
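In explicit SIMD code the same idea looks like this; a sketch using the cross-platform intrinsics (whether it lowers to vpternlog depends on the hardware):

```csharp
using System.Runtime.Intrinsics;

static class Blend
{
    // Computes a[i] > 0 ? b[i] + c[i] : b[i] - c[i] for eight ints at once,
    // with no branches: compute both candidates, then let the mask pick
    // lanes. On AVX512 hardware the select can collapse into a single
    // ternary-logic style instruction.
    static Vector256<int> AddOrSub(Vector256<int> a, Vector256<int> b, Vector256<int> c)
    {
        Vector256<int> mask = Vector256.GreaterThan(a, Vector256<int>.Zero);
        return Vector256.ConditionalSelect(mask, b + c, b - c);
    }
}
```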
My favorite vectorization is implied, like when the JIT can turn a series of reads or stores into one vector operation.
Stephen also covers:
- Comparisons: improves how vector comparisons are handled
- Conditional selects: improves the generated code for ConditionalSelects when the condition is a set of constants
- Better const: certain operations are enabled if a non-const argument becomes a constant as part of other optimizations (like inlining)
- Unblocking other optimizations: various tweaks that enable other optimizations to do a better job
Many of the JIT’s branch removal tricks that apply to bounds checks are actually generalizable. The ability of the JIT “to reason about the relationship between two ranges and whether one is implied by the other” is significantly improved, resulting in many more branch removal possibilities. In many cases 100% known branches also result in dead code removal.
Similarly, dead checks for null can be removed where pointers are known to be non-null via inlining or control flow. This removes null handling that is provably not needed, or else null guards the runtime would otherwise have to generate. Either way it’s fewer branches and fewer bytes overall.
Last but not least, dense checks for small values can often be converted to use the bt instruction, which can test for many values simultaneously. This can drastically cut down the number of branches in an alternation.
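The classic example of a dense alternation (my example, not the article's):

```csharp
static class CharTests
{
    // Instead of five compare-and-branch pairs, the JIT can fold this into a
    // range check plus a single bit-test against a constant bitmap, since all
    // the values fit in a small range.
    static bool IsAsciiVowel(char c) =>
        c is 'a' or 'e' or 'i' or 'o' or 'u';
}
```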
In order to correctly garbage collect without scanning all the generations, the runtime must keep track of parts of the heap that might be holding on to new Gen0 objects. It uses a table of bits to do this, and updating this table when an object reference is written is known as a write barrier. This is different from the write barriers you may be familiar with that ensure writes have been retired to main memory in a certain order; they are both barriers of a sort.
The JIT can generate several versions of the barrier depending on the situation. The main issue is that sometimes the store might be into the heap, or it might be into, say, a struct that is on the stack. If we don’t know that it’s a heap store, we need an additional check in the helper, and there’s a version that does this. But that’s more branching…
In .NET 9 we can correctly generate the unchecked helper in more cases.
Adding to this, the write barrier actually marks a small region of memory as potentially having a pointer to Gen0. New cleverness exploits this to use one write barrier for multiple adjacent writes.
Finally, ref struct types cannot possibly be on the heap. When writing into such a struct, the write barrier can be elided entirely. No code is the best code.
I was really starting to salivate when I read “In .NET 9, object stack allocation starts to happen.” but then… “before you get too excited, it’s limited in scope right now, but in the future it’s likely to expand out further.”
But we wants it NOW, precious!
Seriously, this direction is fantastic because even though temporary object allocation is fast, stack allocation is even faster, and temporary stack lifetime doesn’t “age” other objects by churning the heap. Remember that in the world of the GC, time is measured by effective allocation rate, and non-heap allocations do not count. This is huge.
Some of the patterns that are recognized at this point include:
- temporary object creation and value extraction, e.g., return (new Foo(bar)).baz;
- temporary object creation due to conversion, like if (o is IDisposable disp) disp.Dispose();
I’m excited about the future of object stack allocation.
This makes my head hurt and I know a lot about the VM so “you are not expected to understand this”.
“Generic methods with coreclr and Native AOT work in one of two ways. For value types, every time a generic is used with a different value type, an entire copy of the generic method is made and specialized for that parameter type; it’s as if you wrote a dedicated version of that generic code that wasn’t generic and was instead customized specifically for that type. For reference types, there’s only one copy of the code that’s then shared across all reference types, and it’s parameterized at run-time based on the actual type being used. When you access such a shared generic, at run-time it ends up looking up in a dictionary the information about the generic argument and using the discovered information to inform the rest of the method. Historically, this has not been conducive to inlining.”
The above is basically a trade-off between code bloat for all those value types and this kind of “late bound type information via a dictionary” mechanism for reference types. Here’s the thing: sometimes what looks like a need to access that dynamic type info ends up being unnecessary, maybe because it’s dead code for instance. In those cases, all the reasons to not inline the shared code go away.
In .NET 9 a bunch of those cases were fixed. Hence more inlining of shared generic methods.
GC
The major change in the Garbage Collector this time concerns “DATAS, or Dynamically Adapting To Application Sizes.”
“DATAS … dynamically scales how much memory is being consumed by server GC, such that in times of less load, less memory is being used … DATAS is now enabled by default for server GC”
Now this is actually super valuable, because in many cases space is speed: less memory means better processor efficiency. Memory reduction can enable you to run more workloads on the same server, or on less expensive servers. Remember folks, latency and throughput don’t happen in a vacuum. In many contexts, cycles equal cash.
GC “pause times” were also targeted in the Linux builds: the GC’s parallel “vxsort”, used to sort objects by address, is no longer limited to Windows.
VM
The “VM” is basically all the code that manages the orchestration of .NET assemblies, loading classes, creating method tables, interop, exception handling and more. These are the building blocks of everything else. In this version we find many optimizations in these areas.
- method tables: lazy allocation of some method table info for space savings
- improvements in method table construction
- interop improvements in various key methods, switching them to the QCALL mechanism rather than the classic FCALL. This included improvements in Marshal, Interlocked, GC, Reflection, Delegate and ValueType
- exception handling gets a nice boost as the “new” exception handling model (ported from AOT in .NET 8) is enabled by default. In some benchmarks this implementation is 3.5–4x faster.
Mono
“Mono” is used when “the target application requires a small runtime: by default, it’s the runtime that’s used when building mobile apps for Android and iOS today, as well as the runtime used for Blazor WASM apps.”
In .NET 9, the Mono flavor gets quite a bit of love:
- Save/restoring of profile data. Mono can now use previous executions of code to help train the current execution. It can generate better WASM on-the-fly using these hints. When running in a browser, it stores that information in the browser’s cache for subsequent runs to get better code more quickly.
- The Static Single Assignment (SSA) form is now used to simplify data flow analysis and thus better optimize the code.
- The various Mono backends need to handle vector operations efficiently to get any value out of them. These primitives show up in “IndexOfAny, hex encoding and decoding, Base64 encoding and decoding, Guid, and more.”
- More intrinsics were added, like various AdvSimd.Load* and AdvSimd.Store*, as well as improvements in Span<T>.Clear/Fill plus various Unsafe methods, such as BitCast.
- Complex variance checks were simplified for arrays with sealed element types: sealed types make many comparisons easier/faster.
Native AOT
When building with Native AOT you get a standalone binary which contains “all of the assembly code for the whole app, inclusive of the code for any core library functionality accessed.” That means it’s very important to do good “shaking”, so that you get only what you actually need, not what it merely looks like you might need. Otherwise, you get the kitchen sink.
You can make improvements in this space either by improving the quality of the dependency analysis so that it sees “more clearly”, or, often more easily, by restructuring the dependencies so that it is obvious what is used and what isn’t.
The way this fails typically goes something like this:
- the code uses some kind of dynamic dispatch like a string to pick an implementation and the compiler can’t know which one
- the code has many generic variants that are mostly the same but different enough not to fold
- highly polymorphic code like ToString implementations might be reachable if the right type is around
This kind of thing happened in System.Security.Cryptography, in ASP.NET due to Select<T> variants, in ArrayPool<T>, in Write<TState>, in AppDomain.ToString, in Microsoft.Extensions.DependencyInjection, and in System.Private.CoreLib for Environment.Version and RuntimeInformation.FrameworkDescription.
Additionally, Native AOT now deduplicates different methods with the same code; unused interface methods can be trimmed away; the compiler can now fully remove the actual interface types; and static constructors only need to be kept if a static field is actually accessed.
Threading
.NET 9 adds a System.Threading.Lock type you can use instead of lock(object). The compiler is enlightened to use this type when you are on the right runtime, and the resulting code is better.
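Usage is the familiar lock statement; a sketch:

```csharp
using System.Threading;

public sealed class Counter
{
    // A dedicated Lock field instead of lock-ing a plain object. With C# 13,
    // the lock statement recognizes the Lock type and uses its faster
    // EnterScope path instead of Monitor.
    private readonly Lock _lock = new();
    private int _count;

    public void Increment()
    {
        lock (_lock)
        {
            _count++;
        }
    }
}
```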
Good old Interlocked gets the ability to operate over types smaller than int. This is good for space savings of course, but it also can help Parallel.ForAsync<T>. Also, Exchange and CompareExchange have had their class constraint removed, which means Exchange<T> and CompareExchange<T> will work for reference types, primitive types, or enum types.
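A sketch of what the small-type support enables (hypothetical state flag):

```csharp
using System.Threading;

class Worker
{
    private byte _state; // 0 = idle, 1 = running; no need to widen to int

    // .NET 9 adds Interlocked overloads for types smaller than int, so a
    // byte-sized flag can be updated atomically without an int-sized field.
    public bool TryStart() =>
        Interlocked.CompareExchange(ref _state, (byte)1, (byte)0) == 0;
}
```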
.NET 9 also “intrinsifies the Interlocked.And and Interlocked.Or methods for additional platforms; previously they were specially handled on Arm, but now they’re also specially handled on x86/64.”
For Task completion, WhenAll gets improvements and WhenEach is introduced, allowing you to iterate over tasks as they complete.
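WhenEach pairs naturally with await foreach; a small sketch (hypothetical download scenario):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

static class Downloads
{
    // Tasks are yielded in completion order, replacing the old
    // WhenAny-in-a-loop pattern that re-scanned the whole list each time.
    static async Task ReportAsync(HttpClient client, string[] urls)
    {
        Task<string>[] tasks = Array.ConvertAll(urls, client.GetStringAsync);
        await foreach (Task<string> finished in Task.WhenEach(tasks))
        {
            Console.WriteLine($"got {(await finished).Length} chars");
        }
    }
}
```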
ECMA 335 defines the official memory model for .NET, but real implementations, including coreclr, generally had stronger guarantees. The memory model .NET actually guarantees has now been documented. This allowed some redundant uses of volatile in the code to be safely removed.
“Marking fields or operations as volatile can come with an expense, depending on the circumstance and the target platform. For example, it can restrict the C# compiler and the JIT compiler from performing certain optimizations.”
.NET 8 could inline the fast path of thread-local state (TLS). In .NET 9, Native AOT gets this improvement as well.
On Linux, implementing GC synchronization comes down to membarrier with MEMBARRIER_CMD_PRIVATE_EXPEDITED, which requires an earlier MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED. Registration is a lot cheaper if there is only one thread when it happens, and now that is usually the case. The startup savings of about 10ms are consequential.
Reflection
Code introspection with Reflection is both powerful and costly, so any improvements are very helpful. Memory economy is the first line of defense: converting internals to ReadOnlySpan<T> where possible and avoiding costly String.Split operations is good practice everywhere.
The new Delegate.EnumerateInvocationList<TDelegate> method returns an allocation-free enumerable for iterating through the delegates.
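A sketch of where that helps, e.g. raising an event defensively, handler by handler:

```csharp
using System;

static class Events
{
    // GetInvocationList allocates an object[]; EnumerateInvocationList
    // walks the same list with no allocation.
    static void RaiseSafely(Action? handlers)
    {
        foreach (Action handler in Delegate.EnumerateInvocationList(handlers))
        {
            try { handler(); }
            catch (Exception e) { Console.Error.WriteLine(e); }
        }
    }
}
```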
ActivatorUtilities.CreateInstance gets improvements in .NET 9 which directly benefit dependency injection solutions, heavy reflection customers.
.NET 9 gives “a speedup for field access via a FieldInfo by employing an internal FieldAccessor that’s cached onto the FieldInfo object.” Fields are important.
.NET 9 extends [UnsafeAccessor] to generics as well, thereby avoiding reflection costs to access private members in, say, mocking code.
Several new members became intrinsics in .NET 9, such as typeof(T).IsPrimitive. Importantly, key intrinsics like this enable other optimizations, like more inlining, that can cause buckets of code to collapse. Good building blocks are essential for this though. Inlining plus dead code elimination showed a nearly 50x improvement in some benchmarks.
Numerics
Primitives are everywhere and in .NET 9 “a multitude of PRs have gone into reducing overheads of various operations on these core types.”
- DateTime.TryParse gets some boxing savings by not computing error info it doesn’t need in the TryParse case
- Span types make great lookup tables; this pattern now appears in DateTime and TimeSpan
- various Math.Round and MathF.Round overloads got some love
- Math.SinCos and MathF.SinCos use RuntimeHelpers.IsKnownConstant to sometimes get constant output overall
- Signed division can be replaced with unsigned division if the JIT can prove that both the numerator and denominator are non-negative. This enables more >> optimizations.
- Nullable<T> gets several improvements to make it cheaper, especially with generics
- .NET 9 also optimizes castclass for Nullable<T>; sounds not so important, but it comes up in string interpolation. Also, the JIT now might know that the cast target is for sure null (because of previous inlining) so all kinds of stuff can be skipped
- In .NET 9 the JIT can “inline the fast path of the unboxing helper that’s used when extracting a Nullable<T> from an object.”
Stephen says, “A bunch of nice changes have landed for .NET 9.”
- vectorized BigInteger’s byte-based constructors using vectorized operations from MemoryMarshal and BinaryPrimitives
- added support for the "b" format specifier for formatting and parsing BigInteger as binary, which vectorizes better
- chooses the really “big integer parsing algorithm” more often: the threshold dropped from 20,000 digits down to 1233, where it still works better
- leading and trailing zeros in parsing powers of 10 are handled better
- BigInteger.Equals learns to use MemoryExtensions.SequenceEqual instead of walking the arrays
- BigInteger.IsPowerOfTwo learns to use ContainsAnyExcept to find out if all elements after a certain point are 0
- BigInteger.Multiply gets an improvement that helps when the first value is much larger than the second
- BigInteger formatting removed various temporary buffer allocations and got some calculation improvements
“A large stable of numerical operations is now exposed on every numerical type as well as on a set of generic interfaces those types implement. But sometimes you want to perform the same operation on a set of values rather than on an individual value.” This is where TensorPrimitives comes into the picture. It provides “a plethora of numerical APIs, but for spans of them rather than for individual values.” .NET 8 had 40 such methods, but now the list is over 200 in size. So yes, we have a plethora. “And they’re exposed using generics, such that they can work with many more data types than just float”. Stephen also reports, “progress has been made towards also exposing the same set of operations on these vector types.”
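A taste of the API shape (TensorPrimitives lives in the System.Numerics.Tensors package; the values here are made up):

```csharp
using System;
using System.Numerics.Tensors;

static class TensorDemo
{
    static void Demo()
    {
        ReadOnlySpan<float> x = [1f, 2f, 3f, 4f];
        ReadOnlySpan<float> y = [10f, 20f, 30f, 40f];
        Span<float> sum = new float[4];

        TensorPrimitives.Add(x, y, sum);         // element-wise, vectorized
        float dot = TensorPrimitives.Dot(x, y);  // one vectorized reduction
        Console.WriteLine(dot);                  // 300
    }
}
```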
Strings, Arrays and Spans
These are the workhorses of just about every subsystem in .NET, so they are used implicitly nearly ubiquitously. As such, improvements in these areas disproportionately benefit real-world cases.
In .NET 8, “SearchValues<T> enables optimizing searches, by pre-computing an algorithm to use when searching for a specific set of values (or for anything other than those specific values) and storing that information for later repeated use. Internally, .NET 8 included upwards of 15 different implementations that might be chosen based on the nature of the supplied data. The type was so good at what it did that it was used in over 60 places as part of the .NET 8 release. In .NET 9, it’s used even more.” For instance, Regex.Escape uses it to find characters that need escaping.
The utility comes from the fact that the values and the stored object holding them let you choose an algorithm once and re-use it, even if it needs to make lookup tables or whatever. Many of these algorithms work very well on the given character ranges and many of them can match multiple characters with SIMD instructions.
.NET 9 adds SearchValues<string> support, allowing you to search for one of many strings in a Span<char>. “Until .NET 9 there have not been any built-in methods for doing multi-string search, so this new support both adds such support and adds it in a way that is highly efficient.”
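The usage pattern is create-once, search-many (the word list here is my own invention):

```csharp
using System;
using System.Buffers;

static class LogScanner
{
    // The expensive part, choosing an algorithm and building tables,
    // happens once in Create, not on every search.
    private static readonly SearchValues<string> s_alerts =
        SearchValues.Create(["error", "fail", "panic"], StringComparison.OrdinalIgnoreCase);

    static bool LooksBad(ReadOnlySpan<char> logLine) =>
        logLine.IndexOfAny(s_alerts) >= 0;
}
```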
How do we get speed out of this? Well, different implementations for different kinds of lists. No items in the list, one item in the list, search for the least common letter in each matching word, and more. The Rabin-Karp rolling hash is an option you just get for asking.
This area is rich with possibilities; the mixture of algorithm and instruction set offers many opportunities to get the best value out of your hardware without having your brain explode. I for one do not want to have to know that “older instruction sets don’t have VPERMB, which is exposed as Avx512Vbmi.PermuteVar64x8”. My head hurts thinking about it.
The compiled flavor of the regular expression engine has the advantage that it can be enlightened for important patterns. It might take a long time to run because it could backtrack, but backtracking is often very limited and the compiled tricks can be very good. In .NET 9, SearchValues<string> appears as a tool in matching. At least a half dozen other generated-code improvements went into this engine.
The non-backtracking engine uses finite automata, and these techniques have been well understood for decades. But there is still room for improvement:
- It can use deterministic finite automata (DFA) or non-deterministic finite automata (NFA) rules; you fall back to NFA when the size of the automaton gets too large, and this limit is now 125,000 nodes
- In many patterns many characters are equivalent; these classes are called minterms. E.g., in [a-z]* all characters are either in the class [a-z] or not, so all the rules can be in terms of those two classes, which can get you a lot of compression. “All but the most niche patterns have fewer than 256 minterms”, so one byte of storage is enough for that stuff
- Timeout checks in the non-backtracking engine are largely redundant but were adding overhead; not so much anymore
- The inner loop is specialized for the most common patterns, skipping a few tests; this loop is super hot and only a few instructions in the first place!
Last but not least, Regex Split overloads can support span inputs and return an enumerator of Range objects pointing into the original data. No allocs are the best allocs.
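A sketch of the span-based splitting, assuming the new Regex.EnumerateSplits API:

```csharp
using System;
using System.Text.RegularExpressions;

static class Fields
{
    // Each Range points back into the input span; slicing with it yields
    // the pieces without allocating a string per field.
    static int CountNonEmpty(ReadOnlySpan<char> line)
    {
        int count = 0;
        foreach (Range r in Regex.EnumerateSplits(line, @"\s*,\s*"))
        {
            if (!line[r].IsEmpty)
                count++;
        }
        return count;
    }
}
```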
“.NET now has a fully-featured Base64Url type that is also very efficient. It actually shares almost all of its implementation with the same functionality on Base64 and Convert, using generic tricks to substitute the different alphabets in an optimized manner.”
Combine that with the “AVX512-optimized Base64 encoding/decoding implementation”, and you get “optimized Base64 encoding and decoding for Arm64.”
Still hungry for more? “The new Convert.ToHexStringLower methods may be used to go directly to the lower-case version;” and “TryToHexString and TryToHexStringLower … format directly into a provided destination span rather than allocating anything.” For parsing, “overloads of FromHexString … write into a destination span rather than allocating a new byte[] each time.”
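A quick sketch of both conveniences together (made-up scenario):

```csharp
using System;
using System.Buffers.Text;

static class Ids
{
    static void Print(byte[] hash)
    {
        // URL-safe alphabet, no padding: nothing to escape in a query string.
        string token = Base64Url.EncodeToString(hash);

        // Straight to lower-case hex, no ToHexString(...).ToLowerInvariant().
        string hex = Convert.ToHexStringLower(hash);

        Console.WriteLine($"{token} {hex}");
    }
}
```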
I got used to Span working on Midori, and I don’t know how much of an influence Midori was on the .NET Framework, but man was I happy to see Span land in a place where many could use it. If only we also had SparseBuffer, but I digress.
Continuing the years-long pattern of #spaninalltheplaces, “[T]he C# params keyword [may] be used with more than just array parameters, but rather any collection type that’s usable with collection expressions… that includes span.”
Add a params ReadOnlySpan<T> overload and callers will use a stack-allocated span to access your code. Using this technique, .NET 9 added “~40 new overloads for methods that didn’t previously accept spans and now do, and added params to over 20 existing overloads that were already taking spans.” This takes advantage of the fact that ReadOnlySpan<T> can benefit from stackalloc. Not only that, Span initialization can often take advantage of good instructions like cpblk.
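Adding such an overload is a one-liner for the library author; a sketch:

```csharp
using System;

static class Logger
{
    // With C# 13 params collections, the compiler gathers the arguments
    // into a stack-allocated span at the call site: no string[] per call.
    static void Log(params ReadOnlySpan<string> parts)
    {
        foreach (string part in parts)
            Console.Write(part);
        Console.WriteLine();
    }

    static void Demo() => Log("GET ", "/index.html ", "200");
}
```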
Span in more places? How about patterns for nint and nuint, decimal, and string? Ok, you can have no-allocation initialization for those too.
MemoryExtensions.Split and MemoryExtensions.SplitAny likewise get overloads for ReadOnlySpan<T>, which means no allocation to call them and no allocation to process their SpanSplitEnumerator<T>. Same for StringBuilder.Replace.
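A sketch of the enumerator in action:

```csharp
using System;

static class Csv
{
    // Split yields Range values over the input; no substring allocations,
    // and int.Parse accepts the sliced span directly.
    static int Sum(ReadOnlySpan<char> line)
    {
        int total = 0;
        foreach (Range r in line.Split(','))
            total += int.Parse(line[r]);
        return total;
    }
}
```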
Maybe #spaninalltheplaces will really happen soon.
Collections
The LINQ to Objects implementation seems to be composed of IEnumerable<T> chains in some tree shape. Actually, the situation is somewhat more complicated. To avoid lots of redundant interface calls there are more complicated interior nodes that know something about their source. Later nodes can combine with these to get their results at less cost. See “Deep Dive on LINQ” and “An even DEEPER Dive into LINQ” for a more in-depth exploration of how exactly this works. Suffice to say this got complicated over the years.
In .NET 9 this all gets consolidated into a base class, Iterator<TSource>. It simplifies the code base and gives us some benefits:
- Virtual dispatch is generally a bit cheaper than interface dispatch.
- Type tests can be consolidated
- All the virtual methods are implemented by any iterator deriving from the base class
With this in place, lots of things can get better. For instance, Any can use TryGetFirst, which is always there, to avoid tons of temporary allocations. The old way was, well, messier, with many combos. I don’t like combos.
There are many more modest improvements throughout LINQ: various checks for empty cases, better handling of Enumerable.Chunk, improvements in the predicated versions of Any, All, Count, First, and Single. Perhaps the most ambitious change significantly simplified ToArray and ToList.
This is a very good thing because “ToArray in particular is used so ubiquitously that over the years, many folks have attempted to optimize it. In doing so, however, it’s gotten too complex for its own good.” The ToArray improvements used a new helper called SegmentedArrayBuilder, which was then re-used in ToList with some adaptation.
Some more ToList improvements happen on the result of a Distinct or Union by enabling HashSet<T>’s CopyTo implementation to be used. Add to this some special cases for 0- and 1-length outputs used with Distinct, Union and OrderBy; all of these do not require the general algorithms and are pretty common.
One of my favorites: OrderBy followed by First or Last can be greatly simplified, as there is no need to do the sort at all. ThenBy can mess this up, so the code has to handle all those cases. Still, it’s common enough to be worth the effort.
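The user-visible pattern is as mundane as it gets (my example); .NET 9 just stops doing the O(N log N) sort for it:

```csharp
using System.Linq;

record Person(string Name, int Age);

static class People
{
    // Recognized shape: OrderBy followed immediately by First. The result
    // can be computed with a single O(N) scan for the minimum-age element.
    static Person Youngest(Person[] people) =>
        people.OrderBy(p => p.Age).First();
}
```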
There are others here, the list is quite extensive.
Oh joy and happiness! You no longer have to materialize a string to do a key lookup in a dictionary. If the key is elsewhere, like part of a ReadOnlySpan<char> or ReadOnlySpan<byte>, you can use IAlternateEqualityComparer<TAlternate, T> to provide alternate comparisons and expose methods using TAlternateKey.
“[Now] Dictionary<TKey, TValue>, ConcurrentDictionary<TKey, TValue>, FrozenDictionary<TKey, TValue>, HashSet<T>, and FrozenSet<T> all do exactly that.” This is great because now “I [only] need to materialize the string for each ReadOnlySpan<char> in order to store it in the dictionary”; the lookups themselves allocate nothing.
Once upon a time I wrote a streaming file processor with lots of counts and sums in dictionaries and I had to do all manual collections to avoid billions of temporary string allocations. No one ever has to do this again. All the plumbing has been done for you already.
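Here is roughly what that plumbing looks like today; a word-count sketch using the alternate lookup:

```csharp
using System;
using System.Collections.Generic;

static class WordCount
{
    // The span acts as the key for lookups; a string is materialized only
    // when a word is inserted for the first time.
    static Dictionary<string, int> Count(ReadOnlySpan<char> text)
    {
        var counts = new Dictionary<string, int>();
        var lookup = counts.GetAlternateLookup<ReadOnlySpan<char>>();

        foreach (Range r in text.Split(' '))
        {
            ReadOnlySpan<char> word = text[r];
            if (word.IsEmpty) continue;
            lookup[word] = lookup.TryGetValue(word, out int n) ? n + 1 : 1;
        }
        return counts;
    }
}
```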
For fun, we can take this example one step further. .NET 6 introduced CollectionsMarshal.GetValueRefOrAddDefault, and “as part of this alternate key effort, a new overload of GetValueRefOrAddDefault was added that works with it, such that the same operation can be performed with a TAlternateKey.” Lovely.
And! The comparer implementation for string/ReadOnlySpan<char> was extended to apply to EqualityComparer<string>.Default. This means that “if you don’t supply a comparer at all, these collection types will still support ReadOnlySpan<char> lookups.” And this all works for HashSet, too.
Thank you!
A few other things:
- TrimExcess(int capacity) was added to HashSet<T>, Queue<T>, and Stack<T>, enabling more fine-grained control over memory
- IsSubsetOf, IsProperSubsetOf, and SetEquals were improved
- Dictionary<T, T> was replaced with the cheaper HashSet<T> in a few places
Some of the less used collections also got improvements:
- In PriorityQueue<TElement, TPriority>, bulk inserting into an empty queue with EnqueueRange(IEnumerable<TElement>, TPriority) needs no heapification: just copy and go
- In BitArray the many vectorized ops add Vector512 support
- List<T> loses an unnecessary copy in Insert
- In FrozenDictionary<TKey, TValue> and FrozenSet<T> for strings, the bitmap of string lengths in the table was extended to 64 bits. This gives an accurate early-out.
Compression
Traditionally .NET used whatever zlib happened to be on the machine in question. Windows has none, so intel/zlib was used, “but the intel/zlib repository was archived and is not actively being maintained by Intel.”
“To simplify things, to improve consistency and performance across more platforms, and to move to an actively supported and evolving implementation, this changes … .NET 9 now includes the zlib functionality built-in across Windows, Linux, and macOS, based on the newer zlib-ng/zlib-ng.”
There isn’t much to say here. Naturally this is all new code, so it will perform differently; Stephen includes benchmarks in the relevant section.
Cryptography
There are several general and impactful performance improvements in this area.
- Random number generation: we can use a single call to NextBytes for random numbers whose range is ≤ 256 and a power of 2. Get your bytes and mask if needed. This comes up a lot and saves many calls to the system RandomNumberGenerator.
- System.Security.Cryptography uses NativeMemory.Alloc and NativeMemory.Free instead of more costly pinned objects.
- Interop paths for CngKey properties like ExportPolicy, IsMachineKey, and KeyUsage work with an int on the stack, passing a pointer to it to the OS, avoiding the need to allocate.
- The crypto libraries often use a CloneByteArray helper. This helper was cloning empty arrays for no good reason.
- PublicKey uses AsnEncodedData; sometimes ownership of temporary instances was transferred to it, but it still cloned them.
- AddEntry now takes ReadOnlySpan<string> (spans in all the places!) instead of string[]. Such call sites use stack space to store the strings passed to AddEntry instead of the heap.
- OidCollection didn’t have a way to specify the size of the collection to avoid growth overhead, even though many of the callers did know the exact required size. Likewise, CborWriter lacked the ability to presize.
- CborWriter was increasing its size by a fixed amount, which led to O(N²) growth cost. It’s now on a doubling plan like other collections.
- A few properties on CngKey (Algorithm, AlgorithmGroup, and Provider) were memoized because the answer is always the same. “These are particularly expensive because the OS implementation of these functions needs to make a remote procedure call to another Windows process to access the relevant data.”
Networking
.NET 9 includes some improvements in steady state HTTP performance but also in TLS connection establishment.
- reduced allocations associated with the TLS handshake
- avoiding some unnecessary SafeHandle allocation
- clients using client certificates can benefit from TLS resumption
- the HTTP connection pool synchronization mechanism uses an opportunistic layer of lockless synchronization via a ConcurrentStack<T>
With a connection established, the runtime:
- uses the vectorized helper Ascii.FromUtf16 to write the request head
- avoids extra async state machines needed only in rare logging cases
- removes allocations by computing and caching some bytes that need to be written out on every request
- special-cases the most common media types used in JsonContent and StringContent
- special-cases TryAddWithoutValidation for multiple values provided by IList<string>
- avoids large char[] allocations in the parsing of Alt-Svc headers by using ArrayPool
The WebUtility and HttpUtility types both got more efficient:
- HtmlEncode begins using the faster helper SearchValues<char>
- UrlEncode similarly gets wins using SearchValues<char>
- UrlEncode now uses string.Create and does its work in place
- UrlEncodeToBytes and UrlDecodeToBytes use stack space for smaller inputs, and use SearchValues<byte> to optimize the search for invalid bytes
- UrlPathEncode uses ArrayPool for memory
- JavaScriptStringEncode leverages SearchValues (SearchValues for all!)
- ParseQueryString uses stackalloc for smaller input lengths, and uses string.Substring with span slicing
Other memory savings: Uri gained TryEscapeDataString and TryUnescapeDataString, which write to spans; these are used in FormUrlEncodedContent for space and speed gains.
Improvements elsewhere helped out WebSockets, with TryValidateUtf8 costs dropping significantly.
JSON
System.Text.Json got several improvements in .NET 9.
There are new “JsonSerializer.SerializeAsync overloads that target PipeWriter in addition to the existing overloads that target Stream. That way, whether you have a Stream or a PipeWriter, JsonSerializer will natively work with either.” ASP.NET uses System.IO.Pipelines internally, so no more shims.
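The new overloads make the no-shim version a one-liner; a sketch (System.IO.Pipelines types assumed in scope):

```csharp
using System.IO.Pipelines;
using System.Text.Json;
using System.Threading.Tasks;

static class JsonPipe
{
    // Serialize straight into a PipeWriter: no Stream adapter sitting
    // between System.Text.Json and System.IO.Pipelines.
    static Task WriteAsync<T>(PipeWriter writer, T value) =>
        JsonSerializer.SerializeAsync(writer, value);
}
```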
JsonSerializer gets an allocation-free parsing solution for enums with the GetAlternateLookup support mentioned earlier.
System.Text.Json gets many other improvements:
- JsonProperty.WriteTo used writer.WritePropertyName(Name); now it can directly write the UTF8 bytes
- Base64EncodeAndWrite can now encode a source ReadOnlySpan<byte> directly into its destination Span<byte>
- JsonNode.GetPath avoids List<string> allocs by extracting its path segments in reverse order and “building the resulting path in stack space or an array rented from the ArrayPool.”
- JsonNode.ToString and JsonNode.ToJsonString use existing caches of PooledByteBufferWriter and Utf8JsonWriter, avoiding allocs
- JsonObject uses the count of items in its input Enumerable, if available, to pre-size its dictionary
- JsonValue.CreateFromElement accessed JsonElement.ValueKind repeatedly to determine how to process the data; it was “tweaked” to only access the property once
- JsonElement.GetRawText can use JsonMarshal.GetRawUtf8Value to return a span over the original data, no allocs
- Utf8JsonReader and JsonSerializer now support multiple top-level JSON values in an input; no more pre-parsing to avoid errors due to multiple objects, and no more work-around code (see the sketch below)
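A sketch of that last item, assuming the new AllowMultipleValues reader option (treat the exact option name as my best reading of the release notes):

```csharp
using System;
using System.Text;
using System.Text.Json;

static class Ndjson
{
    static void Demo()
    {
        byte[] utf8 = Encoding.UTF8.GetBytes("""{"id":1} {"id":2} {"id":3}""");

        // Before .NET 9, the reader would fail upon hitting the second
        // top-level value; now it can keep going.
        var reader = new Utf8JsonReader(utf8,
            new JsonReaderOptions { AllowMultipleValues = true });

        int objects = 0;
        while (reader.Read())
            if (reader.TokenType == JsonTokenType.StartObject)
                objects++;

        Console.WriteLine(objects); // 3
    }
}
```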
Diagnostics
“System.Diagnostics.Metrics.Meter Counter and UpDownCounter are often used for hot-path tracking of metrics like number of active or queued requests. In production environments, these instruments are frequently bombarded from multiple threads concurrently.”
This all needs to be thread-safe and highly scalable. In .NET 9 various lock-free patterns were used to accelerate these.
- Interlocked.CompareExchange is used to do the addition of doubles
- Use more than one counter to accumulate the total for less contention
- Cache align the doubles so that there is no false sharing of cache lines
More about false sharing in “Let’s Talk Parallel Programming” in the Deep .NET series.
Measurement gets a new constructor that takes a TagList, avoiding the overhead associated with adding custom tags.
TagList models a list of key/value pairs, but it has built-in fields for a small number of tags so it can avoid allocating an array. Now an [InlineArray] is used, which enables access via spans in all cases.
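A sketch of TagList in the common case (hypothetical meter and tag names); with only a few tags nothing is allocated for the tag storage:

```csharp
using System.Diagnostics;
using System.Diagnostics.Metrics;

static class AppMetrics
{
    private static readonly Meter s_meter = new("MyApp");
    private static readonly Counter<long> s_requests =
        s_meter.CreateCounter<long>("requests");

    static void Record(string method, int status)
    {
        // TagList stores a handful of tags inline (now via [InlineArray]),
        // so this hot path does not allocate a tag array.
        var tags = new TagList
        {
            { "http.method", method },
            { "http.status", status }
        };
        s_requests.Add(1, tags);
    }
}
```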
Peanut Butter
Stephen’s “peanut butter” list is really all over the place. I’ll just briefly mention some of the topics here.
- StreamWriter.Null becomes an instance of NullStreamWriter
- NonCryptographicHashAlgorithm changes Append to allocate a small temporary Stream object that wraps the NonCryptographicHashAlgorithm instance for throughput benefits
- virtual was removed from “a smattering” of internal members that didn’t need it
- a bunch of Span<T> instances were replaced with ReadOnlySpan<T> to reduce overhead associated with the covariance check
- hundreds of fields were upgraded to readonly or const when possible
- In MemoryCache, only one thread can trigger compaction at a time
- BinaryReader needed extra allocations to read text; these are now pay-for-play
- ArrayBufferWriter.Clear sets the written count to 0 and also clears the underlying buffer. The new ResetWrittenCount only resets the written count.
- The File class methods like File.WriteAllText get Span-based overloads. #spansinalltheplaces
- MailAddressCollection uses a builder instead of string concat
- The config source generator gets some changes to avoid unnecessary display class allocations for some lambdas
- StreamOnSqlBytes properly overrides Read/Write, avoiding base class costs (this pattern has come up often)
- NumberFormatInfo uses singletons to initialize its NumberGroupSizes, CurrencyGroupSizes, and PercentGroupSizes instead of new arrays
- ColorTranslator avoids many P/Invokes by getting all color components in one call (a Windows-only issue)
Acknowledgements
I have to thank Stephen for the amazing source material. This summary is roughly 5% the size of the full write up and I’m sure I have introduced some inaccuracies in the summarization but hopefully not too many. If there are any errors here, I’m sure they are my own.
If you’re looking for more clarity, or the supporting benchmark details, click through the section headers to get the full story. This article is a good gateway to the full material.