On adopting high end perf tools to study micro-architectural phenomena

Rico Mariani
2 min readApr 17, 2022

[I’m moving some of my more interesting old blogs from the web archive to someplace they can actually be found. Keep in mind these were written a long time ago.]

  • 08/29/2014

Huge words of caution: you can bury yourself in this kind of stuff forever and for my money it is rarely the way to go. It’s helpful to know where you stand on CPI (cycles per instruction), for instance, but it’s much more typical to get results by observing that you (e.g.) have a ton of cache misses and therefore should use less memory. Using less memory is always a good thing.

You could do meaningful analysis for a very long time without resorting to micro-architectural phenomena simply by studying where your CPU goes.

It is not only the case that (e.g.) ARM processors do things differently than (e.g.) x86 processors, it is also the case that every x86 processor family you have ever heard of does things differently than every other one you have ever heard of. But that turns out not to be that important for the most part, because the chief observations, like “we branch too much,” are true universally. Just as “we use too much memory” is basically universally true.

The stock observations that you should:

1. Use less memory
2. Use fewer pointers and denser data structures
3. Not jump around so much

are essentially universally true. The question really comes down to what you can get away with on any given processor (because its systems will save the day for you). But even that is a bit of a lie, because the next question is “what else could [the system] be doing and your program would still run well?” The fact is there is always other stuff going on, and if you minimize your use of CPU resources generally, you will be a better citizen overall.

In short, the top level metrics (CPU, Disk, Memory, Network) will get you very far indeed without resorting to mis-predicts and the like. If you want to use the tools effectively, with broad results, I strongly recommend that you target the most important metrics, like L2 cache misses, and reduce them. That’s always good. Pay much less attention to the specific wall-clock consequences in lab scenarios and instead focus on reducing your overall consumption.
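On Linux, for instance, `perf stat` can count exactly these kinds of metrics for a whole run (this is my example, not from the original post; `./your_app` is a placeholder for your own binary, and the exact event names vary by processor):

```shell
# Count last-level cache traffic and branch mispredicts for one run.
# Run `perf list` to see which event names your machine supports.
perf stat -e LLC-loads,LLC-load-misses,branch-misses ./your_app
```

The point of a run like this is the trend over time, not the wall-clock number from any one lab machine: if the miss counts go down release over release, you are consuming less of the machine.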

And naturally this advice must be tempered with a focus on your customers’ actual problems, and forgive me for being only approximately correct in 400 words or less.



Rico Mariani

I’m an Architect at Microsoft; I specialize in software performance engineering and programming tools.