How I Approach Performance Investigations
I’m often asked about doing performance investigations. I’ve done thousands of these over the years, and I tend to give people the same general advice. Sure, every case is different — the technology matters, the particulars matter — but there are a lot of similarities, too. I think there’s a general approach that works pretty well, so I’ll go over what I typically tell people when they ask, “How do you do this?”
Start With a Workload You Understand
The first thing is: you need a workload you can understand. This helps you get a sense of what the problem actually is. What are we looking for? What does success look like? Where are we now? From a customer perspective, what does “good” mean?
Often, the metrics customers care about aren’t the ones most useful to you during analysis — but they do matter. Ultimately, “if the customer doesn’t like it, then it doesn’t matter what you did.” You have to make a difference from the customer’s point of view.
What Customers Care About
Customers might care about latency, throughput, heat, battery life — who knows. That’s just the performance side of it. Of course, “it’s easy to make it fast if it doesn’t have to work.” You still have to maintain correctness. Maybe there’s some wiggle room for errors, maybe not, but either way, correctness needs to be at least preserved.
Latency Is a Dead End (Sometimes)
People often get excited about time. I don’t like doing performance investigations based solely on time. Time is not usually probative. Sure, customers care about latency, and sometimes it’s the way you explain improvements — but latency data often doesn’t tell you what to do next.
I like to call latency data “data I can only cry about.” You regressed by 50 milliseconds? Boohoo. Now what? Time is a symptom of the problem, not the problem.
Consumption Metrics: Where the Real Clues Are
Instead, look at consumption metrics. That's where the improvements usually come from. Windows Task Manager (for instance) shows you CPU, disk, network, memory, and GPU for a reason: these are the key consumption metrics on a modern machine. They tend to be actionable.
Before you grab a complicated profiler, just look at these numbers:
- High CPU? Okay, maybe it’s a compute-bound problem.
- High disk I/O? What files are you reading/writing? In what order? Are you seeking all over the place?
- High network? What ports? What volume? What sizes?
These metrics give you follow-up actions. You can do something with this kind of information.
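Taking that first look can be as simple as snapshotting a few counters around the work. Here's a minimal sketch of the idea on POSIX systems using getrusage(); the function name and the choice of fields are mine, purely for illustration (on Windows you'd look at Task Manager or the equivalent process APIs).

```cpp
// A sketch only: snapshot a few consumption counters for the current process.
// The function name and chosen fields are illustrative.
#include <sys/resource.h>
#include <cstdio>

void print_consumption_snapshot(const char* label) {
    rusage ru{};
    if (getrusage(RUSAGE_SELF, &ru) != 0) return;

    double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;  // CPU time in user code
    double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;  // CPU time in the kernel

    std::printf("[%s] cpu user=%.3fs sys=%.3fs majflt=%ld in=%ld out=%ld\n",
                label, user, sys,
                ru.ru_majflt,    // major page faults (had to hit the disk)
                ru.ru_inblock,   // filesystem blocks read
                ru.ru_oublock);  // filesystem blocks written
}
```

Call it before and after the scenario and diff the numbers. If the deltas are nowhere near what the workload should plausibly need, that's the thread to pull.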
Cardinality and Consumption
Cardinality is important. Ask: What does my input look like? What does the problem look like?
If you’re reading megabytes from disk and your input files are 500KB, something’s off. Where are the extra reads coming from?
I used to work on browsers. Sometimes I’d see extra CPU in the layout code — but that code hadn’t changed in weeks. So why more CPU? Turns out, upstream code made more of the box tree dirty. So even though the layout code was unchanged, it was doing a lot more work.
That tiny bug upstream — maybe just a few microseconds of work — caused a big problem. It wouldn’t show up in a CPU profiler, but you can catch it by instrumenting the right parts. For example, measure how many boxes are laid out. If that number jumps, something’s dirtying boxes that shouldn’t be. Get to the bottom of what code is responsible for generating the new work.
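Here's roughly what that kind of instrumentation looks like: a cheap counter you compare run over run. All the names here (layout_box, g_boxes_laid_out, report_layout_stats) are hypothetical stand-ins, not code from any real browser.

```cpp
// A sketch of cheap instrumentation that catches "extra work" a sampling
// profiler can miss. All names are hypothetical.
#include <atomic>
#include <cstdio>

std::atomic<long> g_boxes_laid_out{0};

void layout_box(/* Box& box */) {
    g_boxes_laid_out.fetch_add(1, std::memory_order_relaxed);
    // ... the actual layout work ...
}

void report_layout_stats(const char* scenario) {
    // Compare this count run over run: if it jumps while the scenario is
    // unchanged, something upstream is dirtying boxes that should be clean.
    std::printf("%s: boxes laid out = %ld\n",
                scenario, g_boxes_laid_out.exchange(0));
}
```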
Remember: “It’s easy to make it fast if it doesn’t have to work.” If the layout was broken before, and now it’s correct but slower, what you had was a correctness bug. Fixing it is still a win, even if it comes with a performance cost. And now that it works correctly, maybe you can make it faster properly.
Are You Getting Good Mileage?
You can be at 100% CPU and still be inefficient. Consider how well your processor is being used. One way is to look at cycles per instruction (CPI):
- A great CPI is close to 1.
- If you’re seeing 5, 15, 20… that’s probably bad.
High CPI means you’re probably missing cache, mispredicting branches, or, worse yet, taking page faults. The processor isn’t being used effectively. Maybe your algorithm is fine, but your data structures aren’t. Too many pointers? Try using more value-oriented structures.
A single cache miss can cost 150–200 instructions. A page fault? More like a million. These costs add up fast.
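To make the “too many pointers” point concrete, here's a small illustrative sketch. Both functions add up the same numbers, but the node-based version chases a pointer for every element while the value-based version walks contiguous memory; on typical hardware the second keeps CPI far closer to 1.

```cpp
// Illustrative only; actual numbers vary by machine and data size.
#include <cstdint>
#include <vector>

struct Node { int64_t value; Node* next; };    // pointer-rich: one potential miss per hop

int64_t sum_list(const Node* head) {
    int64_t total = 0;
    for (const Node* n = head; n != nullptr; n = n->next)
        total += n->value;                      // each n->next can miss the cache
    return total;
}

int64_t sum_values(const std::vector<int64_t>& values) {
    int64_t total = 0;
    for (int64_t v : values)                    // sequential, prefetch-friendly
        total += v;
    return total;
}
```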
Don’t Look at the Wrong Thing
You’ll waste a lot of time if you’re looking at a CPU profiler and your problem is actually in disk access. Make sure you’re measuring the right thing. Only go to detailed profilers after you’ve identified where the real pain is.
Understand Orchestration
Now, let’s talk about orchestration — the pattern of execution.
- Could (more) operations be parallel?
- Are you waiting for things in the wrong order?
In a web browser, for example, you want to download CSS before images. Without CSS, you can’t do layout. This is orchestration: doing the right things in the right order, overlapping where possible.
Orchestration problems come in many shapes. If your resource usage is reasonable but latency is still bad, this is a place to look. Remember that any resource can be the source of an orchestration problem, and any time you put a lock around a data structure, you have created a software resource. It has wait time, processing time, and so on, just like hardware resources.
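Here's a tiny sketch of that ordering idea, with hypothetical fetch_css/fetch_image/do_layout stand-ins: start the thing layout depends on first, let the independent downloads overlap, and wait only where you must.

```cpp
// A sketch of orchestration with made-up stand-in functions.
#include <future>
#include <string>
#include <vector>

std::string fetch_css(const std::string& url)   { return "/* css for " + url + " */"; }
std::string fetch_image(const std::string& url) { return "bytes of " + url; }
void do_layout(const std::string& css)          { (void)css; }

void load_page(const std::string& css_url, const std::vector<std::string>& image_urls) {
    // Kick off the thing layout depends on immediately...
    auto css = std::async(std::launch::async, fetch_css, css_url);

    // ...and let the image downloads overlap; nothing is waiting on them yet.
    std::vector<std::future<std::string>> images;
    for (const auto& url : image_urls)
        images.push_back(std::async(std::launch::async, fetch_image, url));

    do_layout(css.get());                 // block only on the CSS
    for (auto& img : images) img.get();   // images can finish afterwards
}
```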
Too Many Threads Spoil the Cache
Another classic orchestration problem: way too many threads.
Sometimes people create thousands, even tens of thousands, of threads. This is rarely clever. If you’ve got more threads than CPU cores, you have to ask yourself, “Who am I kidding?” Those threads aren’t running. You’re just creating asynchronous behavior using a heavyweight mechanism.
Use asynchronous patterns instead. Task queues are way cheaper than threads. Threads require stacks and OS bookkeeping, and all of that leads to poor cache behavior. The more you context switch, the more you cold-start your cache. Your TLB. Everything. You could be taking a huge penalty.
Using queues can make a huge difference. If you assign threads to own specific resources, the code that touches each resource looks single-threaded, and sharing becomes simpler. Scheduling becomes clearer. You can implement pushback, drop work on the floor, or prioritize. All of this is harder when you’re drowning in threads.
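As a rough sketch of the queue-based alternative, here's a minimal worker pool: a handful of threads, roughly one per core, pulling tasks from a single queue. The shape and names are mine; in real code you'd likely reach for an existing thread pool.

```cpp
// A minimal sketch of "queues instead of thousands of threads."
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class TaskQueue {
public:
    explicit TaskQueue(unsigned workers = std::thread::hardware_concurrency()) {
        if (workers == 0) workers = 4;   // hardware_concurrency() may report 0
        for (unsigned i = 0; i < workers; ++i)
            workers_.emplace_back([this] { run(); });
    }

    ~TaskQueue() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }

    void post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mu_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (done_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();   // run outside the lock
        }
    }

    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};
```

A fixed pool like this also gives you a natural place to add pushback, drop work, or prioritize, which is exactly what gets hard when every task owns its own thread.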
Managed Workloads and Pointer Discipline
In managed environments, pointer discipline matters. Long-lived data structures should be low in pointers and rich in values.
When I worked on Visual Studio, I used to tell people: “Model your memory like you were going to store it in a database.” Durable data structures should look more like B-trees than, say, n-ary trees. Dense, simple structures help with garbage collection, cache locality, and concurrency.
Avoid objects — identity everywhere forces you to use pointers, and those don’t localize well. Bad locality leads to high CPI. Bad CPI is nerfing your processor back into 2002.
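Here's an illustrative sketch of the “database-shaped” idea: relationships become small integer ids into dense arrays rather than a graph of objects pointing at each other. The types are made up for the example.

```cpp
// A sketch of "model your memory like a database." Types are illustrative only.
#include <cstdint>
#include <string>
#include <vector>

// Pointer-rich object graph: every relationship is a heap reference,
// identity is everywhere, locality is poor.
struct OrderObj;
struct CustomerObj { std::string name; std::vector<OrderObj*> orders; };
struct OrderObj    { CustomerObj* customer; int64_t amount_cents; };

// "Database-shaped" alternative: flat rows in contiguous storage,
// relationships are small integer ids, scans stay in cache.
struct CustomerRow { std::string name; };
struct OrderRow    { uint32_t customer_id; int64_t amount_cents; };

struct Store {
    std::vector<CustomerRow> customers;   // id == index into this vector
    std::vector<OrderRow>    orders;

    int64_t total_for(uint32_t customer_id) const {
        int64_t total = 0;
        for (const OrderRow& o : orders)  // sequential scan over dense rows
            if (o.customer_id == customer_id)
                total += o.amount_cents;
        return total;
    }
};
```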
Know What “Good” Looks Like
Always compare your usage patterns against what would be “a good job” for the workload:
- Don’t send tiny batches to the GPU.
- Don’t do microscopic reads from disk.
- Don’t create a thread for every small task.
- Use dense memory patterns.
Most performance problems fall into a few simple categories. If you pick the right tool and run the right experiment, they’re solvable.
Final Advice
- Always look for the simplest experiment to guide your next step.
- Make lots of hypotheses, expect many to be wrong.
- Verify incrementally.
- Be humble about your theories. I like to say, “If you’re very good at performance, you’re only wrong 95% of the time.”
This kind of pragmatic approach is not so hard in practice and usually leads to good results.
Happy Hunting!