Several people have been asking me for some general advice on how to do performance analysis and so I thought I’d summarize some of the training material I’ve used in a brief article. I’m going to try to cover a variety of cases but, as usual, I’m also going to try to be brief which means the information will only be approximately correct. Please keep that in mind. :)
What your result looks like
It’s important to remember that the output of most performance analysis is a document. The point of that document is to succinctly describe the present situation: what is going wrong, why it’s going wrong, and what might be done about it. Sometimes you’ll be the one doing the work to improve things, especially on smaller projects, but as often as not it’s more efficient for you to summarize the problem for someone who knows the code better than you do.
The above also means that the best performance tools are the ones that allow you to create a compelling document that lucidly describes the problem(s) to someone who maybe isn’t an expert in performance analysis. Tools that do not create useful assets are far less interesting to an analyst. Keep that in mind if you are creating tools…
Where to Start
There are two initial steps.
- Pick some kind of test scenario. This will include the type of system under test as well as the workload. You may end up doing several of these but it’s good to start somewhere. Pick workloads your customers care about.
- Take a very broad look at your scenario and get a sense of what kind of problems you might be looking at.
One of the most important things, no actually, the most important thing you have to do, is to understand which resource or resources are critical to the workload in question. There’s a formal notion of a “critical resource” but that’s less important than the general notion that some things matter and some things don’t. You have to make it your business to understand what matters.
When you know what matters you can then pick the correct metrics and tools for your next steps.
The essential performance metrics are generally very easy to see. On Windows, the performance tab of Task Manager does the job. It’s almost like the performance team decided what to put on that tab…
The primary resources are CPU, Memory, Disk, Network, and GPU. Add Battery if you like.
Once you know which resource(s) matter you will ask yourself these two questions.
- Can I reduce the overall usage of this resource?
- Am I using this resource in an efficient manner?
Under overall usage, consider this: what portion of the usage actually corresponds to forward progress on the problem at hand? Is stuff happening that doesn’t need to happen? What is the minimum work needed to do the job if everything were perfect? How does my usage compare to that?
Under efficiency, consider the way in which the resource is being used. For instance, on most CPUs poor code locality results in much slower execution of the same instructions. Likewise poor disk locality results in tons of seeking and less actual reading of data. On networks, opening and closing connections results in wasted overhead. If you have to read 20MB from somewhere, fine, read it, in nice big chunks… reading it, say, 2 bytes at a time will be bad…
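To make the chunking point concrete, here’s a small sketch that counts how many read() calls it takes to consume the same file in big chunks versus tiny reads. The 1 MB file size and 64 KB chunk size are arbitrary illustrations, not tuned recommendations.

```python
import os
import tempfile

def count_reads(path, chunk_size):
    """Read the whole file unbuffered; return the number of read() calls made."""
    calls = 0
    with open(path, "rb", buffering=0) as f:  # buffering=0: each read() hits the OS
        while f.read(chunk_size):
            calls += 1
    return calls + 1  # +1 for the final empty read that signals EOF

# Make a 1 MB scratch file to read back.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"x" * (1 << 20))

big = count_reads(path, 64 * 1024)  # 16 full chunks + EOF probe = 17 calls
tiny = count_reads(path, 2)         # 524288 reads + EOF probe = 524289 calls
os.remove(path)
```

Same bytes delivered either way; the tiny-read version just pays the per-call overhead half a million times instead of seventeen.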
With this in mind, here are some tips for each resource type.
CPU
Most people start with a profiler that will tell them about their code. Before you proceed, read the above: you want to make sure you actually have a CPU problem before reaching for CPU tools.
If CPU usage is high then probably the best tool is a sampling profiler. It can give you high frequency call stacks showing you what code is running. And of course, code is running most of the time or you wouldn’t have a CPU problem. Did I mention that you shouldn’t be doing this if your CPU usage isn’t very high?
The simplest situation is that the stacks you gather will show you where your cost is and you can directly reduce that cost by changing your algorithm. However, even then it might not be obvious what to do. Sometimes the CPU usage looks like it is exactly where it needs to be. Then what?
- Instrumenting profilers can tell you which functions are called and what the call counts are. Compare these counts to the size of your input and look for issues: if you have 1M characters and your getchar() method is being called 2M times, something is probably wrong… the counts are invaluable.
- Low-level instrumentation can tell you how many instructions you retired in your workload; dividing the elapsed cycles by the instructions retired gives you “CPI”, cycles per instruction. A high CPI indicates that the CPU isn’t working very efficiently. Further instrumentation can tell you which factors are resulting in poor CPI (cache misses are probably the most common).
So the situation is either that your instruction count per unit of input is too high, in which case you need a better algorithm, or your CPI is too high, in which case you probably need to touch less memory (denser structures might help).
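The CPI arithmetic is simple enough to sketch. The counter values below are invented for illustration; real numbers come from hardware performance counters.

```python
def cpi(cycles, instructions_retired):
    """Cycles per instruction: elapsed core cycles divided by instructions retired."""
    return cycles / instructions_retired

# 3 billion cycles to retire 1 billion instructions is a CPI of 3.0.
# On a modern superscalar core that usually means the pipeline is
# stalling, most often waiting on cache misses.
example = cpi(3_000_000_000, 1_000_000_000)
```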
Try some experiments to see if you can find what helps. Be prepared for the fact that your guesses are likely to be wrong at first.
Disk
The attack strategy is actually remarkably similar even though the resource is totally different.
You’re going to want to get a sense of what disk I/O is happening; a good profiler (like ETL/WPA) can give you call stacks for every I/O so you can see what’s driving the work. But even without call stacks you can often make a lot of progress just by looking at what files are being read.
Once you have that data you’re going to look for the same kinds of things as in a CPU analysis. “What am I reading?” “Is it stuff I Really Need To Read or is it waste?” At the same time you consider “How am I reading it?” “Do I read in big chunks?” “Am I seeking all over the place?” “How does the disk throughput compare with the maximum the device is capable of?” If I’m reading in bad patterns then the total bytes transferred will be low even though the disk is constantly busy.
You have to understand whether the problem is the amount you’re reading, or the way you’re reading it. Once you have that information, you can devise a strategy to make it better.
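One way to answer the throughput question is a simple ratio of achieved to peak transfer rate. The device maximum below is an illustrative assumption, not the spec of any particular disk.

```python
def throughput_ratio(bytes_moved, busy_seconds, device_peak_bytes_per_sec):
    """Fraction of the device's peak throughput achieved while it was busy."""
    return (bytes_moved / busy_seconds) / device_peak_bytes_per_sec

# 10 MB moved during 5 busy seconds on a disk capable of 100 MB/s:
# the disk was saturated with seeks, not data.
ratio = throughput_ratio(10 * 2**20, 5.0, 100 * 2**20)  # 0.02, i.e. 2% of peak
```

A ratio near 1.0 means you are genuinely bandwidth-bound; a ratio like this one means the access pattern, not the volume, is the problem.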
Note that a lot of memory problems (like swapping) are actually disk problems and they’ll show up as huge amounts of disk I/O. Likewise system services like “fetching image resources” or “reading the registry” turn into disk I/O and can ruin your life as easily as real “read()” calls.
And of course everything I just said about reads applies to writes too.
Network
Again, remarkably similar. What is the total volume of data? What does efficient transfer look like? What code is driving the network usage? What resources are being fetched?
Attack the problem by improving the way you use your network resources or reducing the overall usage by not fetching data you don’t really need.
Good profilers can show you the call stacks that are originating connections, reading, writing, and so forth.
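The connection-overhead point can be modeled with a little arithmetic; the setup cost and bandwidth figures below are illustrative assumptions.

```python
def transfer_seconds(requests, bytes_each, bytes_per_sec, setup_sec, reuse):
    """Total time to move the data; without reuse, every request pays connection setup."""
    setups = 1 if reuse else requests
    return setups * setup_sec + requests * bytes_each / bytes_per_sec

# 100 requests of 10 KB each over a 1 MB/s link with 50 ms connection setup:
fresh = transfer_seconds(100, 10_000, 1_000_000, 0.05, reuse=False)  # 5 s setup + 1 s data
reused = transfer_seconds(100, 10_000, 1_000_000, 0.05, reuse=True)  # 0.05 s setup + 1 s data
```

The data time is identical in both cases; the fresh-connection version spends five times as long setting up as transferring.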
GPU
This should be sounding familiar by now, but there are some wrinkles.
Discrete GPUs have their own memory and can work on that memory without implicating the main memory system. To a point. All these systems require the original data to be transferred into the GPU at some point and that can become a problem.
Integrated GPUs don’t suffer from having to transfer data into GPU friendly memory but they have a much worse problem. The GPU is going to compete with the CPU for physical memory accesses. If your GPU is blending like crazy you may find there is little or no available memory bandwidth to do things like fetch instructions or fetch CPU data. This effectively slows your CPU to a crawl.
On many systems, especially phones, a good balance between letting the GPU do its work and letting the CPU do its work is essential.
The attack strategy is very similar as always. What work is the GPU doing? Where is it coming from? What is fundamentally necessary to do the job at hand? What is waste?
GPU analysis often looks at things like “pixel overdraw” meaning the same pixel was computed many times and only the most recent value was actually used. Algorithms that spend less time working on things that are ultimately occluded are invaluable. Algorithms with less blending create more opportunities for occlusion which in turn can help reduce overdraw…
There’s no one solution to minimizing GPU cost but the attack pattern is always the same. What’s needed? What did I do? Can I do less? Did I do my work efficiently?
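Here’s a small sketch of the overdraw idea: counting pixel writes for opaque layers drawn back-to-front (painter’s algorithm) versus front-to-back with an occlusion check. The tiny 4x4 layers are a made-up illustration.

```python
def back_to_front_writes(layers):
    """Farthest layer first: every layer writes all its pixels; later layers overwrite."""
    return sum(len(layer) for layer in layers)

def front_to_back_writes(layers):
    """Nearest layer first: a pixel already written is occluded, so skip it."""
    written = set()
    writes = 0
    for layer in layers:
        for pixel in layer:
            if pixel not in written:
                written.add(pixel)
                writes += 1
    return writes

# Three opaque 4x4 layers stacked over the same screen region:
layer = {(x, y) for x in range(4) for y in range(4)}
stack = [layer, layer, layer]
b2f = back_to_front_writes(stack)   # 48 writes: 3x overdraw
f2b = front_to_back_writes(stack)   # 16 writes: each pixel computed once
```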
Memory
Memory is entangled with many of the others, as we just saw with the GPU and as I hinted at with the CPU.
By now you know what I’m going to say: Use less, use it efficiently. But it’s important to know what’s really a memory problem and what’s some other problem driven by memory.
At its core, memory has a certain bandwidth, just like say disk or network, that defines the maximum number of bytes you could possibly read from main memory per second. Actually achieving this can be tricky but it’s doable: it mainly requires reading big swaths of memory in order. Few programs actually do this (except maybe integrated GPU programs).
Let’s look at this through the usage and efficiency lenses I mentioned before and talk about some specific things that happen.
1. You significantly exceed total available physical memory
If you do this then you no longer have a memory problem; you are now swapping and you have a disk problem. Disk problems are much crazier than memory problems. You will know this has happened because CPU usage will seem to go down: the CPU is waiting for the disk on all kinds of memory reads.
There are a couple of ways to attack this. You can reduce your total memory needs (this is always a good idea) or you can improve the locality of your memory usage, effectively reducing the needed physical memory. Any way you slice it you need to go on a physical memory diet.
2. You (try to) exceed the total memory bandwidth
If you have several threads running, all of which are beating on the memory system (one or more of which could be an integrated GPU), then you are in a world of hurt. Any time the CPU needs more memory it may find itself waiting not just for the memory read time but for a turn to read at all. Memory latency will go well above read time in this case and your processor will run arbitrarily slowly. But it will still look like it’s using 100% CPU time, because this is not a context-switch event from the perspective of the scheduler.
To fix this, total memory demand must be reduced. Smaller textures. More cache hits.
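A back-of-envelope bandwidth budget makes the problem visible; every figure below is an illustrative assumption, not a measurement of any real device.

```python
def headroom_gb_s(total_gb_s, consumers_gb_s):
    """Bandwidth left after every consumer takes its share."""
    return total_gb_s - sum(consumers_gb_s)

# A 25 GB/s memory bus with an integrated GPU blending at 18 GB/s and
# other CPU traffic at 5 GB/s leaves almost nothing for instruction
# and data fetches.
left = headroom_gb_s(25.0, [18.0, 5.0])  # 2.0 GB/s remaining
```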
3. You use your cache poorly
If you’re taking lots of cache misses then you are going to suffer frequent memory delays. Assuming competition for memory is light, these may cost around 70ns or less per miss. Enough time to retire maybe 150 instructions on a modern processor? Maybe more? Well, it depends, but lots.
Again, poor cache locality is likely to present itself as a CPU problem: it translates to poor (high) CPI. This is perhaps the most common situation of all: poor CPI due to bad locality. It’s super common in OO programming languages because they encourage a sea of pointers, with data structures whose pieces may or may not be near each other and which are full of big 64-bit pointers fattening them up for no good reason.
To improve this situation, eliminate pointers; value-rich data structures with a high density of useful data are best. Data structures like arrays and b-trees can pack a real punch. Don’t be afraid of a little extra per-node computation if it saves you cache misses. You can do a lot of compares in the time it takes to read one cache line.
The very worst thing you can do is follow pointers around, reading entire cache lines so that you can get only a very small amount of data from each cache line. That burns your bandwidth and can easily get you on your way to problem #2 above if you do it on several threads.
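The density argument is easy to put numbers on. This sketch counts cache lines touched when walking the same values stored densely versus scattered behind pointers; the 64-byte line and the worst-case one-node-per-line assumption are illustrative.

```python
LINE_BYTES = 64

def lines_touched_dense(n_values, value_bytes=8):
    """Sequential values pack many to a cache line."""
    return (n_values * value_bytes + LINE_BYTES - 1) // LINE_BYTES

def lines_touched_scattered(n_nodes):
    """Worst case for a pointer-linked structure: every node on its own line."""
    return n_nodes

dense = lines_touched_dense(1000)          # 125 lines for 1000 eight-byte values
scattered = lines_touched_scattered(1000)  # 1000 lines for the same data
```

An 8x difference in lines touched is an 8x difference in memory traffic for the same logical work.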
Contention
If none of your resources are looking very busy then you probably have some contention. One way to think about this is as follows: any data that has been wrapped by a critical section is actually a software resource. It has a wait time, a service queue, a utilization percentage: all those same things.
Instrument your software resources so that you can understand their usage just like you would understand disk. Attack them the same way.
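Here’s a sketch of what that instrumentation might look like. InstrumentedLock is a hypothetical wrapper, not a library class, and it assumes the metrics are only read after the worker threads have finished.

```python
import threading
import time

class InstrumentedLock:
    """A lock that records how long callers wait for it and hold it."""
    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0
        self.wait_seconds = 0.0   # time spent queued behind other holders
        self.hold_seconds = 0.0   # time spent inside the critical section
        self._acquired_at = 0.0

    def __enter__(self):
        start = time.perf_counter()
        self._lock.acquire()
        now = time.perf_counter()
        # These updates happen while holding the lock, so they are safe.
        self.wait_seconds += now - start
        self.acquisitions += 1
        self._acquired_at = now
        return self

    def __exit__(self, *exc):
        self.hold_seconds += time.perf_counter() - self._acquired_at
        self._lock.release()

# Usage: guard shared state with it, then read the metrics afterwards to
# see whether threads queue on it the way requests queue on a busy disk.
lock = InstrumentedLock()
counter = 0

def worker():
    global counter
    for _ in range(1000):
        with lock:
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

High wait time relative to hold time means the resource is oversubscribed, exactly like a saturated disk.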
Many profilers let you look at call stacks for context switches, this is a great way to understand where your code is waiting.
The combination of CPU sample tracing and context switch tracing makes you Santa Claus. You see your system when it’s sleeping and you know when it’s awake…
Summary
This is already much longer than I intended, but if you’ve read this far the summary should be self-evident:
- Identify and consider the critical resource, choose tools that let you see it clearly (special note, if the wrong resource has become critical, fix it!)
- Look for wasted work. Anything that isn’t forward progress on the problem at hand, remove it! This is the stuff that shouldn’t be happening at all. Wasted work can come in the form of bad algorithms.
- Consider the ways you are using your resource, are they efficient? Is your access pattern a good one? If not change your patterns to be more friendly to the resource. Again, bad algorithms can drive this.
- Finally, document your findings and learn from them. Documenting also lets you get audits from colleagues and encourages you to be thorough.
Now you can performance too.