General Guidelines for Software Performance Engineering in C++

Rico Mariani
8 min readApr 15, 2024

I wrote these as sort of generic guidelines for the C++ Core Guidelines document. I don’t know if they will end up there, but I thought they were interesting so here they are in this initial form. It might be interesting to see how they end up after others have a chance to look at them and add their own wisdom.

Understand your requirements and design accordingly.

C++ is not the most common choice for situations with no performance demands at all, but not all performance demands are the same. Before you can make any sensible choices, you should understand what success looks like in your own domain. Your requirements likely include some of binary size, latency, predictability, power, egress, and perhaps other more exotic metrics. What your needs are can and should color the choices you make. You will likely not be able to “optimize” your way to success if your major choices are actually totally unsuitable for your goals. The best practice is to make a sketch of what you intend to do, however complex or simple, and then ask yourself, “Is this likely to meet my needs?” If it is, proceed. If it isn’t, dig a little more, understand which things will matter, and invest and/or experiment accordingly. When you have a plan that holds water, start implementing it and check it as you go. This kind of simple analysis is never premature. If your requirements are easily achievable, you will be able to dispense with this process in a few seconds.

Optimize by measuring and simplifying according to your needs.

Many people believe that optimized code is uglier. This is flatly false; the most common and most successful optimizations involve simplification, not elaboration. Making your design more complex is almost never the right answer. Elaborate code is usually larger, slower, and more power-hungry than simple code.

When you choose to rewrite a piece of code to increase its performance, do so because one of your important metrics is telling you to do so. Check these metrics often, even preliminary versions, to make sure you’re on track. You can do this quickly and act before something very bad has happened. “It looks good so far” is a great place to be. You do not want to get to your finish line only to discover that something is horribly wrong and you have to diagnose it as a whole.

You do not have to boil out every last percentage immediately; you can often go wrong trying to do this too soon. But you also do not want to be off by orders of magnitude, using an entirely inappropriate technique, a dependency you can’t possibly afford, or completely disregarding a key metric either. Regularly verifying that you are on track is never premature.

Don Knuth popularized this quote by Tony Hoare, in full “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.”

Small efficiencies can wait 97% of the time. Reasonable choices that give you a good chance of success cannot wait. If you believe Sir Hoare — which is wise — 3% of the time small choices will burn you. That’s about 1 or 2 per screenful of code.

Consider the causes of performance problems, not the effects.

Time is a secondary metric. Large time values are an effect of either bad orchestration, large resource consumption (read too much, use an O(N²) algorithm instead of an O(N) algorithm), or inefficient use of a resource (i.e., asking it to do something it’s bad at, such as seeking all over, or dealing with non-local memory access). Typically secondary metrics like time are not directly helpful in making a diagnosis — they only tell you the code is taking too long to run. In contrast, primary metrics like “how many instructions ran”, “what fraction of branches were predicted”, or “how many bytes were read” and other “consumption” metrics are often very helpful when it comes to diagnosing your performance problems and determining how you might fix them.

Likewise, if poor orchestration is the problem (work spread across too many threads, too few threads, too much work issued at once, etc.), mapping out your orchestration will be very helpful. The total time is likely not actionable directly. Insights rarely come from secondary metrics. Typically, they tell you where to look further at best; you then need primary metrics to make improvements.

Keep in mind that you should understand how your system performs according to your goals, not just its latency. A system might be plenty fast enough and still use more memory than you can afford. Or more power than you can afford. Or be too big to reasonably download. If you are paying for CPU cycles, you are likely to be very excited about reducing cycles by 50% even if your customers already think the system is fast enough. Cycles are cost.

Whatever your system may be, referring back to your needs and your customer needs will tell you where you should be investing, and which investments are worthwhile. At some point, the potential benefits are not worth the opportunity cost of doing them. An extra 10% reduction of download size may not be worth the engineering time because you’ve reached a size that is small enough that your customers no longer notice the difference. But if your needs also include egress costs for those downloads, then maybe that same 10% is worth it. The very same potential improvement might be a great idea or a silly one depending on your needs.

Simpler is generally better

Assuming you have done the basics of making sure that your overall design is approximately right, such as using O(N) if it is available rather than O(N²) or something like that, then you’ll find that simplifying code and data is the best way to make things faster. This means:

  • Don’t create complex inheritance graphs.
  • Don’t use complicated dispatch logic if simple logic will do.
  • Don’t make data structures with lots of pointers.
  • Don’t use shared pointers if unique pointers will do.
  • Don’t use patterns that reduce your average code and data density.

In short, consider the problem you need to solve and solve it as directly as you reasonably can. Don’t completely focus on just your immediate needs; some future-proofing is usually necessary, but don’t ruin the code by over-generalizing it either. Remember that in order to be re-usable, code has to first be usable.

This is a bit contentious, but in practice, object-oriented programming generally results in the worst performance. Consider work-oriented programming instead. What has to happen? Do this clearly, simply, directly, with only as much abstraction as would be helpful.

Arrays/vectors are probably the most under-utilized and most important data structure. Use simple proven data structures a lot.

Don’t make claims about performance without measurements.

The field of performance is littered with myth and bogus folklore. Reality can be humbling, and I’ve often quipped: “If you’re very good at software performance engineering, you’re only wrong about 95% of the time.”

Measure. Experiment. Verify. Accept or discard in a passionless way. Repeat as needed. When needed.

Remember you will not have any idea what you should measure if you don’t at least understand your needs. Don’t focus on CPU usage per transaction if what matters to you is download size.

Consider your choices holistically.

The most challenging performance problems cannot be addressed by hot-spot tuning after the fact. They come as a consequence of the choices made in the design as a whole. In a performance analysis, the cost may be smeared evenly over hundreds or thousands of functions and give the appearance that your processor is running in molasses rather than fine motor oil.

You will find that some choices that are perfectly fine when used sparingly are very problematic if they are used universally. For instance, there’s nothing especially wrong with std::function, but if its use permeates your code, you might discover that the costs associated with functor maintenance are adding up. All those little copy operations, optimized as they are, are adding tax all over your code.

A similar thing might happen if you simply used std::shared_ptr everywhere just for uniformity. Every pointer would be a little bigger, every function would have a few more interlocked operations, and a few more implicit destructors. What does that do to your code overall? Some flat tax on the efficacy of the CPU, maybe. It might be hard to see without a lot of looking at disassembly and micro-architectural counters. Will it matter? It depends. The answer is always “it depends.”

Should you therefore avoid these things and roll your own? Probably not — shared_ptr does its job about as well as it can be done. The better advice is to use the abstractions you need and avoid expensive abstractions if cheaper ones will do the job. For instance, a class with five shared_ptr members raises an immediate question: “Do these five members really all have separate independently shareable lifetimes?” Probably one shared pointer for the whole class with five unique pointer members would have done the job. Simpler is almost always better. Things that can easily be generalized later should probably be generalized later, if at all. Premature generalization is also a root of evil.

Overgeneralization can happen in many areas; representation is a key one. C++ can represent very rich, dense data types. Use that to your advantage! If you choose to represent your key data structures in some non-local form that is full of pointers, you will discover that you are paying a tax for all those pointers. A good rule of thumb for understanding the quality of a data structure goes something like this: Make a diagram of your key structures — every time you use an arrow, give yourself a demerit. Every arrow counts — even the tiny, short harmless-looking ones that seem like they couldn’t hurt a fly. 64-bit pointers are big and are often the crux of the worst processor efficiency problems.

Patterns that appear everywhere have to be especially good at their job. Remember to consider your performance needs when making a big decision like what the primary representation of your data will be. Density is a crucial factor in any design. And again, object-oriented patterns tend to be the worst for performance because there are pointers all over the place, and those pointers lead to poor density and worse locality.

Frequently, key decisions, like your core representations, cannot be revisited economically after the fact. These choices have wide-ranging effects on the code and are unlikely to be addressable by hot-spot analysis. A bad choice might impose significant tax on every data operation everywhere in your program. Most programs cannot afford a flat tax on any key system.

Design to enable optimization.

We prefer to start with a design that has, shall we say, the right “shape” from the outset. But this leaves many factors to resolve later — those small efficiencies Hoare mentioned, 97% of which we can ignore. But the 3% still might kill us. We don’t want to have to redo everything when plug in those few bits that we deferred.

Point optimizations should not change the overall shape of your design. We really want to start with something that is substantially likely to work from the outset. But you can defer a lot of choices if you make your system sufficiently modular. If you’re not sure what all of your requirements might be, some things (usually not binary size) might be constructed in such a fashion that they are easily replaced. This doesn’t require that you adopt anything expensive; you may even shun a cost like interface dispatch because of the nature of your contract (e.g., shading polygons). But with some thought, you can almost always create some clean contract that allows you to change parts of your design without undue cost and without unnecessary complexity. Your needs will drive these choices.



Rico Mariani

I’m an Architect at Microsoft; I specialize in software performance engineering and programming tools.