“If you’re really good at software performance engineering you’re only wrong about 95% of the time”
If you’re not good, it’s worse.
This is one of my favorite aphorisms and I throw it in every time I teach performance engineering. I didn’t do an actual study to get an exact number (obviously); it’s just supposed to roll off the tongue easily. But I think it’s approximately correct, or at least approximately correct in my experience, which is all I can speak to anyway.
Why is this useful advice? Because it helps you understand what good habits are for your performance projects and also maybe helps dispel a little imposter syndrome. And now you’re saying: “What are these good habits you speak of, Rico?” I’m glad you asked.
0. Refer to your whatever-it-is as your latest crazy idea.
I know this has nothing to do with anything, but it will help you to get in the right frame of mind. Be suspicious of yourself. When discussing with others don’t sound like you’re committed to this one approach. As you collaborate let others refine and reject some or all of your crazy idea. If you’re good, your crazy idea will mutate (in maybe 20 steps) into an actually-not-that-crazy idea. Don’t worry, nobody will remember the earliest and craziest forms and you don’t have to document them unless you think they have value for humor or education.
1. “Does this crazy idea even work at all?” — try it out on the dumb
As soon as possible try out your crazy idea. Find the stupidest experiment you can do that gives you evidence that this crazy thing actually might be useful. The key is KISS.
A simple example (we’ll use this throughout): You have a complicated project that initializes many systems at startup. Some of them are rarely used. You propose to not initialize some of them until they are needed.
Example sanity check: comment (or stub) out the code that initializes one of the systems that you think you can defer. You should still be able to run for quite some time before anything notices or else you are totally wrong about this thing not being used.
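A rough sketch of that sanity check in Python, assuming the deferred system is reached through a single object. The class name TripwireStub and the "spellcheck" system in the usage note are made up for illustration. Instead of silently removing the system, you swap in a stub that records when something first touches it.

```python
import time


class TripwireStub:
    """Stands in for a subsystem whose initialization was commented out.

    The first attribute access is recorded, so after a run you can see
    how long the program went before anything noticed (if anything did).
    """

    def __init__(self, name, clock=time.monotonic):
        self._name = name
        self._clock = clock
        self._start = clock()
        self.first_touch = None  # becomes (attribute, seconds since start)

    def __getattr__(self, attr):
        # Only reached for attributes that are not set in __init__,
        # i.e. real uses of the missing subsystem's API.
        if self.first_touch is None:
            self.first_touch = (attr, self._clock() - self._start)
        raise RuntimeError(
            f"{self._name}.{attr} was used, but {self._name} was never initialized"
        )
```

Put something like TripwireStub("spellcheck") where the real object would have been created, run your usual scenarios, and look at first_touch. If it fires in the first few milliseconds, the idea is probably wrong about this system being rarely used.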
2. “What could we possibly save?” — get a number, or numbers
Now that you have some prototype that shows that this crazy idea has a hope of working, let’s do a measurement. The job now is to build something on the cheap that gives you an idea what you could possibly save in your wildest dreams. Note: the code doesn’t have to really work at this point, in fact it usually won’t! The point here is not to get it fully working but to just try things out and see what we’re even talking about.
Why? Well, suppose you do this experiment, and you see that in your “it-can’t-be-better-than-this test” you save 2ms. Well, now you have some idea how valuable this crazy idea can be. Is 2ms worth a person-week? Many person-weeks? More? Less? You need to have some idea because as the work you need to do to actually turn this crazy idea into reality comes into focus you need to be able to say, “this is worth it” or, “we should look elsewhere.”
When you’re thinking about how much of a report to write or how many colleagues to even bother with your idea, you should already have this number. If a lot of people will be required to turn your crazy idea into reality it must be worth the bother.
Going back to the previous example of deferring initialization of some system, all we have to do is add some metrics around the region we had commented out and also around the overall region that contains it. Why? Because removing some initialization doesn’t just remove the cost of doing the initialization; it also saves you memory and saves you from dirtying your memory cache, disk cache, and whatnot. When you remove work, it’s common for there to be direct savings and indirect savings. You want to know about both. You also want to know if that thing you deferred ends up being initialized by something else 2ms later in the same path (potentially at greater cost).
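Here is one cheap way to get both numbers, sketched in Python. Everything named here (region, init_rare_system, init_everything_else, split_savings) is a hypothetical stand-in, and wall-clock timing is an assumption; in a real system you would likely use whatever tracing you already have.

```python
import time
from contextlib import contextmanager


@contextmanager
def region(timings, name, clock=time.perf_counter):
    """Add the wall-clock time spent inside the block to timings[name]."""
    start = clock()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + (clock() - start)


def init_rare_system():
    """Hypothetical stand-in for the subsystem we want to defer."""
    sum(range(20_000))


def init_everything_else():
    """Hypothetical stand-in for the rest of startup."""
    sum(range(50_000))


def startup(skip_rare_system):
    """Run startup once and return the timings for both regions."""
    timings = {}
    with region(timings, "startup"):
        if not skip_rare_system:
            with region(timings, "rare_system_init"):
                init_rare_system()
        init_everything_else()
    return timings


def split_savings(baseline, experiment):
    """Return (direct, indirect) savings in seconds.

    direct   = time spent inside the region that was removed
    indirect = whatever else the enclosing region gained (or lost)
    """
    total = baseline["startup"] - experiment["startup"]
    direct = baseline["rare_system_init"]
    return direct, total - direct
```

Run startup(False) and startup(True) many times each and compare typical values, not single runs. A negative indirect number is a hint that the work moved somewhere else in the same path.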
Remember: you are likely not able to deploy these experiments into real production environments, so you’ll have to use past experience to convert whatever lab results you get into projections. For example: “Saving 1M of allocations at startup tends to save us about 25,000 demand zero faults. Those are about 2us a pop so we’re looking at about 50ms for customers.” If you don’t have this kind of intuition or experience based on data, you need to get it or you’re just making stuff up.
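The arithmetic in that projection is simple enough to write down, which is worth doing so the inputs are visible and arguable. A tiny Python sketch; the function name is made up, and the two inputs are the illustrative figures from the quote, not measurements.

```python
def projected_savings_ms(faults_avoided, cost_per_fault_us):
    """Convert a count of avoided faults into projected milliseconds saved."""
    return faults_avoided * cost_per_fault_us / 1000.0


# The figures from the example: 25,000 demand zero faults at about 2us each.
print(projected_savings_ms(25_000, 2))  # 50.0
```

The formula is the easy part. The hard part is having data that justifies the two inputs for your product and your customers.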
3. “What could possibly go wrong?” — make a list
Another favorite aphorism of mine is “It’s easy to make the code fast if it doesn’t have to work.” Now add this one: “Performance improvements that add a lot of complexity are rarely a good idea.” With these two notions you have powerful guidance.
We need to understand what it is we’re about to do to the system: what new failure modes we might add, what new operational complexities, what new correctness issues, synchronization issues, orchestration issues, and pretty much anything else you can think of.
A good way to start is to propose the simplest thing you can think of that has any hope of working and then get some criticisms. This is where having colleagues that know your system well will help you. You can give them a sense of the urgency and value with the data you have up to this point. A huge potential win merits much discussion. Conversely, small wins should be discarded quickly.
As you learn the realities of making your crazy idea real, you will likely need to evolve it many times. This is ok; it’s normal. You’re looking for ways to get your win with the fewest side-effects. Remember, there are no bonus points for complexity in performance work, there are only penalty points. Well, that is probably universally true, but in performance work especially, complexity is strictly a penalty. The best performance work simplifies a system, not the reverse.
Let’s look at our example again: We propose to defer initialization of a system. What can go wrong? Well, when will we need to finally do this deferred initialization? Will we be on the right thread? Will the initialization cost at that time be similar to the original cost? Worse? Better? Will we need additional resources, such as lots of locking, to do it later? Will the lazy initialization affect other systems’ performance and/or state by causing interference? How will we identify all the places the lazy initialization has to happen? How will we keep those correct if new ones appear or old ones become unnecessary?
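To make a few of those questions concrete, here is about the simplest lazy initialization wrapper I can think of, sketched in Python. The class name Lazy is made up, and taking a lock on every access is an assumption chosen for simplicity, not a recommendation.

```python
import threading


class Lazy:
    """Run `factory` at most once, on the first get(), from whichever thread asks."""

    def __init__(self, factory):
        self._factory = factory
        self._lock = threading.Lock()
        self._ready = False
        self._value = None

    def get(self):
        # Every caller pays for the lock. That is the "additional
        # resources" question: is this cost acceptable on your hot paths?
        with self._lock:
            if not self._ready:
                # Runs on the first caller's thread, which may not be the
                # thread the original startup code ran on.
                self._value = self._factory()
                self._ready = True
            return self._value
```

Even this tiny version surfaces most of the list: every use site now has to go through get(), the first caller pays the whole initialization cost at a time it did not choose, and the factory runs on whatever thread got there first.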
Keep in mind that at this point you probably want to have zero lines of code checked in. We’re still talking, diagramming, documenting, and maybe hacking together some samples for measurement. Many crazy ideas meet their doom at this stage. You want to get them doomed-already as fast as possible.
4. “How do we try this out?” — make a plan
If you’ve gotten this far your idea probably looks only vaguely like it originally did. You now have a simpler approach that has a chance of working. The next thing you want to do is start making a plan of baby steps to take towards making this real. If this were a battle this is the part where you figure out what weapons you need and how you will get them and use them. There is still no “fighting”.
By the end of this stage, you should have a strategy to roll out the work that makes sense, that has a suitable cost, and that has many points where you can abort if things are not going the way you expected. You may have to reset all the way back to step 1 or 2 — or even abandon your crazy idea entirely.
Going back to the example yet again, there are probably several systems you could avoid initializing. Pick one. Pick one that’s maybe not the hardest but that has enough of the tricky issues that it’s a good example. Figure out a reasonable order for the others. Figure out what tests you need to add to make sure you aren’t breaking things along the way. Figure out what to look for as trouble signs: what’s smoke and what’s fire.
At this point you should be able to explain to anyone what your idea is and how you intend to proceed. This is the point at which you usually have to talk to some leadership to get resources committed.
5. “Can we code yet?” — start working and be careful
By now you have a good idea what wins you are expecting and how you’re going to get them. You can start doing the work in an orderly fashion while verifying that things are going as you expect. Do not ever make the mistake of saying you will have no way to know if your crazy idea is any good until you’re finished — that’s lame. There should be clear signs that things are going swimmingly at each new stage of development. Some are indirect, that’s ok. Some are only lab measurements, that’s ok. No checks along the way is not ok.
There are a number of things you can verify:
- the wins you’re seeing in the real code are in the same ballpark as the early dumb tests
- the ideas you had for maintaining correctness and simplicity are working as intended
- the tests you added for the same are passing
- no new significant problems are popping up that are making you re-think the whole idea
You can do these checks at many stages, with just a few extra measurements. Some of the numbers are not going to be final because you are almost certainly not deploying into customer environments with all their noise and complexity. You won’t know the exact truth of how good your crazy idea is until it’s really out there, but you should have a good idea that it’s working as intended.
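The first check in that list (real wins in the same ballpark as the early dumb tests) can be as simple as a ratio test. A sketch in Python; the function name and the factor-of-two default are assumptions, so pick a tolerance that your lab noise actually justifies.

```python
def in_same_ballpark(projected_ms, measured_ms, factor=2.0):
    """True if measured is within `factor`x of projected, in either direction.

    A zero or negative number on either side means there is no win to
    compare, so that counts as out of the ballpark.
    """
    if projected_ms <= 0 or measured_ms <= 0:
        return False
    ratio = measured_ms / projected_ms
    return 1.0 / factor <= ratio <= factor
```

For example, if the early experiment projected 50ms and the real code saves 40ms, in_same_ballpark(50, 40) is True. If it only saves 5ms, the check fails and it is time to find out why before doing more work.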
6. “How did we do?” — time to test the pudding
At the end of the day, you get the win, or you don’t; but by the time you’re at this stage you really are unlikely to be surprised. Most things that could go wrong were ruled out far earlier. We of course still do get final results because this is where we find out how good our projections were. If we’re wrong at this point, it won’t be the most economical lesson, but there will still be things to learn.
If you’re lucky you can A/B test the thing. But that’s not always possible. Examples:
Can A/B: Deferred initialization: Ship in such a way that x% of customers are doing the deferred initialization; we can see if they get the expected improvement and/or have crazy bugs we didn’t expect.
Can’t A/B: New cool page layout: If we re-order our binary in this new super-cool way we’ll get 5% fewer page faults on startup and hence a nice startup win; it could easily be impossible to ship two separate binaries.
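For the case that can be A/B tested, one common way to put x% of customers on the new path is to hash a stable identifier into a bucket. A minimal sketch in Python, assuming you have some stable customer id; the function name, the experiment label, and the 10,000-bucket granularity are all made up for illustration.

```python
import hashlib


def in_experiment(customer_id, experiment, percent):
    """Deterministically place about `percent`% of customers in the experiment.

    The same customer always gets the same answer for the same experiment,
    and different experiment names produce unrelated groupings.
    """
    key = f"{experiment}:{customer_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100
```

A caller would then do something like: if in_experiment(customer_id, "deferred-init", 5), take the deferred path, otherwise initialize at startup as before. Because assignment is deterministic, a customer does not flip between paths from run to run, which keeps the comparison clean.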
So yeah, you won’t really know for sure until you really roll out your crazy idea, but you will have lots of interim steps and reasons to be confident. When you look at your final result, you’ll be able to see how it compares with expectations. You should be able to say, “nailed it”, “meh”, “omg bbq!”, or “well, poop.” But “nailed it, just like we planned” is really the optimal outcome.
For simple ideas you can breeze through stages 1–4. But then, nobody asks me for advice on how to do perf work on simple ideas…