Analysis: Prefer Consumption Metrics to Time Metrics

This is one of the most important lessons any performance professional can learn. I say it a lot of different ways to try to make it memorable but the long and the short of it is this: (elapsed) time is my least favorite metric.

Why?

The most important reason is that even if you’re watching it like a hawk, when it goes south, it gives you very little insight. I mean, it helps if you have many times and split times and whatnot. And of course time is what customers care about, so you really do have to keep your eye on it. But when it goes south, what do you do? You won’t be getting any answers from a time metric because it isn’t actually a primary metric. Which brings me to the first aphorism:

“Elapsed time is data I can only cry about.”

So what’s better? Broadly, it’s what I call consumption metrics. Understanding the consumption gives you real insight into how your program is using machine resources, what is important, and also what is going south. The most important consumption metric is the one that is associated with your application’s critical resource: the one most in demand, the one that is going to be determining your overall performance. Bringing me to the second aphorism:

“Measure your critical resource. It is, after all, critical.”

Seems obvious and yet I can’t tell you how many people will waste hours of their life trying to understand elapsed time… Maybe we should skip to aphorism number three:

“Time is for rookies.”

What are some good examples of consumption metrics?

  • disk (read ops, write ops, total bytes, queue length, etc.)
  • cpu (instructions retired, branch predicts, even maybe cpu time)
  • memory (l2, tlb, page faults esp. as they relate to disk above)
  • network (round trips, total bytes, etc.)
  • gpu (texture sizes, vertices, shader cost, etc.)
  • cardinality (# of draws, # of layouts, # of renders, # of etc.)
  • other resources, especially software resources
  • there are hundreds more…

These things really tell you stuff… like if you notice total instructions retired went up and the number of frames painted in a scenario also increased, this can be super helpful in getting to the bottom of what happened. Something is driving more paints. The paint code probably isn’t any worse. The consumption information helps in a way that “it’s taking longer to finish” just can’t.
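To make the paint example concrete, here is a minimal sketch of the reasoning in Python. Everything in it is an assumption for illustration: the `RunCounters` type, the `diagnose` helper, the 5% tolerance, and the counter values are all made up, not taken from any real tool or measurement. The idea is just to divide instructions by frames, so you can see whether each paint got more expensive or whether there are simply more paints.

```python
from dataclasses import dataclass


@dataclass
class RunCounters:
    """Consumption counters from one run of a scenario (hypothetical values)."""
    instructions_retired: int
    frames_painted: int


def diagnose(baseline: RunCounters, current: RunCounters, tolerance: float = 0.05) -> str:
    """Say whether extra instructions come from more paints or costlier paints."""
    base_per_frame = baseline.instructions_retired / baseline.frames_painted
    cur_per_frame = current.instructions_retired / current.frames_painted
    per_frame_growth = cur_per_frame / base_per_frame - 1.0
    frame_growth = current.frames_painted / baseline.frames_painted - 1.0

    # Same cost per frame but more frames: the paint code is not the problem.
    if frame_growth > tolerance and abs(per_frame_growth) <= tolerance:
        return "more paints: something is driving extra frames"
    # Each frame costs more: the paint path itself changed.
    if per_frame_growth > tolerance:
        return "costlier paints: look for a new or longer code path"
    return "no significant change"


# Made-up numbers: instructions and frames both went up by 50%.
baseline = RunCounters(instructions_retired=1_200_000_000, frames_painted=60)
current = RunCounters(instructions_retired=1_800_000_000, frames_painted=90)
print(diagnose(baseline, current))
```

With these numbers the cost per frame is unchanged, so the sketch points at whatever is scheduling the extra paints. An elapsed time number alone would only have said “slower”.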

Many scenarios have memory problems (side note: a high page fault count isn’t actually a memory problem, it’s probably an i/o problem). Do you have good locality? If you’re suddenly visiting a lot more memory to get the job done, you are likely to notice a jump in L2 cache misses. Alternatively, if memory and other resources are fixed but total instructions have increased, then you’re looking for something that added a new, long code path…

Another great thing about consumption metrics is that many of them are invariant across devices and frequently much more stable, and therefore more useful in a lab setting. Total disk bytes read, for instance, could be very hardware insensitive. Total frames painted might be similar across a variety of devices. Lab results for either of these could be invaluable in finding regressions.
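Here is a small Python sketch of what “count the consumption” can look like for reads. The `CountingReader` wrapper and the `load_records` scenario are hypothetical names invented for this example, and an in-memory `io.BytesIO` stands in for a real file. The point is that the read count and byte count come out the same on every run and every machine, which is exactly what you want from a lab metric.

```python
import io


class CountingReader:
    """Wraps a binary stream and counts read operations and bytes returned."""

    def __init__(self, raw):
        self._raw = raw
        self.read_ops = 0
        self.total_bytes = 0

    def read(self, size=-1):
        data = self._raw.read(size)
        self.read_ops += 1          # counts every call, including the final empty read
        self.total_bytes += len(data)
        return data


def load_records(stream, chunk_size=4096):
    """Hypothetical scenario under test: read the stream to the end in chunks."""
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


payload = bytes(10_000)  # stand-in for a file on disk
reader = CountingReader(io.BytesIO(payload))
load_records(reader)
print(reader.read_ops, reader.total_bytes)
```

If a change to `load_records` starts reading the data twice, or drops the chunk size, these counters move even when the lab machine serves everything from cache and the elapsed time barely changes.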

Time is one of the noisiest metrics and is subject to all kinds of interference. But most important of all, it is rarely directly probative… Contrariwise, in lab settings, time metrics can be insensitive to many issues users will face. Even if the i/o is super cheap in the lab (because, e.g., it is 100% cached), measuring the consumption will tell you something about what is going to happen in the real world. Things that would otherwise not be flagged as possible regressions tend to pop.

In short, you get more sensitive data, and more probative data, if you measure the primary metrics: consumption metrics. The things that most closely correlate to the critical resource invariably come from the consumption family.

One last note on software resources. Every time you put a critical section around something, or really any kind of locking system, you have just created a software resource. It will have a queue length, an average service time, some rate of traffic, stuff like that. In short, it will look a lot like, say, “disk” above. That’s totally fine… go ahead and model it like that. If it’s important, dare I say, “critical”, then you should be thinking about it just as carefully as you might consider the others.
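As a rough sketch of modeling a lock like a disk, here is a Python wrapper that records the numbers mentioned above. `InstrumentedLock` is a hypothetical class written for this example, not a standard library type, and the worker with its short sleep is a made-up stand-in for real work in a critical section. “Queue length” here means the number of threads that have asked for the lock and not yet received it.

```python
import threading
import time


class InstrumentedLock:
    """A lock that records the numbers you would track for any queued resource."""

    def __init__(self):
        self._lock = threading.Lock()
        self._stats = threading.Lock()  # guards the counters below
        self.acquisitions = 0
        self.waiting = 0                # current queue length
        self.max_waiting = 0            # peak queue length
        self.total_wait_s = 0.0         # time spent queued
        self.total_hold_s = 0.0         # time spent holding (service time)
        self._acquired_at = 0.0

    def __enter__(self):
        start = time.perf_counter()
        with self._stats:
            self.waiting += 1
            self.max_waiting = max(self.max_waiting, self.waiting)
        self._lock.acquire()
        now = time.perf_counter()
        with self._stats:
            self.waiting -= 1
            self.acquisitions += 1
            self.total_wait_s += now - start
        self._acquired_at = now  # only the current holder writes this
        return self

    def __exit__(self, exc_type, exc, tb):
        held = time.perf_counter() - self._acquired_at
        with self._stats:
            self.total_hold_s += held
        self._lock.release()
        return False


lock = InstrumentedLock()


def worker():
    for _ in range(50):
        with lock:
            time.sleep(0.0005)  # stand-in for work done inside the critical section


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("acquisitions:", lock.acquisitions)
print("mean service time (s):", lock.total_hold_s / lock.acquisitions)
print("mean wait (s):", lock.total_wait_s / lock.acquisitions)
print("peak queue length:", lock.max_waiting)
```

The acquisition count is a pure consumption number and should be stable from run to run. The wait and service times are still times, so expect them to be noisy, but comparing mean wait against mean service time is a quick way to see whether you have contention or just a slow critical section.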

Looking at consumption and consumption patterns will help you to get to the bottom of your problems. Am I using too much of something, or am I just using it in an inefficient manner? Do I have contention? Am I bandwidth limited (on e.g. memory, or disk)?

Those answers, well plotted, will tell you what you need to know. Lacking them… well… good luck greenhorn :)

I’m a software engineer at Facebook; I specialize in software performance engineering and programming tools generally. I survived Microsoft from 1988 to 2017.
