Understanding the Shape of Your Data
Sooner or later, every team ends up in a discussion about metrics. These can be anything from CPU usage, to latency, to remediation time. Any given business has many metrics to consider, and those metrics need good goals.
A classic problem with metrics is that there are “outliers” in the system that skew the results: a handful of bad cases can throw off a mean, or something like that. And goals should be meaningful, so that people who are working hard can actually see the progress they are making. As a consequence, there is sometimes a temptation to disregard some of the data, or to use a metric that is less sensitive to noise and extrema, like, say, P50. Sometimes people reach for P75 and P90 as well, to get a more rounded perspective.
Now, looking at more percentiles is of course a good thing; any one percentile gives you a very incomplete picture of the world. But the message I want to convey here is that percentiles in general tend to hide very useful information, and the center of your attention should be the distribution of your metric. Only after you have studied your distribution can you reasonably select one or a few simple summary statistics that tell the story of your metric clearly.
To make this more apparent, I have prepared a few pictures. As you will see, all of these pictures have the same P50, P75, and P90 by construction.
Let’s start with the “baseline” picture:
Let’s pretend these are latencies. The first image shows 100 hypothetical observed latencies, ordered by size. Since there are exactly 100 points, we can read the P50, P75, and P90 right off the chart. You can easily see the linear sections: one fixed slope up to P50, then a new slope to P75, then another slope to P90, and finally a slope up to the maximum value. The histogram shows the count of values in each bucket. In the interest of limiting clutter, I have labelled only the max value on both y-axes, because really only the shape matters for this discussion.
The histogram shows this gradual decay toward higher values pretty clearly. The P50, P75, and P90 are 12.5, 20.83, and 28.33, respectively. You’ll be seeing those a lot.
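The pictures themselves aren’t reproduced here, but a small Python sketch (using NumPy) can stand in for the baseline. To be clear, the segment slopes and the tail below are my own assumptions; only the three quoted percentiles are pinned by the construction.

```python
import numpy as np

# One hypothetical reconstruction of the baseline: 100 sorted latencies made
# of four linear segments. The slopes are assumed, chosen only so that the
# quoted P50/P75/P90 fall out; the tail slope is a pure guess.
lat = np.concatenate([
    0.25 * np.arange(1, 51),         # ranks 1..50:  slope 0.25  -> P50 = 12.5
    12.5 + np.arange(1, 26) / 3,     # ranks 51..75: slope 1/3   -> P75 ≈ 20.83
    20.83 + 0.5 * np.arange(1, 16),  # ranks 76..90: slope 0.5   -> P90 ≈ 28.33
    28.33 + 1.5 * np.arange(1, 11),  # ranks 91..100: assumed steeper tail to the max
])

# With 100 sorted points, Pk is simply the k-th value.
for k in (50, 75, 90):
    print(f"P{k} = {lat[k - 1]:.2f}")

# A coarse 10-bucket histogram of the same data shows the gradual decay.
print(np.histogram(lat, bins=10)[0])
```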
Let’s make a slight change to the data; we’ll do this a couple of times.
As you can easily see looking at the overall data, this sample is materially worse: the minimum latency is now fully half of the P50. Of course, the P50 itself has not changed, and neither has any of the data above it.
Now think about this from a work perspective. If we were making changes in a code base or a process, then by using P50 we could commit any crime we like down at, say, P25, and it wouldn’t affect our metrics. Likewise, any improvements below P50 do not appear at all, and those could be very important improvements or regressions. And symmetrically, any crime beyond the last percentile examined is invisible: with nothing above P90 being watched, P95 could be arbitrarily worsened.
Contrariwise, looking at the distribution (even a 10-bucket histogram) makes this change pop trivially. It’s easy to show what happened, and the nature of the change might even be probative in figuring out the problem.
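Continuing the sketch above, here is what that crime looks like in code. The specific degraded values are, again, invented: everything below the median gets dramatically worse, the three percentiles don’t budge, and the 10-bucket histogram gives it away instantly.

```python
# A "crime" below P50: the 49 fastest latencies all get much worse, with the
# new minimum at half the old P50, but sorted order is preserved and the
# values at ranks 50, 75, and 90 are untouched.
worse = lat.copy()
worse[:49] = np.linspace(6.25, 12.4, 49)  # assumed degraded values, all still below 12.5

for k in (50, 75, 90):
    print(f"P{k}: baseline {lat[k - 1]:.2f}, degraded {worse[k - 1]:.2f}")  # identical pairs

# The same 10 buckets, before and after: the low buckets empty out and the
# bucket just under P50 swells, which no percentile report would show.
print(np.histogram(lat, bins=10, range=(0, lat.max()))[0])
print(np.histogram(worse, bins=10, range=(0, lat.max()))[0])
```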
Let’s try another one.
Things are now even messier: there are several linear regimes with different slopes. The distribution is interesting because there are two common latencies, corresponding to the flatter-slope areas. The histogram shows this with peaks in the 1st and 4th buckets.
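Here is one made-up shape with that property, again pinned to the same three percentiles. The two near-flat runs in the sorted data become two histogram peaks:

```python
# A hypothetical bimodal sample: two near-flat regimes (two common latencies)
# plus steeper climbs between them, constructed so P50/P75/P90 match the
# baseline exactly. Every other value here is invented.
bimodal = np.concatenate([
    np.linspace(3.0, 4.2, 30),     # flat regime: ~30 requests clustered near 3-4
    np.linspace(4.5, 12.4, 19),    # steep climb
    [12.5],                        # rank 50 -> P50 unchanged
    np.linspace(13.2, 15.0, 20),   # second flat regime clustered near 14
    np.linspace(16.0, 20.5, 4),    # climb toward P75
    [20.83],                       # rank 75 -> P75 unchanged
    np.linspace(21.5, 28.0, 14),   # climb toward P90
    [28.33],                       # rank 90 -> P90 unchanged
    np.linspace(30.0, 43.33, 10),  # tail to the same max
])

for k in (50, 75, 90):
    print(f"P{k} = {bimodal[k - 1]:.2f}")

# Ten equal buckets: the two flat regimes show up as peaks in the
# 1st and 4th buckets.
print(np.histogram(bimodal, bins=10, range=(0, 43.33))[0])
```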
Now, real data is typically not a linear spline, and many kinds of changes you can make to a system will affect the response everywhere, not just in one regime. But this is not universally true, and indeed understanding the various parts of your distribution can be very helpful. For instance, you might see a bulge of service times at some few hundred milliseconds and, upon reflection, think, “oh, those are the times when our DNS cache is stale.” That can tell you something about how to make things better.
Now, once you’ve made an improvement, or suffered a regression, you might be able to summarize the effect very succinctly: “P50 is way worse,” or “our best case suffered, as we can see in P25,” or whatever the case may be.
However, if you choose any given small set of percentiles, you will always be blind to important wins or losses, and you lose the insights you might get by understanding the shape. Demonstrating wins by showing differences in the shape is also very satisfying: “Look at how all those outliers clumped right up and shifted left.”
It’s worth noting that in many (most?) cases, the outliers are not actually anomalies but rather just part of the service norms. Sometimes things go wrong and a retry is required; sometimes some system is in a not-ready state and there is a delay. You can see these effects in the distribution, and they are part of your operation. A second “bump” in an otherwise smooth falloff can be very telling: “What is going on at 2 seconds?”
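As a toy illustration of that kind of bump (every number here is invented): suppose most requests complete in a few hundred milliseconds, but a small slice hits a 2-second retry path. The retries appear as a clear second mode:

```python
# Invented example: ~95% of requests on a fast path, ~5% through a 2s retry.
rng = np.random.default_rng(42)
fast_path = rng.gamma(shape=4.0, scale=75.0, size=950)  # typical ~300ms latencies
retry_path = 2000.0 + rng.normal(0.0, 100.0, size=50)   # the "what is going on at 2 seconds?" bump
sample = np.concatenate([fast_path, retry_path])

# A crude text histogram: the second bump near 2000ms is unmissable.
counts, edges = np.histogram(sample, bins=12, range=(0, 2400))
for lo, n in zip(edges[:-1], counts):
    print(f"{lo:6.0f}ms+ {'#' * (n // 10)}")
```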
One final observation: this last distribution is actually the best of the bunch, despite being weird. It’s weirdly… good? A change like this, pulling a lot of cases into the low end, should be celebrated.
Milk the distributions and use summary statistics only after you know that they tell the story accurately.