When I worked on the Edge browser, one of the problems we faced when looking at telemetry was that the early results tended to skew too favorably. This seems to be especially prevalent in reliability telemetry. A similar thing tends to happen with performance data, where the early returns often come from people with the best devices and so they skew favorably.
The above picture illustrates not-atypical results: the early returns (solid line) skew positive compared to the later results (dashed line); lower is better in this metric. Now it turns out that in this case the new configuration really was better, but it was much less of an improvement than the early results indicated, and it took several days before the overall returns showed the true improvement.
In my experience this is a very common problem with early results in all kinds of metrics: power, reliability, performance, and many others.
Interestingly, we never seem to apply what I think is the most obvious solution: do not sample randomly or aggregate everything. Instead, sample to get a representative population.
You see, in many cases, and in nearly all the cases I work on, there is actually tons of data available. Performance and reliability results with n > 10⁶ are not uncommon, but many population segments are over-represented in the early results. But, yowsa, 10⁶ samples is a crazy large number. Who gets to do a study with n ≈ 10⁶? This much data allows you to easily control the mix to get something more like reality.
To get something better, look at your typical overall mix of data across a variety of characteristics and over a long period of time. Maybe the important dimensions are CPU class, available memory, available disk, network speed, maybe some user demographics. Whatever they are, choose a much smaller non-random sample, say n = 10⁴ (still pretty big), whose mix along those dimensions matches the normal data characteristics. You can do this much earlier in the results cycle and get meaningful comparisons sooner.
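As a minimal sketch of that idea, assuming the telemetry arrives as a list of records and that you already know your long-run mix, something like the following would do. The names here (`representative_sample`, `target_mix`, the `cpu` field) are hypothetical and only for illustration. Picks within each segment are still random; what is controlled is how many rows each segment contributes.

```python
import random
from collections import defaultdict

def representative_sample(records, target_mix, n, key, seed=0):
    """Draw about n records so each segment's share matches target_mix.

    records    : list of telemetry rows (hypothetical shape)
    target_mix : {segment: fraction}, the long-run mix, summing to 1
    key        : function mapping a record to its segment
    """
    rng = random.Random(seed)
    by_segment = defaultdict(list)
    for r in records:
        by_segment[key(r)].append(r)

    sample = []
    for segment, fraction in target_mix.items():
        quota = round(n * fraction)
        pool = by_segment.get(segment, [])
        if len(pool) < quota:
            # Too early: this segment has not reported enough yet.
            raise ValueError(f"not enough data for segment {segment!r}")
        sample.extend(rng.sample(pool, quota))
    return sample

# Early returns dominated by high-end devices (made-up numbers).
early = [{"cpu": "high"}] * 9000 + [{"cpu": "low"}] * 1000
# Assumed long-run mix: mostly low-end devices.
picked = representative_sample(early, {"high": 0.3, "low": 0.7}, 1000,
                               key=lambda r: r["cpu"])
```

The `key` function can combine several dimensions, for example returning a tuple of CPU class and memory bucket, at the cost of needing enough rows in every combination.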
Population skew is nearly universally present; even week-over-week data is sometimes skewed in this way. Maybe it’s a holiday week and there was less time for users to upgrade so better devices are over-represented. Maybe the data is highly seasonal on a weekly basis so “Tuesday” is over-represented this week because of a slightly late release, or an early release, or whatever. Controlling for samples from certain days of the week could be as important as controlling for device class.
In fact, if you’re trying to decide whether your product is better or worse in any given week, you really have to control for population mix so that you get an apples-to-apples comparison, no matter the metric.
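One way to get that apples-to-apples number, sketched here under assumptions, is to compute the metric per segment and then weight the segments by a fixed reference mix rather than by whatever mix happened to show up that week. This reweighting is my illustration of the idea, not the only way to do it, and `mix_adjusted_mean` and the record fields are hypothetical.

```python
from collections import defaultdict

def mix_adjusted_mean(records, reference_mix, key, value):
    """Mean of value(r), with segments weighted by reference_mix
    instead of by how many rows each segment happened to send."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        segment = key(r)
        sums[segment] += value(r)
        counts[segment] += 1

    total = 0.0
    for segment, weight in reference_mix.items():
        if counts[segment] == 0:
            raise ValueError(f"no data for segment {segment!r}")
        total += weight * (sums[segment] / counts[segment])
    return total

def rows(segment, metric, count):
    return [{"cpu": segment, "metric": metric} for _ in range(count)]

# Made-up data: per-segment results are identical in both weeks,
# only the mix differs (lower is better).
week_a = rows("high", 1.0, 80) + rows("low", 3.0, 20)
week_b = rows("high", 1.0, 50) + rows("low", 3.0, 50)
reference = {"high": 0.5, "low": 0.5}

for week in (week_a, week_b):
    raw = sum(r["metric"] for r in week) / len(week)
    adjusted = mix_adjusted_mean(week, reference,
                                 key=lambda r: r["cpu"],
                                 value=lambda r: r["metric"])
    print(raw, adjusted)
```

In this toy data the raw means differ between the two weeks even though nothing changed within either segment, while the adjusted means agree.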
This kind of population control is excellent for other kinds of experiments too. It’s often the case that the thing you intend to try was actually experienced, by happenstance, by some fraction of your users recently. This could give you invaluable insight into the effect your intended change tends to have. I first saw this years ago working on MSN advertising: when considering a new ad placement strategy like “more, larger ads will be better” we could look at the previous week’s data and ask “Among those that got the larger ads, at random, did they do better?” The same approach works for a lot of different situations, e.g. “If we made an extra 10 MB of memory available on average, would it help metric X?” You can look at those people who (by happenstance of device load) had a little extra memory and ask “Did those people do better on metric X?”
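A rough sketch of that kind of look-back, assuming each record carries a segment, a flag for whether the user happened to get the condition, and the metric; all of the names below are hypothetical. Comparing within each segment keeps the population control from earlier in play, so a segment that simply had more of the lucky users does not dominate the answer. It only tells you what the existing data suggests, since the exposure was not assigned by you.

```python
from collections import defaultdict
from statistics import mean

def happenstance_effect(records, segment_of, exposed, metric):
    """Per segment: mean metric of users who happened to get the
    condition minus mean metric of users who did not."""
    groups = defaultdict(lambda: {True: [], False: []})
    for r in records:
        groups[segment_of(r)][bool(exposed(r))].append(metric(r))

    return {
        segment: mean(g[True]) - mean(g[False])
        for segment, g in groups.items()
        if g[True] and g[False]  # need both groups to compare
    }

# Made-up rows: did users with a little extra free memory do better
# on metric X (lower is better)?
data = [
    {"cpu": "low", "extra_mem": True, "x": 10},
    {"cpu": "low", "extra_mem": True, "x": 12},
    {"cpu": "low", "extra_mem": False, "x": 14},
    {"cpu": "low", "extra_mem": False, "x": 16},
    {"cpu": "high", "extra_mem": True, "x": 5},  # nothing to compare against
]
effect = happenstance_effect(data,
                             segment_of=lambda r: r["cpu"],
                             exposed=lambda r: r["extra_mem"],
                             metric=lambda r: r["x"])
```

Segments that lack either group are left out of the result rather than guessed at.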
Regularly controlling your population for reporting will get you thinking about your population characteristics all the time. This is also a good thing because if your population mix is changing you want to know about it. Maybe you’re gaining or losing important audience in different segments, or on different devices. This can be hugely valuable when deciding what marketing and engineering steps you might want to take.
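Watching the mix itself can be as simple as comparing segment shares between two periods. The sketch below is one way to do that; `mix_shift`, the `cpu` field, and the 2% threshold are arbitrary choices for illustration, not recommendations.

```python
from collections import Counter

def segment_shares(records, key):
    """Fraction of records falling in each segment."""
    counts = Counter(key(r) for r in records)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records")
    return {segment: c / total for segment, c in counts.items()}

def mix_shift(baseline, current, key, threshold=0.02):
    """Segments whose share moved by at least threshold,
    mapped to the change in share (current minus baseline)."""
    before = segment_shares(baseline, key)
    after = segment_shares(current, key)
    shifts = {}
    for segment in sorted(set(before) | set(after)):
        delta = after.get(segment, 0.0) - before.get(segment, 0.0)
        if abs(delta) >= threshold:
            shifts[segment] = delta
    return shifts

# Made-up periods: high-end devices grow from 50% to 60% of the data.
last_month = [{"cpu": "high"}] * 50 + [{"cpu": "low"}] * 50
this_week = [{"cpu": "high"}] * 60 + [{"cpu": "low"}] * 40
shifts = mix_shift(last_month, this_week, key=lambda r: r["cpu"])
```

A segment that appears in only one of the two periods still shows up, with its full share as the change.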
In short, thinking about your changing population and controlling for those changes is vital to accurately understanding the effects of your engineering efforts, and other efforts, in a timely fashion. It’s still not perfect but it helps.