Logging Tips: How much and which metrics

This is a pretty common question and actually there’s pretty good general engineering guidance on this kind of thing. Of course YMMV so please don’t take what follows as absolutes, but if you’re thinking about building something new, or rebuilding, just generally revisiting your logging what follows could be very useful:

1. Operations are not equally important

You could add all kinds of begin/end pairs all over the place, and likely you would go out of your mind. It’s important to keep things sane and choose the places that are most important. How to do this? I like to lean on my “letter grade” performance estimates.

Performance needs for any given feature/operation vary, some things really need to be an “A+” that’s world class performance. The best the machine could possibly give you. But, actually, there’s comparatively few of those and that’s a good thing because getting an “A+” is really hard. Most things just an “A”, an “A-”, a “B”, or even a “C” is just fine. In fact “C” is probably the most common grade. It doesn’t have to be great, it’s not that important, it needs to be pretty good. Of course you want very few “D” grades and no “F” grades. “F” means perf is so bad it’s not usable. You may as well delete it. Of course the grades are just a mental exercise, you’ll want to think about what does “B” or whatever mean in your context to help set goals and whatnot.

So, what does this have to do with logging? Your “C” features probably don’t need much logging. It’s pretty easy to get a “C” and so some sanity checking is probably all that’s necessary. Whereas your “A” features will probably need lots of logging, with major and minor stages and all kinds of attribution.

So survey your codebase, think about what really matters, and log according to the letter.

2. Log More Than Just Time

This may be counterintuitive but actually time is one of the least useful metrics for understanding your performance problems. I like to call it “data I can only cry about.” Now it’s not totally useless, especially if you have multiple stages and whatnot, so you can see where the time is going, but it’s not especially helpful for diagnosis. In my opinion, time is a secondary metric from an engineering perspective.

What are the primary metrics? Glad you asked. You need to think about consumption as it applies to your feature, these are metrics like “disk bytes read”, “network bytes read”, “system calls”, “major faults”, “total CPU usage”, “bytes allocated”, stuff like that. These metrics are invaluable because when they go south they give you a clue where to look. Disk bytes are up? What file? Who is reading that? More faults mean more code or data, where did that come from? There’s usually a clear follow-up action.

Consumption metrics are also better at forecasting problems in the lab, many of these metrics are insensitive to things like available memory, network conditions, and the like. They’re much less noisy than time and they are very good at telling you what changes are likely to result in improvements for your customers.

3. Integrate with Existing Logging Systems

I won’t say much about this because it’s probably pretty obvious. If you can piggy-back on existing system (whatever is normal for your operating system) then your events will have a tendency to enrich and correlate with other events creating a better overall tapestry.

Summary

Written by

I’m a software engineer at Facebook; I specialize in software performance engineering and programming tools generally. I survived Microsoft from 1988 to 2017.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store