Testing and Instrumentation
Last week Charity Majors wrote a series of tweets about testing software. The general thesis being:
- you can easily hit diminishing returns in testing, and people often over-invest
- there are so many combinations it’s just not at all feasible to prevent all your problems with testing
- therefore, you should really set things up so that you can detect things in production when they happen, and,
- you want it to be the case that, if bad stuff happens in production, it’s not that bad
I think this is all right on. However, I also think that sometimes people take good advice and read it wrong. For instance, "premature optimization is the root of all evil" is only part of the quote, and if people read the whole quote they'd likely do a lot better. See here and here.
Disclaimer: I’m about to play the dangerous game of telling you what I think Charity is saying without actually talking to her about it directly and so if I’ve got any of this wrong, I’m very sorry. Please don’t assume we have agreed in advance on all of this. And frankly there is still plenty of room for reasonable disagreement as you will soon see.
Conclusions you should not make:
Testing has diminishing returns so don’t test
Notwithstanding that it’s easy to go overboard, most systems you will encounter are not remotely in danger of going overboard… it’s far easier to encounter systems with single digit % coverage.
High levels of coverage are inherently overkill
You really have to decide for your own system what the appropriate level of coverage is going to be and stick to it. It's actually not that hard to get to 100% block coverage in any codebase, and in many (most?) cases of professional software that's going to be worth it; I consider it "ante".
Let’s dig into this a little bit. Why is it worth it to have baseline coverage at 100%?
The main thing is this: it means you have a suite that ensures every line of code runs at least once, and if the tests are unit tests (I hope they are) the suite can run in an automated fashion with clear failure modes. And why do we care? Well, the tests of course do some validation, even if it's just basic validation, but even if they did only the very most basic correctness checking you can still milk that, because there are various basic invariants you can enforce, e.g.
- for native code, you can run the tests with ASAN, TSAN, LeakSAN etc.
- you can hook all your logging code and ensure that no PII ever leaks
- you can enforce other invariants such as all UTF8 conversions are checked or, all SQLite statements are finalized
- you can use the correctness tests as a base for fuzzing in various areas
Let me talk about each of these things specifically. I’ll give some real-world examples from my job.
Sanity Check
ASAN/TSAN/LeakSAN etc. are self-evident I think: you just build the test suite with the appropriate compilation mode and run it as usual. Any test failures give you immediate actionable signal. This is really great at stopping huge classes of wild pointers and double frees.
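To make this concrete, here's a minimal sketch of the kind of bug that falls out for free once the suite runs under ASAN. The file name and build command are just placeholders (shown for clang); the point is that any test that happens to execute this line fails immediately with a full stack trace.

```cpp
// Build the test code with ASAN enabled, e.g.:
//   clang++ -fsanitize=address -g example.cpp -o example && ./example
#include <cstdio>

int main() {
    int* counter = new int(41);
    delete counter;
    // Under ASAN this read is reported as a heap-use-after-free with a
    // stack trace; without ASAN it may quietly "pass" for years.
    printf("%d\n", *counter + 1);
    return 0;
}
```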
PII
If you set up your test infra so that all sources of test PII are readily recognized (e.g. they have a poison pattern) then you can look for that pattern in places it does not belong. For instance, maybe every user name in the test suite includes the text "$user"; you can then shim all the logging APIs, and if "$user" ever appears in a log you fail the test on the spot. Likewise you can look for "$user" in any stored files and ensure it is not anywhere it's not supposed to be. Note that many applications are in the business of storing PII, but that doesn't mean it's allowed to be everywhere; it's supposed to go in certain exact places, and you can automatically scrub test results to ensure it isn't anywhere inappropriate.
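Here's a minimal sketch of the poison-pattern idea, assuming a test build where all logging funnels through one shimmable entry point. The names (TestLog, FailTest, kPiiMarker) are hypothetical, not anyone's actual API.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Every piece of test PII contains this marker, e.g. a user named "bob$user".
static const std::string kPiiMarker = "$user";

// Hypothetical test-failure hook.
void FailTest(const char* why) {
    fprintf(stderr, "TEST FAILED: %s\n", why);
    abort();
}

// Shimmed logging entry point: in the test build every log line passes
// through here, so any leaked PII fails the test on the spot.
void TestLog(const std::string& line) {
    if (line.find(kPiiMarker) != std::string::npos) {
        FailTest("PII marker leaked into a log line");
    }
    fprintf(stderr, "%s\n", line.c_str());  // normal logging continues
}
```

The same scan can run over any files the test produced, which covers the "stored somewhere it doesn't belong" case.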
Invariants
The test infra I use has a facility to shim/fake all of SQLite; this is useful for creating whatever error conditions you might need, which is very normal. However, in addition, the SQLite shims also track prepared and finalized statements, and the test system will fail your test should these calls not balance. This means you can be confident that statements are getting finalized even on error paths, or exotic paths. This isn't perfect, as we'll see below, but like ASAN it goes a long way.
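A minimal sketch of what that balance check might look like, assuming the harness routes SQLite calls through shims. The shim and check names are hypothetical; the sqlite3_* calls are the real APIs.

```cpp
#include <sqlite3.h>
#include <cassert>

static int g_prepared = 0;
static int g_finalized = 0;

// Shimmed prepare: counts every statement that was successfully prepared.
int ShimPrepare(sqlite3* db, const char* sql, sqlite3_stmt** stmt) {
    int rc = sqlite3_prepare_v2(db, sql, -1, stmt, nullptr);
    if (rc == SQLITE_OK) g_prepared++;
    return rc;
}

// Shimmed finalize: counts every statement that was actually finalized.
int ShimFinalize(sqlite3_stmt* stmt) {
    if (stmt != nullptr) g_finalized++;
    return sqlite3_finalize(stmt);
}

// Called by the harness at the end of every test; a statement prepared on
// some error path but never finalized fails the test right here.
void CheckSqliteBalanced() {
    assert(g_prepared == g_finalized && "leaked sqlite3_stmt: missing finalize");
}
```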
Other patterns can be checked with some helpers. For instance, there might be many places where your system converts free UTF8 text into "strings", and the conversion might fail if the input is not well-formed UTF8. It's easy to miss these checks. So again, you add a counter in the UTF8 converter, +1 on each conversion, and then you use a check function that validates the converted string for non-nullness and also does a +1. If the counts don't match at the end of the test, some call site skipped the check. It's cheap and it's very good at finding places that are missing the check.
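A minimal sketch of the counter trick, assuming all conversions in the test build go through one helper; every name here is hypothetical.

```cpp
#include <cassert>
#include <optional>
#include <string>

static int g_conversions = 0;  // bumped by every conversion
static int g_checks = 0;       // bumped by every validity check

// Hypothetical converter: counts itself, returns nullopt on bad input.
// (Real UTF-8 validation elided; only the counting matters for the trick.)
std::optional<std::string> FromUtf8(const char* bytes) {
    g_conversions++;
    return bytes ? std::optional<std::string>(bytes) : std::nullopt;
}

// Callers are expected to route results through this check, which also counts.
bool CheckUtf8(const std::optional<std::string>& s) {
    g_checks++;
    return s.has_value();
}

// End-of-test hook: if some call site converted but never checked, the
// counters won't match and the test fails right here.
void VerifyAllConversionsChecked() {
    assert(g_conversions == g_checks && "unchecked UTF-8 conversion");
}
```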
Error Fuzzing
Some paths are hard to check, and new paths can be added over time; to keep yourself sane, a "fuzzing" solution can be better than ad hoc tests for every failure combo. For instance, referring to the SQLite example again, it's easy to shim SQLite, and in the course of a test that should pass you can count all the calls: 5 prepares, 6 steps, 12 column reads, 5 finalizes, and so forth. You can then re-run the test in a sort of "failure expected" mode where you:
- run the test as usual, but fail the nth SQLite operation using the counts you previously got
- verify that ASAN etc. still works, and all invariants still are, well, invariant
- to pass overall, the test should report "failed" for each of these variations; if a variation doesn't fail, either an error case was ignored or the test did not properly detect the failure, and either way a fix is required
- after this you can be sure that every SQLite API call has at least minimal error checks where needed
This technique can be adapted to any observable system, which is to say anything you're mocking: have the mock report a failure on the nth call and verify the system copes. A minimal sketch of the loop follows.
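Here's that sketch of "fail the nth operation" error fuzzing, reusing the hypothetical SQLite shims from above; the harness plumbing (how a test reports its own pass/fail) is hand-waved into a callback.

```cpp
#include <sqlite3.h>

static int g_op_count = 0;   // operations observed during the current run
static int g_fail_at = -1;   // -1 means "don't inject a failure"

// Shimmed step: fails exactly the nth operation when a failure is scheduled.
int ShimStep(sqlite3_stmt* stmt) {
    if (++g_op_count == g_fail_at) {
        return SQLITE_IOERR;  // injected failure for this variation
    }
    return sqlite3_step(stmt);
}

// Hypothetical harness loop: the baseline run counts the operations, then
// each variation re-runs the test failing exactly one of them. Every
// variation must report failure, and all the other invariants (ASAN,
// finalize balance, etc.) must still hold during every run.
void RunErrorFuzz(bool (*test_reports_failure)()) {
    g_fail_at = -1;
    g_op_count = 0;
    test_reports_failure();            // baseline: count the operations
    const int total_ops = g_op_count;

    for (int n = 1; n <= total_ops; n++) {
        g_fail_at = n;
        g_op_count = 0;
        bool failed = test_reports_failure();  // re-run, failing op n
        if (!failed) {
            // a silent "pass" means some error path was ignored; fix required
        }
    }
    g_fail_at = -1;
}
```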
When To Stop
You could keep adding tests literally forever. After 100% block coverage comes 100% branch coverage, which is a lot harder to achieve; few codebases even aspire to this (SQLite is a notable one that does). If your library is going to run on a few billion devices and you want a regular release cadence, you might also want this much coverage, but for most systems it's likely overkill.
And what about combinations? Even 100% branch coverage doesn’t give you proof of correctness (SQLite still gets bugs from time to time). For instance, a lot of bugs can come from just being flat wrong about what “correct” even should be. If you think you’re supposed to return “5” in some case but you are actually supposed to return “7” a unit test is likely to just verify that the code is doing what you thought it should do: return “5”. Maybe if someone else wrote the tests that might help. But then you’re still left with all the combinatorics of “this block ran before that block then this other block didn’t run and then this final block did run but in this special way.”
There’s just no way to get all the combinations. It’s simply not happening.
So then what?
Well, two things, and now we come back to Charity’s thesis:
- You have to know what kinds of mistakes your devs tend to make, and which ones you simply cannot afford to allow into production (e.g. security, privacy), focus your tests there
- For everything else you need insight from production, and that means high quality actionable instrumentation and you better have tests for THAT. If I had a dime for every bit of useful logging that went untested and shortly became useless… Telemetry is Oxygen.
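Here's a minimal sketch of what "tests for THAT" might look like, assuming telemetry funnels through a shimmable emit function in the test build; the event name, fields, and helpers are all hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct Event {
    std::string name;
    std::map<std::string, std::string> fields;
};

static std::vector<Event> g_events;  // test shim captures emitted telemetry

// Shimmed emit point: in the test build every event lands here.
void EmitTelemetry(const Event& e) { g_events.push_back(e); }

// Hypothetical code under test: should emit "sync_failed" with a reason.
void OnSyncFailure(const std::string& reason) {
    EmitTelemetry({"sync_failed", {{"reason", reason}}});
}

void TestSyncFailureIsInstrumented() {
    g_events.clear();
    OnSyncFailure("timeout");
    // If someone renames the event or drops the field, this fails in CI
    // long before the dashboard quietly goes dark.
    assert(g_events.size() == 1);
    assert(g_events[0].name == "sync_failed");
    assert(g_events[0].fields.at("reason") == "timeout");
}
```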
You can temper all of this — maybe you have a pre-production environment; maybe you have incremental roll-out; maybe you have integration and end-to-end tests as well as unit tests. These things also add cost, but they can also save your bacon. And these environments can be your first line of canaries in terms of what’s going wrong and, if certain mistakes are common, they tell you where you should shore up your tests. Remember, you want to catch the typical mistakes the soonest. Let experience guide you. Testing literally everything is not possible, and certainly not remotely economical.
It all comes down to this: we can't ever afford to ship disasters, but there are a lot of things we can do to make disasters exceedingly unlikely.
Every system has some "ante" level of tests, and even 100% block coverage is not that high of a bar; with good mocks it's just not that hard to hit every line of code once. But every non-trivial system soon explodes in combinatorics. Shipping non-disasters can require that you have, say, 100% block coverage, but it can't require that all the code combos are explored. It's simply not doable.
If you want to ship with good cadence, you’re going to need tests to ensure your essentials are well protected — economically protected — and then you’re going to need live insights for the rest. If you don’t have both things, you’re living in some kind of dream that will soon be a nightmare.