100% Unit Testing — Now It’s Ante
Introduction
I have written many times about the importance of high-quality automated tests in software development. I’ve been especially vocal about the need for complete unit test coverage. Sometimes I get push-back, and usually I can defuse that push-back with some (I think) well-reasoned philosophy.
At my core I am a pragmatist: I do things because they work, not because they look nice on paper. So, my choices tend to be rooted in a desire to get specific outcomes, and that often makes them easier to understand. Communicating those outcomes, and the choices I make in order to achieve them, is the point of this article.
The Full Breadth of Testing vs. Unit Testing
I have never talked or written about the full breadth of testing, at all. This may be the first time I do so — and I’m only doing it now to make it clear why I don’t do it and what I’m talking about instead. In order to consider some product, or library, or service “fully tested” there are many, many angles you have to probe. I’m going to quote a paper by Dan North where he describes several such things. In his words, we have:
- Functional correctness: It doesn’t produce the results we expect.
- Reliability: It mostly shows correct answers but sometimes it doesn’t.
- Usability: Sure, it works but it is inconsistent and frustrating to use.
- Accessibility: Like usability, but exclusionary and probably illegal.
- Predictability: It has random spikes in resources such as memory, I/O, or CPU usage, or occasionally hangs for a noticeable amount of time.
- Security: It works as designed but it exposes security vulnerabilities.
- Compliance: It works, but, for instance, it doesn’t handle personal information correctly.
- Observability: It mostly works, but when it doesn’t it is hard to notice and harder to determine why.
Without getting too much into each one I think we can see there’s overlap, and I think we can see that some of these things are not completely objective. Many are likely to have requirements that evolve over time and require re-visiting. Some of those properties make them pretty poor targets for automated testing.
When I talk about testing — and unit testing in particular — I mean something much simpler than any of the above factors. I mean, for the most part, “Is the code at least doing what I intended it to do?” My contention is that this is about the best you can hope to do with automated testing, and it is less than all of the above. It isn’t even necessarily “functional correctness” — but I still think this amount of testing is essential.
It’s my contention that the practices I describe come at significant negative opportunity cost (i.e., the expected cost you will pay for not doing them far exceeds your investment to do them). This assertion usually results in an avalanche of objections, so in the rest of this article I’ll try to deal with the most common objections directly while describing the outcomes I’m looking for.
“The Juice Isn’t Worth the Squeeze”
There are several objections that circle around the same considerations:
- “Tests beyond X% give diminishing returns”
- “100% coverage doesn’t guarantee correctness anyway, why bother?”
I generally recommend 100% block coverage as a minimum. Sometimes people find this incredible but, as I said, I have specific goals in mind. Let’s deal with the objections first and then on to the goals.
That unit-testing inherently suffers from diminishing returns is just flatly wrong.
Firstly, if you have (e.g.) 1000 tests, the 1001st test is not inherently harder to write than the 1000th was; that never happens. If anything, it’s typically easier to write the n+1st test, because tests tend to use all the same infrastructure to get their job done.
Secondly, tests encountering new untested code are just as likely to find bugs in that code whether it was the first 1% or the last 1%. This is borne out in my experience all the time. When I’m down to the last few blocks to test, I’m thinking “well this is a waste but ok I’ll do it because hygiene” and then I look at the code, thinking about how to test it, and I see it’s wrong. This happens all the time.
If writing the next test is hard, and you feel like returns are diminishing, it’s almost certainly because you are entering an area with a new/unique dependency, and you do not yet have the tools to do the required mocking in your hands. This is not a reason to stop. But it is a reason to expect a new wave of bugs due to a bunch of code that could not be tested with your current tools.
My most recent drive to 100% was in the Messenger code base in its lower layers and SQLite is used extensively. It’s a little tricky to mock, but not that tricky; in half a morning, I was able to create fake connections (the returned connection is a string literal) and fake statements (also string literals) and yield fake results from them with a test setup that was a lot like Mockito.
Having done this, the “Achievement Unlocked” music plays and entire new classes of problems are yours to conquer. It is suddenly trivial to fake cases where statement preparation fails, or connection fails, or sqlite3_step fails, or commits fail... These are excruciatingly hard to test any other way, and the code was full of bugs in corner cases. This is where you find bugs like missing or inappropriate error logs, memory leaks, and all manner of such things down the important but rarely-used paths.
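To make that concrete, here is a minimal sketch of what such a fake might look like, assuming the code under test calls SQLite through a small seam. The names (Db, FakeDb, insert_row) and result codes are hypothetical, not the actual Messenger test infrastructure.

```cpp
// Minimal sketch of a fake SQLite-style layer (illustrative names only).
// Production code talks to this small seam instead of calling sqlite3_*
// directly, so tests can force any call to fail.
#include <cassert>
#include <string>

// Result codes mirroring the handful of sqlite3 codes the code cares about.
enum Rc { RC_OK = 0, RC_ERROR = 1, RC_ROW = 100, RC_DONE = 101 };

// The seam the production code is written against.
struct Db {
  virtual ~Db() = default;
  virtual Rc prepare(const std::string& sql) = 0;
  virtual Rc step() = 0;
  virtual Rc finalize() = 0;
};

// A fake that "prepares" statements trivially and fails on demand.
struct FakeDb : Db {
  Rc prepare_result = RC_OK;  // set to RC_ERROR to simulate a bad statement
  Rc step_result = RC_DONE;   // set to RC_ERROR to simulate a failed step
  int prepares = 0, steps = 0, finalizes = 0;

  Rc prepare(const std::string&) override { ++prepares; return prepare_result; }
  Rc step() override { ++steps; return step_result; }
  Rc finalize() override { ++finalizes; return RC_OK; }
};

// Code under test: must report failure and still clean up when a step fails.
bool insert_row(Db& db) {
  if (db.prepare("INSERT INTO t VALUES (1)") != RC_OK) return false;
  Rc rc = db.step();
  db.finalize();  // must run even when step fails
  return rc == RC_DONE;
}

int main() {
  FakeDb db;
  db.step_result = RC_ERROR;            // force the rare "step failed" path
  assert(!insert_row(db));              // the failure is reported...
  assert(db.finalizes == db.prepares);  // ...and everything prepared was finalized
}
```

Once a fake like this exists, each rare failure path is just one more line of test setup.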
Later, we’ll talk about the value of good tests, but for now, as a thought experiment, suppose the tests merely execute the flows and otherwise verify nothing. Even then (and I don’t recommend this) you still get tons of value. Those same tests can be run under ASAN, LEAKSAN, TSAN, UBSAN, all the SANs! That means, with no extra effort, if you (e.g.) leak a connection, even if the test validation doesn't notice the leak, your test infra will. And that means bugs like “we don’t properly clean up this connection on the X error path” can’t happen.
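To illustrate (this is not Messenger code; all names are made up), here is the kind of test that asserts nothing at all, yet still fails the run when built with -fsanitize=address, because LeakSanitizer reports the connection that was never closed on the error path.

```cpp
// Illustrative only: a "test" that validates nothing, but still pays off under
// the sanitizers. Build with: clang++ -fsanitize=address leak_test.cpp
#include <stdexcept>
#include <string>

struct Connection { std::string name; };

Connection* open_connection() { return new Connection{"db"}; }
void close_connection(Connection* c) { delete c; }

// Buggy code under test: throws before closing the connection it opened.
void do_work(bool fail) {
  Connection* c = open_connection();
  if (fail) throw std::runtime_error("step failed");  // leaks c
  close_connection(c);
}

int main() {
  try {
    do_work(/*fail=*/true);  // just drive the error path, no assertions at all
  } catch (const std::exception&) {
    // swallowed: the test "passes"...
  }
  // ...but an ASAN/LSAN build reports the leaked Connection when the process exits.
}
```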
Now if it is truly, epically hard to shim in a dependency, something is fundamentally wrong with your code. That situation itself is already telling, and you should use any difficulties you have testing as a solid reason to refactor and simplify, not as a reason to write super-complex end-to-end tests for everything, or something equally bad.
Again, the net of this is that you get two chances to think about your overall design and how the code is factored. And when it comes time to test it, if you feel like it’s hard to test because the control flow is crazy, this is telling you something! Don’t write those tests, fix the code.
When I did this in Messenger I found all the usual things. Dead checks for errors that couldn’t possibly happen, and missing checks that were totally necessary. Missing error logging, inappropriate error logging, memory leaks, you name it. Testing boils out problems systematically.
Now, why 100% and not some other number, like say 98%? People are going to make mistakes and forget to test. At any given moment it will be less than perfect. If you’re doing well, you will “bounce” off of 100% frequently. But, since the goal is 100%, it’s easy to see what’s missing and track it. You find who added those lines and open tasks. If necessary, you horse-trade or do whatever is needed to get it done. But you never have blocks of code that have a durable exception. The reason is that as soon as the goal is less than 100% you have to have this steady state “oh but not that stuff” in your head. This is way too hard to monitor. The more protest there is in any given area, the more likely it is that it badly needs testing.
Now as for absolute correctness, I never even aspired to that. The best we can hope for — the best — is that the code is working exactly as the author intended it to work. If the author is simply confused about what the correct result looks like, then no unit test written by the author is going to find such a problem. They will of course test that it works the way they think it’s supposed to. So, barring a personal revelation like “wait a minute this isn’t right” — which might happen — we can’t expect unit tests to save us. That’s ok. Getting to a point where the code is at least working as intended is already a huge step that should not be shortchanged. The rest has to come from feedback, and probably not even automation.
Remember we don’t need perfection for the tests to be valuable, only economy. And just one saved incident can easily pay for weeks of testing effort. My experience with Messenger was that just me writing tests for two months took at least two calendar months off the release date for the entire team. Those tests are still paying dividends years after I wrote them, and after I’ve moved on to another company.
So, the outcomes we’re looking for are:
- everything runs in a test at least once to enable sanitizers,
- error cases get exercised, however rare,
- it’s super easy to spot the code that still needs tests.
“You Can Game the System!”
It’s true you can game the system. Probably the easiest way to game the system is to write bad tests. But the possibility of bad tests doesn’t kill the whole idea.
First of all, with a little monitoring, bad tests are not such a big problem. Every bad test identifies a developer who needs more help understanding how to write good tests and what their value is. Understanding these things will help them to be a better developer. Making better developers is the most important thing I can do in my job. So, ironically, bad tests help me do my job. And the tests don’t stay bad if you are watchful. The flaky tests are the ones that need the most watching. But I digress.
So, in short, yes, you can game the system. But only if nobody is watching.
If you review the tests just as you would the code, and take note of what the testing is telling you about the code, you are going to find great opportunities.
If your developers are writing bad tests with tons of copy-pasted logic, maybe the problem is that you didn’t invest in a good faking system, and the devs find it easier to clone whatever works than to do it right. Developers do what works, so make that work for you. The Pit of Success is a weapon in test frameworks, too. Bad test smells are valuable information about your test infra, your code infra, and your dev culture, all of which are worth paying attention to.
Developers are always able to do things they aren’t supposed to. “Look Ma, I’m casting this const away weeee...” We simply don't let it happen with a process of reviews and CI/CD and so forth. So read the tests and review them. Don't let them get bad. Learn from bad tests just like you learn from bad code.
“The Code Already Runs in an Integration Test”
Integration Tests, and, even worse, End to End Tests, are notoriously hard to keep running and to debug when they fail. The more of the system you “integrate” the more complicated the tests become and usually the slower they run and the more weird kinds of errors they produce.
Integration Tests generally validate far fewer outcomes during the course of their run than unit tests because what is visible to them is much more limited and the number of error paths and variations they can afford to explore is much less.
Unit tests by contrast can run maybe 10,000 times faster than Integration tests — often finishing in microseconds. They can use sneaky tricks like pretending a 30s timeout happened even though it’s only been 30ns by faking the clock. They have more precise control over exactly what code runs and typically report issues very near to the source of the problem. An integration test might record something bogus in the log and not see an actual failure until minutes later, making diagnosis much more complicated.
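For example, here is a hedged sketch of the fake-clock trick, assuming the code under test reads time from an injectable clock; all of the names are hypothetical.

```cpp
// Hypothetical sketch: the code under test asks an injected clock for the time,
// so a test can "wait" 30 seconds in nanoseconds of real time.
#include <cassert>
#include <cstdint>

struct Clock {
  virtual ~Clock() = default;
  virtual int64_t now_ms() = 0;
};

struct FakeClock : Clock {
  int64_t t = 0;
  int64_t now_ms() override { return t; }
  void advance_ms(int64_t ms) { t += ms; }
};

// Code under test: reports a timeout once more than 30 seconds have elapsed.
bool timed_out(Clock& clock, int64_t started_ms) {
  return clock.now_ms() - started_ms > 30'000;
}

int main() {
  FakeClock clock;
  int64_t start = clock.now_ms();
  assert(!timed_out(clock, start));
  clock.advance_ms(30'001);  // "30 seconds" pass instantly
  assert(timed_out(clock, start));
}
```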
And yet… unit tests have the huge limitation that they are bounded by the imagination of their author. So, we must add at least some larger tests if only to generate the noisy reality that is normal in our systems so that it can inform our unit tests over time. A mix is crucial.
Unit tests are only a baseline. But if you have 100%, then you know for sure there is no error you can’t mock, so when something important does come along (and it will), something that you really need to cover, you will be equipped to do so.
“The Tests Keep Breaking! It’s Slowing Me Down!”
If this is really slowing you down, then you have Bad Tests again. They are probably Bad because they violate one of these two principles:
- When a unit test fails it should be trivial to see why it failed — unlike end-to-end tests, the failure should always be close to the issue.
- When a unit test runs it should validate appropriate things, not everything — over-validation causes specious failures and makes refactoring painful.
So, it comes down to validation. I recommend two kinds:
- Check for evidence that the code you thought was running is in fact the code that ran. Sometimes tests look for a certain result code, and they get it, but the subject function never actually ran because the test arguments were not right. Not an error, but not right for this test case. The test has validated the wrong success (or failure).
- Check for evidence that the thing you expected to happen, happened. The file was opened, the lines were added, the record was inserted, or whatever the case may be. Any errors that should have been logged are also validated here. The testing system should also be looking at other things not specifically validated by the test but that are normal invariants in your universe, such as “all the statements that were prepared were also finalized” or something like that. (There is a small sketch of both kinds of validation below.)
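Here is that sketch; it is hypothetical, and FakeStore and save are made-up names for illustration, not part of any real framework.

```cpp
// Illustrative only: check that the intended code ran AND that the intended
// effect happened, without pinning down every incidental detail.
#include <cassert>
#include <string>
#include <vector>

struct FakeStore {
  std::vector<std::string> records;
  int inserts_attempted = 0;
  bool insert(const std::string& r) {
    ++inserts_attempted;
    records.push_back(r);
    return true;
  }
};

// Code under test: inserts only non-empty records (empty input is an early-out).
bool save(FakeStore& store, const std::string& r) {
  if (r.empty()) return false;  // store is never touched on this path
  return store.insert(r);
}

int main() {
  FakeStore store;
  bool ok = save(store, "hello");

  assert(ok);                            // the result code we wanted...
  assert(store.inserts_attempted == 1);  // ...plus proof the insert path really ran,
                                         // not an accidental early-out "success"
  assert(store.records.size() == 1 &&
         store.records[0] == "hello");   // the thing we expected to happen, happened
  // Note what is NOT asserted: unrelated log lines, incidental calls, or
  // "everything exactly like last time". That kind of over-validation is what
  // makes refactoring painful.
}
```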
Don’t create unit tests that effectively demand that everything happen exactly like the last time. Things change. Other systems may add logging that is not directly relevant to the test but incidentally has to run. You don’t want to have to control for everything. So, validate what is essential to the test, and not all the other random things that also maybe happened. Everything else should have its own tests that specifically validate it.
Now if the failures are pretty obvious and they require a tweak for your new behavior, this is not bad, this is good. When you make a change and a test or two fails, this should give you confidence! You can see the tests are looking at your changes and validating. If you just changed some rules, you can see immediately what the new correct thing should look like and make any needed test adjustment. It’s like double-entry accounting — the tests are the credits, the code is the debits. They match.
It’s normal for some tests to break in easy obvious ways when you change behavior. If no tests break after any significant work, something is wrong.
Now integration tests, that’s harder.
“This Code Hardly Ever Runs; It’s Not Worth It!”
“Oh, my sweet summer child,” Old Nan said quietly, “what do you know of fear? Fear is for the winter, my little lord, when the snows fall a hundred feet deep, and the ice wind comes howling out of the north.” ― George R.R. Martin, A Game of Thrones
There was a time when crashes were the big problem facing professional developers. In those days, the code that ran the most often was the most important. That was Summer. It’s Winter now.
In Messenger, crashing was not even remotely close to the worst thing that could happen. In fact, crashing is almost always vastly preferable to the worst thing. By the way, in Messenger, by far the worst thing the application can do is deliver a message to the wrong destination. That is a catastrophe, and so the guards on that path are, ahem, significant.
The details vary but whatever your situation may be, I feel that these days crashing is almost certainly not the worst possible problem. When I worked on Microsoft Edge the priorities were:
- Do not let the customer get pwnt (security).
- Do not leak the customer’s data (privacy).
- Render the content faithfully.
- Do all this economically.
Why? Well, #1 comes first because if you lose #1 you’ve also lost #2. Both of which matter more than #3. And #4 is moot if the content is all wrong (me: “It’s easy to make it fast if it doesn’t have to work”).
Doing badly on any of these is not a good place to be, but crashing when you get into an unknown state rather than risk #1 or #2 above is easily the better choice.
You see, attackers do not care if the code they are attacking is on a rare path. In fact, they probably prefer that it is on a rare path because such code is less likely to be tested or fixed. ALL the paths need to be run under ASAN or whatever other tools you are using to help shake out problems, because attackers will force the weakest code to run.
Turning to privacy now, a data leak on a path so rare that only 0.1% of your customers see it is still a catastrophe. For a large company 0.1% might still be half a million people. The cost in lost confidence, to say nothing of possible fines, is staggering.
If all the code runs you have a fighting chance to peek at data created on each path and (e.g.) spot places where the wrong data is getting logged; spot places where locals were uninitialized; spot places where your business rules were not followed, whatever they may be. And if you need to make changes, the very same tests will give you confidence that you haven’t screwed everything up in the process of “fixing it.”
Simple techniques like controlling all the surnames, or SSNs, in the tests, so that you can easily spot them in places they don’t belong are invaluable because even without explicit validation your test infra can always be watching for such violations.
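Here is a hedged sketch of that idea; the sentinel values and the helper are made up, but the shape is the point: every test uses known fake personal data, and an infra-level sweep scans everything the test produced for those sentinels.

```cpp
// Illustrative only: all tests use known sentinel personal data, and a blanket
// sweep checks captured output (logs, telemetry, etc.) for any sentinel that escaped.
#include <string>
#include <vector>

// Sentinels used by every test in place of real personal data.
const std::vector<std::string> kSentinels = {"Mc TestFace", "000-00-0000"};

bool contains_sentinel(const std::string& text) {
  for (const auto& s : kSentinels)
    if (text.find(s) != std::string::npos) return true;
  return false;
}

// Buggy code under test: logs the SSN it was only supposed to store.
std::string make_log_line(const std::string& user, const std::string& ssn) {
  return "created account for " + user + " ssn=" + ssn;  // privacy bug
}

int main() {
  std::string log = make_log_line("Mc TestFace", "000-00-0000");
  // No test author ever wrote "assert the SSN is not in the log", but the
  // infra-level sweep still catches the leak and fails the run.
  if (contains_sentinel(log)) return 1;  // fail: sentinel data escaped into a log
  return 0;
}
```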
How many privacy incidents do you have to prevent to pay for all your testing this year? Got a number? Now consider all the tests you wrote this year, how many incidents will they prevent next year with just basic upkeep?
Conclusion
100% unit-test coverage is absolutely positively not enough to get you to the finish line. You will need a lot of other kinds of things to round out the picture. In 2024, 100% unit-test coverage is only ante.
One of my favorite test managers had this notion of “acceptance testing”, which meant “good enough to accept for real testing”. It was the baseline set of things the code had to be able to do to even bother scheduling a real test pass. We used to think of it as “ante”; it was our real evidence that we were “code complete”. Unit tests can fill this role. The code may not be perfect, but it is at least substantially working as intended.
As new problems filter down from other testing techniques, some may be shifted left into the unit tests, locking down those behaviors. Some problems may be hard to shift. Fair enough. But everyone has to pay the big blind when it’s their turn — or go home.