Notes on Unit Testing and Other Things

People often ask me what constitutes a good amount of unit testing. I think the answer is usually a very high level of coverage, but it’s fair to say that there are smart people who disagree. The truth is probably that there is no one answer to this question. Notwithstanding this, I have been able to give some pretty consistent guidance with regard to unit testing, which I’d like to share. And, not surprisingly, it doesn’t prescribe a specific level of testing.

No coverage is never acceptable

While it’s true that some things are harder to test than others, I think it’s fair to say that 0% coverage is universally wrong. So the minimum thing I can recommend is that everything have at least some coverage. This helps in a number of ways, not the least of which is that each subsequent test is much easier to add.

Corollary: “My code can’t be unit tested” is never acceptable

While you may choose to test more lightly (see below), it is always possible to unit test code; there are no exceptions. No matter how crazy the context, it can be mocked. No matter how entangled the global dependencies, they can be faked. The greater the mess, the more valuable it will be to disentangle the problems to create something testable. Refactoring code to enhance its testability is inherently valuable, and getting the tests to “somewhere good” is a great way to quantify the value of refactoring that might otherwise go unnoticed.
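
To make that concrete, here’s a minimal sketch of the “seam” approach, where a hypothetical function reads a global config unless a fake is injected; all of the names here are made up for illustration:

    import unittest

    CONFIG = {"retry_limit": 3}   # the entangled global dependency

    def retries_remaining(attempts, config=None):
        # the "seam": production reads the global, tests inject a fake
        cfg = config if config is not None else CONFIG
        return max(0, cfg["retry_limit"] - attempts)

    class TestRetries(unittest.TestCase):
        def test_fake_config_controls_behavior(self):
            fake = {"retry_limit": 5}   # the faked "global"
            self.assertEqual(retries_remaining(2, config=fake), 3)
            self.assertEqual(retries_remaining(9, config=fake), 0)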

Unit testing drives good developer behavior overall.

Unit Tests Cross Check Complex Intent

The very best you can do with a unit test, in fact the only thing you can do, is verify that the code does what you intended it to do. This is not the same as verifying correctness but it goes a long way.

  • you create suitable mocks to simulate any situation you need

Given that unit tests are good for the above, you then have to consider “Where will those things help?”

When creating telemetry, you can observe that you log the correct things the correct number of times.

  • shipping logging that is wrong is super common and it can cost days or weeks to find out, fix it, and get the right data… (a sketch of such a test follows)
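
Here’s a minimal sketch of such a test, with a hypothetical process_batch function and an injected logger standing in for real telemetry:

    import unittest
    from unittest.mock import Mock

    def process_batch(items, logger):
        for item in items:
            logger.info("processed %s", item)          # one line per item
        logger.info("batch done, count=%d", len(items))

    class TestTelemetry(unittest.TestCase):
        def test_logs_once_per_item_plus_summary(self):
            logger = Mock()
            process_batch(["a", "b", "c"], logger)
            # the correct things, the correct number of times
            self.assertEqual(logger.info.call_count, 4)
            logger.info.assert_called_with("batch done, count=%d", 3)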

When you have algorithms with complex internal state, the state can be readily verified.

  • the tests act as living documentation for the state transitions (see the sketch below)
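
For instance, a toy connection state machine (purely illustrative) whose handshake transitions are pinned down by a test:

    import unittest

    class Connection:
        """Toy state machine: CLOSED -> CONNECTING -> OPEN -> CLOSED."""
        def __init__(self):
            self.state = "CLOSED"
        def connect(self):
            self.state = "CONNECTING"
        def on_ack(self):
            self.state = "OPEN"
        def close(self):
            self.state = "CLOSED"

    class TestConnectionStates(unittest.TestCase):
        def test_handshake_transitions(self):
            conn = Connection()
            conn.connect()
            self.assertEqual(conn.state, "CONNECTING")
            conn.on_ack()
            self.assertEqual(conn.state, "OPEN")
            conn.close()
            self.assertEqual(conn.state, "CLOSED")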

When you have algorithms with important policy, that policy can be verified.

  • many systems require “housekeeping” like expiration, deletion, or other maintenance (an expiration sketch follows)
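
A sketch of the idea, assuming a hypothetical cache that takes an injectable clock so the test can fast-forward time:

    import unittest

    class Cache:
        def __init__(self, ttl_seconds, clock):
            self.ttl = ttl_seconds
            self.clock = clock      # injectable time source (e.g. time.time)
            self.items = {}         # key -> (value, stored_at)
        def put(self, key, value):
            self.items[key] = (value, self.clock())
        def get(self, key):
            value, stored_at = self.items[key]
            if self.clock() - stored_at > self.ttl:
                del self.items[key]   # housekeeping: expire stale entries
                raise KeyError(key)
            return value

    class TestExpiration(unittest.TestCase):
        def test_entries_expire_after_ttl(self):
            now = [1000.0]
            cache = Cache(ttl_seconds=60, clock=lambda: now[0])
            cache.put("k", "v")
            self.assertEqual(cache.get("k"), "v")
            now[0] += 61            # advance the fake clock past the TTL
            with self.assertRaises(KeyError):
                cache.get("k")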

When you have outlying but important success/failure cases, they can be verified.

  • similar to the above, even the most exotic cases can be tested (as in the sketch below)
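
One way to force such a case is a mock whose side effect is the rare failure; load_profile and its store here are hypothetical:

    import unittest
    from unittest.mock import Mock

    def load_profile(user_id, store):
        try:
            return store.read(user_id)
        except IOError:
            # the outlying-but-important path: degrade gracefully
            return {"user_id": user_id, "status": "default"}

    class TestFailurePaths(unittest.TestCase):
        def test_store_failure_falls_back_to_default(self):
            store = Mock()
            store.read.side_effect = IOError("disk gone")  # force the failure
            self.assertEqual(load_profile(42, store)["status"], "default")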

Areas under heavy churn can have their chief modes of operation verified.

  • wherever there is significant development, there is the greatest chance of bugs

In cases of complex threading, interleaves can be verified.

  • by mocking mutexes, critical sections, or events, every possible interleave can be simulated (a simple sketch follows)
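
Here’s one simple, deterministic flavor of this, with no real threads: inject a fake lock that records events and assert the locking discipline. The counter is a stand-in:

    import unittest

    class RecordingLock:
        """Fake mutex: records acquire/release order instead of blocking."""
        def __init__(self, log):
            self.log = log
        def __enter__(self):
            self.log.append("acquire")
        def __exit__(self, *exc):
            self.log.append("release")

    class Counter:
        def __init__(self, lock):
            self.lock = lock
            self.value = 0
        def increment(self):
            with self.lock:
                self.value += 1

    class TestLockDiscipline(unittest.TestCase):
        def test_every_increment_is_bracketed(self):
            log = []
            counter = Counter(RecordingLock(log))
            counter.increment()
            counter.increment()
            self.assertEqual(log, ["acquire", "release"] * 2)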

The above are just examples; the idea is that you use the strengths of unit tests, cross-checked against your code, to maximize their value. Even very simple tests that do stuff like “make sure the null cases are handled correctly” are super helpful, because it’s easy to get stuff wrong and some edge case might not run in manual testing. These are real problems that cause breakage in your continuous integration and customer issues in production. Super dumb unit tests can stop many of these problems in their tracks.
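
To make “super dumb” concrete, here’s the whole genre in one tiny test; the function is a stand-in:

    import unittest

    def first_word(text):
        if not text:        # the easy-to-forget null/empty case
            return ""
        return text.split()[0]

    class TestNullCases(unittest.TestCase):
        def test_none_empty_and_normal_inputs(self):
            self.assertEqual(first_word(None), "")
            self.assertEqual(first_word(""), "")
            self.assertEqual(first_word("hello world"), "hello")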

Once you have some coverage, it’s easy to look at the coverage reports and then decide where you should invest more and when. This stuff is great for learning the code and getting people up to speed!

A few notes on some of the other types of testing might be valuable too.

When to Consider Integration Tests

If you can take an entire subsystem and run it largely standalone, or wired out for logging, or anything at all like that really, it creates a great opportunity for what I’ll loosely call “integration testing.” I’m not sure that term is super-well defined actually, but the idea is that more live components can be tested. For instance, maybe you can use real backend servers, or developer servers, and maybe drive your client libraries without actually launching the real UI — that would be a great kind of integration test. Maybe this is done with test users and a stub UI; maybe it’s some light automation; maybe a combination. These test configurations can be used to validate large swaths of code, including communication stacks and server reconfigurations.

It’s important to note that no amount of unit testing can ever tell you if your new server topology is going to work. You can certainly verify that your deployment is what you think it is with something kind of like a unit test (I can validate my deployment files at least), but that isn’t really the same thing.
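
That kind-of-like-a-unit-test might look like this; the file path and schema are hypothetical:

    import json
    import unittest

    REQUIRED_SERVICE_KEYS = {"name", "host", "port"}

    class TestDeploymentFile(unittest.TestCase):
        def test_topology_is_well_formed(self):
            with open("deploy/topology.json") as f:   # hypothetical path
                topology = json.load(f)
            for service in topology["services"]:
                missing = REQUIRED_SERVICE_KEYS - service.keys()
                self.assertFalse(missing, f"missing keys: {missing}")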

Something less than your full stack, based on your real code, can go a long way toward validating these things, perhaps without having to involve the UI or other systems that add unnecessary complexity and/or failure modes to the test.

This level of testing can also be great for fuzzing, which is a great way to create valid inputs and/or stimuli that you didn’t think of in unit tests. When fuzzing finds failures, you may want to add new validations to your suite if it turns out you have a big hole.
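
A bare-bones sketch of such a fuzz loop, with a trivial stub standing in for the real subsystem entry point:

    import random

    def parse_request(data):
        # stub standing in for the real subsystem entry point
        if not data.startswith(b"REQ"):
            raise ValueError("bad header")

    def fuzz(iterations=10_000):
        for _ in range(iterations):
            data = bytes(random.randrange(256)
                         for _ in range(random.randrange(256)))
            try:
                parse_request(data)
            except ValueError:
                pass    # cleanly rejecting bad input is fine
            # any other exception, crash, or hang is a finding

    if __name__ == "__main__":
        fuzz()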

When to Consider End to End Tests

Again, it would be a mistake to think that these tests have no place in a testing ecosystem. But like unit tests, it’s important to play to their strengths. Do not use them to validate internal algorithms; they’re horrible at that. They don’t give you instant API-level failures near the point of failure; instead there’s complex logging and whatnot to dig through.

But consider this very normal tools example: “I am adopting these new linker flags for my main binary.” In that situation the only kind of test that can help is an end-to-end test. Probably none of the unit tests even build with those optimizations on, and if they did they would not be any good at validation anyway; the flags probably behave differently with mini-binaries. Whereas a basic end-to-end suite of “does it crash when I try these 10 essential things” is invaluable.
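
A sketch of such a suite; the binary name and the scenarios are hypothetical:

    import subprocess
    import unittest

    SCENARIOS = [
        ["--version"],
        ["--help"],
        # ...the rest of the "10 essential things" go here
    ]

    class TestSmoke(unittest.TestCase):
        def test_essential_scenarios_do_not_crash(self):
            for args in SCENARIOS:
                result = subprocess.run(
                    ["./main_binary", *args],    # hypothetical binary
                    capture_output=True, timeout=60)
                self.assertEqual(result.returncode, 0, args)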

Likewise, performance, power, and efficiency tests are usually end-to-end, because the unit tests don’t tell us what we need to know.

Fuzzing may have to happen at this level, but it can be challenging to do all your fuzzing via the UI. Repro steps can get harder and harder as we go down this road as well. External failures that aren’t real problems are also more likely to crop up.

Conclusion

It’s no surprise that a blend of all of these is probably the healthiest thing. Getting to greater than 0% unit test coverage universally (i.e., for all classes) is a great goal, if only so that you are then ready to test whatever turns out to be needed.

Generally, people report that adding tests consistently finds useful bugs, and that getting them out, while making the code more testable, is good for the code base overall. As long as that continues to be the case on your team, it’s probably prudent to keep investing in tests, but teams will have to consider all their options to decide how much testing to do and when.

One popular myth, that unit tests have diminishing returns, really should be banished. The thing about unit tests is that any given block of code is about as easy to test as any other. Once you have your mocks set up, you can pretty much force any situation. The 100th block of code isn’t harder to reach than the 99th was, and it’s just as likely to have bugs in it as any other. People who get high levels of coverage generally report things like “I was sure the last 2 blocks were gonna be a waste of time and then I looked and … the code was wrong.”

But even if the returns aren’t diminishing, that doesn’t mean tests are free. And bugs are not created equal. So ultimately, the blend is up to you.

I’m a software engineer at Facebook; I specialize in software performance engineering and programming tools generally. I survived Microsoft from 1988 to 2017.