Why Fail-fast is a really good idea most of the time

Rico Mariani
5 min read · Aug 19, 2020

So the usual caveats apply: In the interest of brevity, what I’m going to write is only approximately correct, so do take this all with a grain of salt.

So here I’m talking about Fail-fast in the context of coding, not experimentation. This is the idea that there are many cases where your program should just abort rather than trying to recover. Classic examples of this sort of thing include “exit() or abort() on out of memory” or “out of disk space”. That sort of thing.

People are often surprised by this approach, having been trained to guard every malloc() (for example), but you can actually get a fairly robust system on the fail-fast plan, with maybe less work overall.
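To make that concrete, here is a minimal sketch of what "abort on out of memory" can look like in C. The names xmalloc and join_words are made up for illustration, not taken from any particular codebase; the point is only that callers of the wrapper never see NULL and so have no error path to write or test.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate or abort. Callers never see NULL. */
static void *xmalloc(size_t size)
{
    /* malloc(0) may legitimately return NULL, so ask for at least 1 byte. */
    void *p = malloc(size ? size : 1);
    if (p == NULL) {
        fprintf(stderr, "fatal: out of memory allocating %zu bytes\n", size);
        abort();  /* stop now, before any half-built state escapes */
    }
    return p;
}

/* Hypothetical caller: joins two words with a space. No failure branches. */
static char *join_words(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    char *out = xmalloc(la + lb + 2);   /* two strings + space + NUL */
    memcpy(out, a, la);
    out[la] = ' ';
    memcpy(out + la + 1, b, lb + 1);    /* also copies b's terminating NUL */
    return out;
}
```

A caller just writes char *s = join_words("DON'T", "BUY"); and later free(s). Either the whole string exists or the process is gone; there is no in-between state to reason about.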

Wait, Why Not Just Guard Every Malloc?

The problem here is that you really do have to guard every last malloc, because once they start failing they’ll probably fail in a huge cascade. If you miss some, you’ll likely crash in short order, but by then you’ve run some code. That code is likely to be on a path that hardly ever runs and is not especially well tested (remember, any combination of allocations might be failing here). Worse, you might have partly committed some of whatever is pending somewhere more durable, like disk or a database. While it’s always possible to unwind any operation if you’re sufficiently careful, in practice it can be astonishingly hard to do so, and it is likely to require a lot more resources. A simple example will help.

Just Consider The Display… For Starters…

When I first started making this case to my own team I was actually working on browser tech. I was trying to explain that the fast crash was likely to save us security issues because we would have fewer cases of limping along in some unknown or at least badly tested state. That argument was pretty compelling, but many people were still thinking along these lines: “memory could come back, it often does, we should hang in there and resume when we can”. That works for, say, kernels and databases because they can always unwind, but it doesn’t work so well for most applications. I gave this example:

We’re in the middle of drawing. Everything is going fine, we’ve drawn most of the screen. We have a stock ticker symbol, a recent quote, we even drew the word “BUY” but oops we ran out of memory. No worries, it’ll be fine. We’re pros, we can continue without crashing, we’ll just draw the word “DON’T” in a few minutes when memory is back. What could possibly go wrong?

Of course “DON’T BUY” is just one simple example of what might be wrong with the display at any given moment. In general you can’t afford to draw some of the frame and present it as a finished product. It might be a reactor monitoring system or something equally important. Crashing is better… at least they’ll know something is wrong.

The problem here, of course, is that there are very few (zero?) programs that can actually undo their display list. And this is true for a lot of data structures in most applications. While it is possible to two-phase-commit anything, in practice nobody actually does, because in most cases it’s entirely impractical. That means if you proceed after an event like out of memory you’re likely to find your system in uncharted territory, and the pixels might be the least of your problems.

But Crashing Is Bad!

Crashing is bad, but it isn’t the worst thing you can do by any means. Data corruption, privacy breaches, improper processing that looks correct, all of these are actually far worse than a crash. Poor cleanup can easily result in security vulnerabilities, especially if the bad cases can be forced by an attacker. In contrast, crashes lead to inconvenience, not breaches. So the idea is: “stop, before we do anything worse.”

Mitigating The Crashes

You can do more than just crash. The fail-fast approach really works best if it’s tied with other strategies like:

  • With process isolation, you let a worker process die and then some manager spins it back up again automatically, hopefully cleaner (maybe a slow leak is mitigated like this). Lots of web servers work like this.
  • With resource monitoring, you disable more and more of your application as resources get low, so that it stops being willing to do anything expensive. Maybe you just harden the “save” path, so in the end you get to save and nothing else until memory is abundant. Lots of apps work like this.
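The process isolation idea can be sketched in a few lines of POSIX C. This is only an illustration under simplifying assumptions: run_supervised, worker_fn, and flaky_worker are hypothetical names, the worker runs in a forked child, and a real manager would also want restart backoff, logging, and EINTR handling.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical worker signature: returns 0 on success, may abort() on trouble. */
typedef int (*worker_fn)(int attempt);

/* Runs the worker in a child process and restarts it if it dies or fails.
 * Returns the attempt index at which the worker succeeded (0 means no
 * restart was needed), or -1 if it never succeeded within max_restarts. */
static int run_supervised(worker_fn worker, int max_restarts)
{
    for (int attempt = 0; attempt <= max_restarts; attempt++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;                /* could not even start a worker */
        if (pid == 0)
            _exit(worker(attempt));   /* child: do the work, report status */

        int status = 0;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            return attempt;           /* clean exit: we are done */
        /* Crashed or failed: the child's memory is gone with it.
         * Loop around and start a fresh worker. */
    }
    return -1;
}

/* Hypothetical worker that fails fast on its first two attempts. */
static int flaky_worker(int attempt)
{
    if (attempt < 2)
        abort();                      /* simulate an unrecoverable state */
    return 0;
}
```

With these definitions, run_supervised(flaky_worker, 5) gives up on the first two workers and succeeds on the third. The manager never has to understand what went wrong inside the worker; it only has to notice that the worker is gone.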

Both of the approaches above are in some sense better than checking everything, because you can make the customer experience materially better without having to be perfect, and you still don’t have to deal with all kinds of bizarre states your system might be in.
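The resource monitoring idea might look something like the sketch below. Everything in it is an assumption made for illustration: the mode names, the operations, and especially the thresholds. How you measure free memory is platform-specific, so here the caller simply passes the number in.

```c
#include <stddef.h>

/* Hypothetical operating modes, from healthy to nearly out of memory. */
typedef enum { MODE_FULL, MODE_NO_EXPENSIVE, MODE_SAVE_ONLY } app_mode;

/* Hypothetical operations the application can perform. */
typedef enum { OP_SAVE, OP_EDIT, OP_RENDER_PREVIEW } app_op;

/* Assumed thresholds; real values would come from measurement. */
#define LOW_WATER      ((size_t)64 * 1024 * 1024)
#define CRITICAL_WATER ((size_t)16 * 1024 * 1024)

/* Pick a mode from the amount of free memory the caller observed. */
static app_mode choose_mode(size_t free_bytes)
{
    if (free_bytes < CRITICAL_WATER)
        return MODE_SAVE_ONLY;
    if (free_bytes < LOW_WATER)
        return MODE_NO_EXPENSIVE;
    return MODE_FULL;
}

/* Returns 1 if the operation may be started in this mode, else 0.
 * Saving is always allowed; it is the one path worth hardening. */
static int op_allowed(app_mode mode, app_op op)
{
    switch (mode) {
    case MODE_FULL:         return 1;
    case MODE_NO_EXPENSIVE: return op != OP_RENDER_PREVIEW;
    case MODE_SAVE_ONLY:    return op == OP_SAVE;
    }
    return 0;
}
```

The application checks op_allowed(choose_mode(free_bytes), op) before starting work, so it refuses expensive operations up front instead of discovering halfway through that it cannot finish them.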

Fundamentally, this is the problem with the “guard everything” approach. You really do have to guard everything and it’s not easy. Kernels do it, databases do it, you could do it, but it’ll cost you. Guarding only part of your state is quite tricky. For instance, suppose you have a nice persistent store, say SQLite or something. SQLite will give you good consistency, but if the rest of your system isn’t coordinated with it, then you can’t necessarily recover to an overall state that’s consistent even though SQLite did its job. If your in-memory state or your side-state on disk is out of sync, you might find yourself in trouble even after a restart. You might find yourself wishing you persisted nothing but the database and just exited cleanly when things were going off the rails. That’s going to be a lot easier to test and maintain. Could you coordinate more? Yes you could. Is this going to be easy? Not even remotely easy.

Summary

So, in short, the main benefits: you write less code, you don’t have to try to handle any of the “half done” states, you use your persistence and recovery story to give the best experience you can to your customers, and maybe you add some error prevention logic to help keep the number of crashes down. Overall this tends to lead to applications that are more reliable and have fewer catastrophic problems, because they tend to fall on their sword if things are getting weird rather than muddling along in that weird and/or broken state.

Rico Mariani

I’m an Architect at Microsoft; I specialize in software performance engineering and programming tools.