Why Fail-Fast Is a Really Good Idea Most of the Time

Wait, Why Not Just Guard Every Malloc?

The problem here is that you really do have to guard every last malloc, because once allocations start failing they’ll probably fail in a huge cascade. If you miss some, you’ll likely crash in short order anyway, but by then you’ve run some code. That code is likely to be a path that hardly ever runs and is not especially well tested — remember, any combination of allocations might be failing here. Worse, you might have partly committed some of whatever was pending somewhere more durable, like a disk or a database — and while it’s always possible to unwind any operation if you’re sufficiently careful, in practice it can be astonishingly hard to do, and it tends to require still more resources at exactly the moment you have none. A simple example will help.
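Before that example, though, it’s worth seeing how little code the fail-fast alternative needs. Here is a minimal C sketch (xmalloc is just a conventional name for this kind of wrapper, not anything from a particular codebase): one choke point that stops the program the moment an allocation fails, so none of those rarely-tested recovery paths exist at all.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Fail-fast allocator: if malloc fails we stop right here, before any
       partially-initialized state can leak into the rest of the program. */
    static void *xmalloc(size_t size)
    {
        void *p = malloc(size);
        if (p == NULL) {
            fputs("out of memory, giving up\n", stderr);
            abort();    /* fail fast: there is no cleanup path to get wrong */
        }
        return p;
    }

    int main(void)
    {
        /* Callers can assume the allocation succeeded; there is no
           rarely-run error path for a cascade of failures to wander into. */
        char *buffer = xmalloc(64);
        strcpy(buffer, "hello");
        puts(buffer);
        free(buffer);
        return 0;
    }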

Just Consider The Display… For Starters…

When I first started making this case to my own team I was actually working on browser tech. I was trying to explain that the fast crash was likely to save us security issues, because we would have fewer cases of limping along in some unknown, or at least badly tested, state. That argument was pretty compelling, but many people were still thinking along these lines: “memory could come back, it often does, we should hang in there and resume when we can.” That works for, say, kernels and databases, because they can always unwind, but it doesn’t work so well for most applications. The example I gave was the display itself.
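Roughly the situation, as a minimal C sketch (the drawing routines here are hypothetical stand-ins, not real browser code): an allocation fails partway through a repaint, the “careful” early return fires, and the screen is left half old frame and half new, a state nothing was designed for and nothing ever tests.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical drawing routines, standing in for real rendering code. */
    static void draw_background(void) { puts("background cleared"); }
    static void draw_text(const char *text) { printf("text drawn: %s\n", text); }

    /* The "graceful" failure path: we bail out, but by then the old content
       is already erased and the new content never arrives. */
    static int repaint(const char *text)
    {
        draw_background();                  /* the old frame is gone here */

        char *layout = malloc(64 * 1024);   /* scratch space for text layout */
        if (layout == NULL)
            return -1;                      /* "recover"... to what, exactly? */

        draw_text(text);
        free(layout);
        return 0;
    }

    int main(void)
    {
        if (repaint("hello") != 0)
            fputs("repaint failed; the screen is now stale and wrong\n", stderr);
        return 0;
    }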

But Crashing Is Bad!

Crashing is bad, but it isn’t the worst thing you can do by any means. Data corruption, privacy breaches, improper processing that looks correct, all of these are actually far worse than a crash. Poor cleanup can easily result in security vulnerabilities, especially if the bad cases can be forced by an attacker. In contrast, crashes lead to inconvenience, not breaches. So the idea is: “stop, before we do anything worse.”

Mitigating The Crashes

You can do more than just crash. The fail-fast approach really works best when it’s paired with other strategies, like these:

  • With process isolation, you let a worker process die and then some manager spins it back up again automatically, hopefully in a cleaner state (a slow leak, for instance, gets reset this way); lots of web servers work like this (there’s a rough sketch of the pattern after this list).
  • With resource monitoring, you disable more and more of your application as resources get low, so that it stops being willing to do anything expensive; maybe you harden just the “save” path, so in the end you can save and do nothing else until memory is abundant again; lots of apps work like this.
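As a sketch of the first bullet, here is roughly what a bare-bones process-isolation supervisor looks like in C on a POSIX system (the "./worker" program name is a placeholder): the worker is free to fail fast, even abort(), and the manager simply starts a fresh copy.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Bare-bones supervisor: run the worker, and whenever it exits (including
       via abort() from a fail-fast allocation), start a fresh one.  Real
       servers add logging, backoff, and a cap on restart attempts. */
    int main(void)
    {
        for (;;) {
            pid_t pid = fork();
            if (pid < 0) {
                perror("fork");
                return 1;
            }
            if (pid == 0) {
                /* Child: become the worker ("./worker" is a placeholder). */
                execl("./worker", "worker", (char *)NULL);
                perror("execl");
                _exit(127);
            }

            int status = 0;
            if (waitpid(pid, &status, 0) < 0) {
                perror("waitpid");
                return 1;
            }
            fprintf(stderr, "worker exited (status %d), restarting\n", status);
            sleep(1);   /* crude backoff so a crash loop doesn't spin the CPU */
        }
    }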

Summary

So, in short, the main benefits: you write less code, you don’t have to try to handle any of the “half done” states, you use your persistence and recovery story to give the best experience you can to your customers, and maybe you add some error-prevention logic to keep the number of incidents down. Overall this tends to lead to applications that are more reliable and have fewer catastrophic problems, because they fall on their sword when things are getting weird rather than muddling along in that weird and/or broken state.

Rico Mariani

I’m a software engineer at Facebook; I specialize in software performance engineering and programming tools generally. I survived Microsoft from 1988 to 2017.