A few days ago I was doodling a concept in Paintbrush for OS X. Eventually I needed a slightly bigger area to draw on, so I navigated through the menus and entered a new size. As soon as I hit return, I realised I had typed the wrong size. The new area was way too big. But no worries, right? I can just hit cmd+Z to undo the last operation.
As it turns out, in that particular version of Paintbrush, following a canvas resize with an undo command will crash the entire application.
But this isn't a complaint about Paintbrush, specifically. All software crashes. Paintbrush just happened to be the last one I remember.
Why does software crash on silly things like that? Nobody wants their applications to crash when the user issues an undo command. The simplest explanation is probably that a change somewhere in Paintbrush's code introduced a bug that only shows up when a canvas resize is immediately followed by an undo. The bug wasn't discovered because the people who tested the application didn't think of quickly undoing a canvas resize operation.
Sure, you can probably write more unit tests. You can do more code review. You can write better code. But at the end of the day, software will still crash. We software developers are only human. We are incredibly fallible.
That's why we get software that crashes. That's why we will always have software that crashes. In this realisation there is also salvation.
Do you know what happens when the graphics driver crashes in newer versions of Windows (Vista and later)? The screen flickers black, and then a popup in the lower-right corner informs you that your graphics driver just crashed, but that it has been restarted. An absent-minded user wouldn't even notice it happen. How!? What's the magic? It's simple: Windows was programmed under the assumption that all software crashes.
"Crash-only software" is the name we give to software that was designed with crashes in mind. Crash-only software is meant to handle any crash gracefully, which also means as a side-effect that crashing a crash-only application is a valid way of terminating it. There is no special ritual to quit the application – you can even issue a kill -9 command would you feel like it.
It does take time and effort to handle all crashes gracefully, though. Your application needs to be structured around it, and it will probably suffer a performance penalty. Remember that "crash" includes the computer suddenly losing power, so you can't rely on anything that lives only in memory, such as the contents of your variables. Whatever has to survive a crash must already be on disk when the crash happens.
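To make the point about variables concrete, here is a minimal sketch in Python of keeping state in a file rather than only in memory. The function names save_state and load_state are invented for the example, and it assumes the state is small enough to rewrite in full each time. It relies on os.replace swapping the new file in as a single step, so a crash in the middle of a save should leave either the old file or the new one, never a half-written mix.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Replace the file at `path` with `state` (as JSON) in one atomic step."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, so the final rename
    # never has to cross filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk first
        os.replace(tmp_path, path)  # then swap the new file into place
    except BaseException:
        # The save failed part-way: remove the temp file, keep the old state.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # A more thorough version would also fsync the directory itself.


def load_state(path, default):
    """Read the last saved state, or return `default` if nothing was saved yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

The flush and fsync on every save are also a fair example of the performance penalty: each save waits for the disk instead of returning as soon as a variable has been updated.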
Despite the time and effort, though, the idea is that instead of spending 100 hours on preventing crashes and failing to prevent all of them anyway, you can spend 50 hours on accepting that your program will crash, and handling all unexpected crashes gracefully.
I'll take slightly slower and slightly more expensive software that doesn't fail badly over fast and feature-rich software any day of the week.
Fault-tolerant software isn't just for space stations and telephone switches. It's time to get it to consumers as well.