Two Wrongs

Quality in Software Development, Part 3: Accidents

Picking up where we left off, I promised to rant about accidents now.

Accidents Are Symptoms of Broken Systems

The key to handling accidents well is to realise that humans do not produce accidents intentionally. Humans mostly do not act out of malice, and it’s generally a safe assumption that nobody is in the business of manufacturing accidents.

People, on average, do not wake up in the morning and think, “How nice it would be if I went into work today and destroyed some stuff.”

In fact, humans are, at their core, just input–output machines. Sure, we are complex input–output machines, but input–output machines none the less. People act (outputs) according to their training and the best of their abilities and understanding of the situation they are in (inputs.)

In this light, “human error” starts becoming a funny term. What it really means is “incorrect inputs”: garbage in, garbage out.

In fact, if you dig deep enough, and people trust you enough to be completely honest with you, you’ll find that there is a good reason for everything people do. It might not look good in hindsight, of course, but in their mind, at that time, it was the best thing they could think of. It optimised for all the variables they knew of and valued at the time.

It might have turned out to be wildly wrong, and in that case, an investigation should focus on what made them think it was a good thing to do in that situation. Do two different error messages look very similar? Were they under other pressures, like a looming deadline? Was there an emergency that meant they had to take shortcuts? Did the situation in fact appear entirely normal and safe when it wasn’t?

What’s deceptive about this is that in hindsight, incorrect decisions look obviously incorrect. When an operator has deviated from procedure and caused an accident, it is easy to blame them for it. “We have these procedures for a reason. You should follow them.”

Interestingly, if an operator follows procedure when they should have deviated, it is equally easy to blame them for it! “You should not blindly follow the normal procedure in an emergency situation. Don’t be so stiff and rigid.”

I think it is fair to say that procedures are not a silver bullet for preventing accidents. Something else that doesn’t work is reprimanding the operator, or even “talking to them”.[1] The action points of an incident analysis should never be “Steve to talk to Jane about being more careful in the future.”

What works is this: Look at what inputs the human received and how they could be misinterpreted. Figure out what other competing pressures existed. See if additional training is necessary. If it is a particularly perilous task, can it be automated away?
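
As one illustration of that last point, here is a hypothetical sketch of what “automating away” a perilous task could look like: the steps an operator would otherwise perform by hand under pressure are encoded in a script, with the ordering and the checks built in. Every hostname, path, and command below is invented for the sake of the example.

    #!/usr/bin/env python3
    # Hypothetical sketch: the restart-a-worker procedure encoded as a
    # script instead of a checklist the operator follows by hand under
    # pressure. Every hostname, path, and command here is invented.
    import subprocess
    import sys


    def run(cmd: list[str]) -> None:
        """Run one step; abort the whole procedure if it fails."""
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)


    def restart_worker(host: str) -> None:
        # 1. Take the node out of rotation before touching it.
        run(["ssh", host, "touch", "/var/run/drain"])
        # 2. Give in-flight work time to finish (fixed delay in this sketch).
        run(["sleep", "30"])
        # 3. Restart the service.
        run(["ssh", host, "systemctl", "restart", "worker"])
        # 4. Only put the node back once a health check passes.
        run(["ssh", host, "curl", "--fail", "http://localhost:8080/health"])
        run(["ssh", host, "rm", "/var/run/drain"])


    if __name__ == "__main__":
        restart_worker(sys.argv[1])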

Analysis of Accidents

A single failure rarely causes an accident. Often, multiple safeguards have to fail for an accident to occur.

A useful way to think of it is through operational limits. A software system generally has some quantities that must remain in a particular range. For example, a web service might need to respond within 2 seconds. A message bus might need to be able to forward at least 2 million messages per second, but not more than 5 million (in order not to overload the downstream consumers.) A database must answer a query with the exact thing that was stored in it.[2]

There are multiple controls (to borrow the term from CAST) in place that serve to maintain these operational limits.

[2] Or it must not. This is how MongoDB was invented.
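
To make that concrete, here is a minimal sketch of one such control for the response-time limit above: a probe that measures how long the service takes to answer and complains when it exceeds the two-second limit. The URL and the “alerting” (a print statement) are placeholders; a real control might page someone or trigger a failover.

    # Sketch of a control for the response-time limit above. The URL and
    # the alerting (a print statement) are placeholders.
    import time
    import urllib.request

    RESPONSE_TIME_LIMIT = 2.0  # seconds -- the operational limit


    def probe(url: str) -> float:
        """Measure how long one request to the service takes."""
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=10) as response:
            response.read()
        return time.monotonic() - start


    def check(url: str) -> None:
        elapsed = probe(url)
        if elapsed > RESPONSE_TIME_LIMIT:
            # A real control would page someone or trigger a failover;
            # printing stands in for that here.
            print(f"limit violated: {url} took {elapsed:.2f} s")
        else:
            print(f"ok: {url} responded in {elapsed:.2f} s")


    if __name__ == "__main__":
        check("https://example.com/")

Such a probe is only one control among several; client-side timeouts, capacity planning, and autoscaling would be others maintaining the same limit.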

Consider an example: a service got OOM killed, that is, the operating system killed it because the machine ran out of memory, and customers complained about the outage. It is tempting to start the analysis with “the machine should not have run out of RAM”, but that constrains the solution space: upgrade the RAM? Make the application use less RAM? From a systems perspective, the focus on RAM is irrelevant.

Start instead where the accident affected the customer: “the service should have high availability.” That opens up more efficient solutions, such as load balancing with failover between two machines, as sketched below. Several of the seven basic tools of quality, like Ishikawa diagrams and Pareto charts, help with tracing an accident back through the system this way rather than stopping at the first convenient cause.
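
Here is a minimal sketch of the failover idea, assuming two interchangeable machines reachable over plain HTTP (the hostnames are invented). In practice one would more likely put a dedicated load balancer such as HAProxy or nginx in front, but the principle is the same: availability no longer depends on any single machine having enough RAM.

    # Sketch of client-side failover between two machines; the hostnames
    # are invented. A real setup would more likely use a load balancer
    # such as HAProxy or nginx, but the principle is the same.
    import urllib.error
    import urllib.request

    REPLICAS = [
        "http://service-a.internal:8080",
        "http://service-b.internal:8080",
    ]


    def fetch(path: str) -> bytes:
        """Try each replica in turn; fail only if all of them fail."""
        last_error = None
        for base in REPLICAS:
            try:
                with urllib.request.urlopen(base + path, timeout=2) as resp:
                    return resp.read()
            except (urllib.error.URLError, TimeoutError) as error:
                last_error = error  # this replica is down; try the next
        raise RuntimeError("all replicas failed") from last_error


    if __name__ == "__main__":
        print(fetch("/health"))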

Prevention of Accidents

Operators are almost universally given insufficient tools to do the job expected of them. Given the poor state of that tooling, operators around the world are doing a herculean job.

Something that shows up frequently in accidents is that the operator’s mental model of the system differs from reality. Insufficient information leads to incorrect mental models, but so does too much information. At every opportunity, try to update the operator’s mental model to reflect the current state of the system. Quick feedback loops help: the sooner the system reports the actual effect of an action back to the operator, the sooner an incorrect mental model gets corrected. A small sketch of that idea follows.
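
As a sketch of such a feedback loop, imagine an operations command that, after every state-changing action, immediately reads back and displays the actual state of the system instead of letting the operator assume what happened. The service names and the systemctl-based status check below are assumptions made for the illustration.

    # Sketch of an operations command that refreshes the operator's mental
    # model after every action. The service names and the systemctl-based
    # status check are assumptions made for the illustration.
    import subprocess

    SERVICES = ["web", "worker", "scheduler"]


    def show_actual_state() -> None:
        """Read the real state back from the system, never from memory."""
        for service in SERVICES:
            result = subprocess.run(
                ["systemctl", "is-active", service],
                capture_output=True, text=True,
            )
            print(f"{service:>10}: {result.stdout.strip()}")


    def restart(service: str) -> None:
        subprocess.run(["systemctl", "restart", service], check=True)
        # Quick feedback loop: immediately show what the system actually
        # looks like after the action, not what we assume it looks like.
        show_actual_state()


    if __name__ == "__main__":
        restart("worker")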