I initially planned to talk about accidents here, but then I realised that doesn’t make a lot of sense. What good does it do to know how to handle accidents when you don’t even know what an accident is?
Process behaviour is a paradigm (i.e. a philosophy through which to view the world) that might sound obviously correct to you, but a part of you might still resist it firmly. This is simply because we haven’t been trained in it. A common response to the idea of process behaviour is, “Sure, that sounds nice and it makes sense. But it won’t work for my problem.”
It will. It is extremely powerful.
This article is going to be about how to use process behaviour to build quality into a product, and how to tell if you are succeeding. As a part of that, you will have to learn to tell accidents from routine variation.
Variation: Routine and Exceptional
Any process exhibits variation. The amount of rain in an area varies from year to year, and this variation is random. The weather system produces a random amount of rain each year.
Human processes work the same way: a country is a system that produces (among many other things) people who are injured in traffic. In my country, on average 58 people get injured in traffic every week. I’m not saying exactly 58 people get injured in traffic every week, I’m saying some weeks more people get injured, and other weeks fewer. This variation is, for the most part, random.
But! You might interject. Some weeks are definitely much worse than others: the weeks when there are holidays and people drive drunk. You’re onto something.
First, an example from computing. Your download speed might not be a constant 14 MB/s, for example. Under normal circumstances, it will vary a bit around that number.
Maybe to keep track of this variation you are downloading a 100 MB file every day, and you’re looking at how long it takes. Normally, it hovers around 6–8 seconds. If you called your ISP for an explanation every time it took more than 8 seconds, you’d waste a lot of your time getting nothing. Those slow days are just part of the process.
But then one day it takes 12 seconds! Now, surely, you have reason to call. How can we tell?
Assume we have these measurements of the time.
dd <- c(6,9,7,6,6,8,8,7,6,8,10,7,8,7,6,8,6,7,6,8,10,10,8,9,10,8,9,9,12)
We could do something very naughty: we could assume these follow a normal distribution!
Then we can estimate the sample mean
μ <- mean(dd)
and standard deviation
σ <- sd(dd)
And then if a value is more than three standard deviations above the mean, it is unlikely to come from the same sequence.
μ + 3 * σ
That threshold comes out at about 12.6, so for 12, we cannot in confidence call the ISP.
How did we do? Not very well. This is for three primary reasons:
- First of all, there was no reason to assume normality. I know all the other kids in school do that, but they only do that because they don’t know better. Please stop. You know better. More on this later.
- Part of what we are trying to do is establish whether the process is predictable. This means figuring out whether the numbers come from one distribution or not. While doing that, we can’t – operationally or philosophically – assume that they do come from a single distribution!
- Perhaps most convincingly: before we have established that the values are indeed drawn at random from a single distribution, there is signal in the order of the values. This is completely ignored by a classical hypothesis test.
How can we improve the situation?
Look at what I made here: a plot! A simple, plain, timeseries.
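The plot itself is not reproduced here. As a minimal sketch, this is one way such a time series might be drawn in base R, assuming the measurements are stored in the order they were taken. The axis labels and the dashed line at the average are assumptions made for this illustration.

```r
# Download times in seconds, in the order they were measured.
dd <- c(6,9,7,6,6,8,8,7,6,8,10,7,8,7,6,8,6,7,6,8,10,10,8,9,10,8,9,9,12)

# Plot each measurement against its position in the sequence.
plot(seq_along(dd), dd, type = "b",
     xlab = "Day", ylab = "Download time (s)")

# Dashed line at the average, as a visual reference only.
abline(h = mean(dd), lty = 2)
```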
This is good, but it’s not quite where we want to be yet. We can equip this timeseries with more powerful diagnostic tools.
- natural process limits
- how to compute that would have to be a separate article
- in the meantime, use google!
We can see more clearly now that the latest value looks bad. What we cannot read from this is whether it’s part of the process or an exceptional value.
What we will do is compute so-called natural process limits, and add them to the plot. These calculations are designed such that when a quality metric falls outside of these limits, it is likely due to a special cause that should be investigated. Points generally don’t end up outside the natural process limits when they come from the natural variation in the system.
This is called an XmR chart, because it shows you the individual values (X) and the moving Range (mR) of the individual values.
What’s great about XmR charts is that they are non-parametric, as the statisticians say. This means they make no assumption about the underlying data. Do you have numbers? Then XmR charts are valid. There are other types of process behaviour charts, but they are more finicky about their inputs. XmR charts are a safe default.
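For reference, here is a minimal sketch of one common way to compute such limits, assuming the conventional XmR scaling constants 2.66 (for the individual values) and 3.268 (for the moving ranges). The function name `xmr_limits` is made up for this illustration, and the chart discussed in this article may have been drawn with different limits, so treat the numbers this produces as illustrative only.

```r
# Hypothetical helper: natural process limits for an XmR chart,
# using the conventional constants 2.66 and 3.268 (an assumption here).
xmr_limits <- function(x) {
  mr <- abs(diff(x))       # moving ranges between consecutive values
  centre <- mean(x)        # central line of the X chart
  mr_bar <- mean(mr)       # average moving range
  list(
    centre   = centre,
    lower    = centre - 2.66 * mr_bar,  # lower natural process limit
    upper    = centre + 2.66 * mr_bar,  # upper natural process limit
    mr_bar   = mr_bar,
    mr_upper = 3.268 * mr_bar           # upper limit for the mR chart
  )
}

dd <- c(6,9,7,6,6,8,8,7,6,8,10,7,8,7,6,8,6,7,6,8,10,10,8,9,10,8,9,9,12)
lim <- xmr_limits(dd)

# X chart: individual values with central line and natural process limits.
plot(seq_along(dd), dd, type = "b",
     xlab = "Day", ylab = "Download time (s)",
     ylim = range(c(dd, lim$lower, lim$upper)))
abline(h = lim$centre, lty = 2)
abline(h = c(lim$lower, lim$upper), lty = 3)
```

Note that nothing in the function assumes a particular distribution: it only uses the values and the differences between neighbouring values, which is also why the order of the measurements matters to the result.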
We see quickly that the latest measurement can still be considered part of the same process.
However, do you notice that after day xx, all measurements have been above average? This is why the order of measurements matters, and where naive statistical tests fail: this above-average run indicates that something might have changed about the system on or near day xx.
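One way to look for such a run programmatically is sketched below. It uses `rle` to find stretches of consecutive values on the same side of the average. A common rule of thumb treats a run of about eight or more on one side as a signal, but that threshold is an assumption here, not something the data dictates.

```r
dd <- c(6,9,7,6,6,8,8,7,6,8,10,7,8,7,6,8,6,7,6,8,10,10,8,9,10,8,9,9,12)

# TRUE where a measurement is above the overall average.
above <- dd > mean(dd)

# Run-length encode the above/below pattern and locate each run.
runs <- rle(above)
run_end <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1

# Longest stretch of consecutive above-average measurements.
longest <- which.max(ifelse(runs$values, runs$lengths, 0))
run_length <- runs$lengths[longest]
first_day <- run_start[longest]

# Assumed rule of thumb: eight or more in a row suggests a shift.
shift_suspected <- run_length >= 8
```

Here `first_day` is the position at which the longest above-average run begins, which gives a candidate for where to start asking what changed.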
If we compute natural process limits based on all days before xx and plot them the same way, see what happens? The subsequent points lie well outside the limits of the old process. THIS is a strong reason to call the ISP and ask what happened on or near day xx that slowly worsened your download speed. Maybe they’ve connected a newly built neighbourhood to the same gateway and people are moving in?
The contrast between the two approaches is stark: in one case you look at the numbers and you waste time taking action on meaningless noise. In the other, you apply very simple but clever analysis and you detect a real shift in the system that would otherwise have been imperceptible!
Process behaviour charts are a secret superpower for knowing which things matter and which don’t. They are the system whispering to you.
It is important to emphasise that process behaviour charts are not statistical hypothesis tests. These charts tell you nothing about probabilities and distributions. Process behaviour charts are just meant as visual guidance to help you understand how the process is behaving. The computation of natural limits is not based on mathematical axioms: these calculations are chosen because they have shown practical benefit over many decades of use. If you think they make your chart too sensitive, or not sensitive enough, tweak it. This is your window into the process, and if it’s not showing you the right thing, move it!
Do exercise some common sense when moving the window. The natural process limits are the process telling you what it is capable of. The process does not care one bit about whether or not you like those limits. If you move them simply because they paint an unflattering picture of your process, you’re only lying to yourself and making your own job impossible. The only valid reason to move the limits is that your experience shows that your system has a narrower or wider range of normal operation than indicated by the limits.
Why does this matter again?
Not only is taking action on normal variation a waste of time: it can be actively harmful, too. By trying to calibrate parameters of the process based on normal variation, you tend to amplify the swings of the process variation. In the worst case, you end up in a vicious cycle where you make more and more desperate and drastic corrections to compensate for the increasingly high values of natural process variance, which are just the system’s responses to your corrections.
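A small simulation can make this concrete. The sketch below assumes a process whose output is a fixed target plus independent random noise, and a hypothetical operator who, after every measurement, shifts the setting by the full size of the last deviation. Both the noise model and the adjustment rule are assumptions chosen for illustration.

```r
set.seed(1)
n <- 10000
target <- 0

# Routine variation: independent noise (uniform here; any shape would do).
noise <- runif(n, min = -1, max = 1)

# Process left alone: output is just the target plus noise.
left_alone <- target + noise

# Tampered process: after each output, move the setting to "compensate"
# for the deviation that was just observed.
tampered <- numeric(n)
setting <- target
for (i in seq_len(n)) {
  tampered[i] <- setting + noise[i]
  setting <- setting - (tampered[i] - target)
}

# Ratio of variances: values above 1 mean the adjustments made things worse.
var(tampered) / var(left_alone)
```

In this setup each output ends up carrying its own noise plus the previous step’s noise with the opposite sign, so the variance comes out at roughly twice that of the process left alone.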
Once you start looking at systems through the lens of normal variation and special causes, your view of the world will not be the same again. Many bad outcomes are in fact NOT singular terrible events, but a natural consequence of the process. Politicians are especially bad when it comes to this: their entire job is basically to take normal variation and through rhetoric exploit it to rally people behind them. Imagine e.g. a rehabilitation program for criminals that is generally successful. It might still produce 15% relapsing criminals, but this is better than the 25% in the general prison population. If someone told me it produced 0% relapsing criminals, I would know they are lying. But then maybe one of those 15% does something tragic. Imagine how quickly politicians would get up on a soapbox and talk about restructuring or even shutting down the rehabilitation program. Even though all the evidence they have is that it works as designed: 15% relapsing criminals.
To recap, the guide to building quality into a process is this:
- Track quality metrics on XmR charts – you now know anything at all about your process.
- Remove all exceptional causes of variation – you now have a predictable process.
- Change process to reduce variance for better predictions.
- Change process to reduce mean value.
Quality by Design
- quality requires planning
- if someone tells you that specs and requirements elicitation aren’t agile, they are misguided
- agile is about reacting to change
- without a spec there is nothing that can change
- the very concept of change assumes there is something that has changed
- this something is a spec, however informal
- the earlier a problem is discovered, the cheaper it is to fix
Improvement of Quality
- you cannot improve what you don’t measure
- focus on current results, not daydreams about future
- be very extra super careful you are measuring the right thing
- this ties in to quality by design: design your measurement process, don’t ad hoc it
- optimise for learning: fast cycle times, record results, reflect
- also known as PDSA: plan-do-study-act
Special and Common Causes
- first, root out special causes
- then optimise variation