Two Wrongs

Statistical Process Control: A Practitioner’s Guide

There are two ways to present statistical process control: one for the practitioner, and one for the statistician. This is the practitioner’s version. In this version, the reason behind some things will be given as, “People have relied on these tools to increase quality and decrease costs for over 80 years. Trust me, it works.”

That reason might not sit right with some of you, so try to rest assured that there is sensible statistical theory behind it, too. I just won’t go through that now.

What Is Statistical Process Control?

Statistical process control (spc) is a robust framework for separating signal from noise. It is worth learning because it is easy, and it gives you superhuman ability to put your effort in where it’s valuable. It has made me, personally, more productive than I could have dreamt of being before it. It helps me be a better manager for my team, by not sending them on unnecessary wild goose chases, and pointing out areas for genuine improvement.

A part of statistical process control is also a particular outlook on life, a resigning to the fact that all real-world events are governed to a large extent by noise, randomness, and luck, and our ability to influence individual outcomes is limited.

Here are some questions that we will look at through the lens of spc throughout this article:

  • How long does it take my team to finish work on a bug fix?
  • What are our cloud costs?
  • How many sales calls does Alice make per week?

You will recognise that there is not a single answer to any of these questions. How long it takes to fix a bug depends on the bug. What the cloud costs are varies from month to month. Similarly, Alice makes many calls one week and fewer another.

Bad Comparisons

In order to understand measurements coming from the questions above, we must look at each measurement in context. First off, let’s look at three incorrect – but extremely common – ways to do this.

Compared Against Previous

A very common mistake people make is comparing a measurement to the last measurement of the same thing. Here’s a weekly report on Alice’s sales calling performance:

  • This week: 75 calls
  • Last week: 84 calls
  • Change: -11 %

If Alice has a bad manager, I can hear what they’ll be saying already:

You’ve had a rough week, Alice. What’s wrong? Have you been slacking off? Your numbers are down. How can we help you? You gotta improve your numbers. We aim for excellence here. We want to see continuous improvement.

Of course, you and I know better. Alice will very rarely make the exact same number of calls each week. It will always be different. Half the time it will be more, half the time it will be less. Sometimes the difference will be large, and sometimes it will be small. This report contains no useful information.

Note that much of the news reporting you read or see takes precisely this form. “There was noise in data again! Read all about it!” (At this point it’s a running gag in my family to point at news and call out “bad comparison” if they’re making a big fuss about noise between two consecutive measurements.)

Under management pressure, Alice will spend time to make up an explanation for her bad numbers this week compared to last. If she’s unlucky, she might even believe it herself, wasting even more effort in trying to correct a problem that doesn’t exist.

We, as humans, waste a lot of time explaining noisy measurements. Knowing spc, you can do better.

Compared Against Average

Another common mistake is to compare measurements against an average of measurements of a similar kind. Here’s a report of our cloud costs:

  • Last month: $31,583
  • Twelve-month average: $29,801
  • Excess: $1,782

The bad manager goes off again:

This is unacceptable! Last month, we spent above average on cloud resources. We need to switch to a cheaper provider. If this happens again, I will personally come down there and look over the expense report in detail. Next month, I want every last cent accounted for and explained!

Again, you and I know better. A statistic known as “average” is intentionally designed to fall in the middle of the range. (Exactly where depends on what kind of average you’re talking about: arithmetic mean, median, geometric mean, etc. While it is not strictly mathematically true that averages are in the middle, in practice we find them there often enough that thinking of them that way can help understand the effect of natural variation, or noise, in data.) Roughly half of your measurements will be above average, and the other half below it. (This is also not strictly true for averages in general. See the appendix for a brief discussion.) Again, this report contains no information at all.

This is also something you often see in news: “the number of criminals in our city is more than the nation-wide average for similar cities! Doom and disaster!” If you pick a statistic that is designed to have around half of the measurements worse than it, you can make news out of anything.

People will waste a lot of time reporting on the details of what may well be a fully normal month of cloud costs, to make a manager feel better but accomplishing nothing of value at all.

Compared Against Specification

This last one is the most insidious, because it’s not always wrong. Sometimes, it’s correct. And to some people, the difference is subtle. This report was automatically triggered, based on specification limits:

  • Bug: Occasional invalid encoding in output of block node 5
  • Engineer: Bob
  • Resolution time: 4 days
  • Target resolution time: 3 days

What does the bad manager say?

So, Bob. We need to talk about your performance. Our target is to fix bugs in three days. Do you know how long it took you to fix the last bug you worked on? You need to work on your speed. Can we assign you a mentor? Should you be moved to a different team?

Say it with me: you and I know better. The specification limit of 3 days is drawn out of thin air. There is nothing about the bug fixing process that says it shouldn’t take more than three days. Sometimes it takes one day, and sometimes it takes 19 days. But someone, at some point, commanded that “this but no further” and that, somehow, is supposed to make bug fixes go faster?

Specification limits, in cases like these, are wishful thinking. You cannot make something better just by wishing for it. Again, we are interpreting noise; we classify the noise as bad because it exceeded a number someone pulled out of their ass. There’s no information in the report above, except perhaps that Bob fixed a bug.

But there’s a second kind of specification limit! The typical example is when you are machining a component that should fit into an assembly of some sort. You need the component to be of the right dimensions so it actually fits. This is a valid usage of specification limits, although it also has some problems, so I would avoid it. (Specifically, just because something is in spec doesn’t mean it’s good. The specification limits are usually fairly wide, and components that are just inside them can easily cause maintenance problems down the line anyway, even if they technically fit at the time of assembly. And two components at opposite sides of their specification limits might not fit well together even at assembly, despite both being in spec individually! Also, by being binary true-or-false tests, specification limits don’t help you improve your processes. You’re either fine or you’re in trouble; there’s no middle ground.)

Then there’s a third kind of specification limit. These can more reasonably be called natural laws. For example, if you arrive late to the train you will miss your meeting. There’s a definitive threshold at which you must be at the train station, or the train will leave without you. This is the fully acceptable use of a specification limit. If you break this type of specification limit, there will be a bad consequence that is out of your control. (Note that if you have a contract with a customer that specifies that unless you solve their bugs in three days, they can invoke a penalty clause that forces you to pay them millions of dollars, then the example we opened with, that bugs must be solved in less than three days, becomes a natural law rather than an arbitrary specification limit.)

You might recognise other names for the wishful thinking type of specification limit. Budgets, goals, and targets fall into the same category. If someone says, “We are going to increase revenue by 10 % this year!”, the obvious question is, “What happens if we don’t?” Usually nothing. It’s just wishful thinking. Similarly, you might go, “I will not spend more than $650 this year on car repairs.”, …but what if you have to? Wishful thinking.

Wishful thinking never improves anything, though it has potential to make things worse.

A Stable Process

Let’s zoom out and get some more context for Alice’s sales calls. These are the last 8 weeks of sales calls for Alice.

86 96 65 101 90 70 85 75

Humans generally have trouble understanding large tables of numbers, so a very effective first tool is to just plot them in a run chart:

spc-prac-01.svg

It’s clear they go up and they go down – there’s noise in the number of calls Alice makes. This noise is called common cause variation in statistical process control. It’s variation contributed by causes common to all data points. In other words, if all data points are affected by the same noise, there’s no statistical way to tell them apart. If measurements are determined solely by noise – by common cause variation – we can say that all measurements are the same, in the sense that two dice are the same, even if one happened to land on a 2 and the other on a 4.

I’ll press a bit further on this point, because it is central to understanding spc. If Susan and Joe each throw a die, and Susan’s lands on a 2 and Joe’s lands on a 4, you wouldn’t say that Joe’s die was better than Susan’s. Neither would you call Joe a more skilled dice thrower than Susan. Although they are throwing physically different objects, we say that Susan and Joe had the same dice. They used the same process to throw them. Joe simply got lucky.

In the same way, you can’t really say that Alice has better or worse days for sales calling – they’re all the same in the sense that two dice are the same; she just gets more or less lucky.

Key Insight

You cannot judge the process by a single outcome. Only a long run of outcomes tells you anything about the process. Consequently, once you know what the process is like, any single outcome from that process adds nothing to your knowledge about the process.

Prediction

From the run chart, we can tell that Alice’s call numbers hover around 80. The evidence we have allows us to say, with at least a little confidence, that any given week, Alice will make 80 calls, plus/minus noise, or common cause variation. This is already more than we could say when we just compared against last week’s numbers!

What more can we say? Well, we know that roughly half the time Alice will make more calls than the previous week, and half the time she will make fewer. And about half the time she will make more calls than her average, about half the time fewer.

But! This is where it gets interesting.

Imagine that Alice’s manager is not happy with her performance, and sets a goal that she has to make 90 calls per week. We can tell ahead of time that if Alice just keeps doing her best, like she has up to this point, she will struggle to meet this goal. We can say this because we know that roughly half of her weeks she makes fewer than 80 calls. A goal of 90 would only demotivate Alice, by asking of her something she clearly is unequipped to do. (You might disagree, but bear with me here. We might get to your complaint later in the article. If we don’t, please email me and I will try to collect responses into a follow-up faq.)

Even a goal of 70 calls would be an insult to Alice. Sure, she would fulfill it most of the time, but every week she does it would be pure chance. It’s like setting a goal that whenever you throw a die it must land on a 3 or above. Sure, it will happen most of the time. Will it make you feel accomplished? The specific number of calls Alice makes is due to common cause variation. She will meet or miss the goal, be happy or sad, based on a throw of the dice every week.

We know all of this because Alice has quite steadily made 80 calls, plus/minus common cause variation, per week. So unless she changes something fundamental about how she works, this is what she will continue to do. The specific number of calls any given week is determined by the common cause variation. It’s completely random. There is no information in that number.

Process Changes

Look, I’m not saying Alice will forever make 80 calls per week. You could teach her techniques that improve her skill, which may allow her to work more efficiently and make more calls! What I’m saying is that setting a goal of 90 calls will not magically make her a better salesperson. That takes training.

You might interject that “I’ve been in that situation, and I can promise you Alice would start making more calls if she was given that goal – even with no additional training.”

Let’s look closer at that. There are three ways that can happen:

  • Alice might lower the quality to be able to improve quantity. This could be bad. It could also be desirable. Maybe Alice thought you preferred quality over quantity, when in fact it was the other way around. Next time, you can tell Alice that without setting any goals.
  • Alice could increase her call rate by sacrificing lunch, staying late, avoiding coffee breaks, and so on. Then she will go to the pub and tell all her friends about how shitty her boss is for forcing her to work so hard, and advise anyone not to work where she works. Eventually she burns out, quits, and you have to train her replacement all over. Is that really an improvement worth having?
  • Alice can fudge the numbers. She could dial up random numbers and hang up on them immediately, just to make her numbers look better. Is that going to bring revenue to the business? Is it going to make Alice happier?

Goals are wishful thinking, and do not on their own improve things – they only make things worse.

Amount of Common Cause Variation

Now we know what common cause variation is. We measure it with process behaviour charts, also known as control charts.

There are many types of process behaviour charts (and with some statistical knowledge, you might be able to reach for the specialised ones in the right context), but the one you can almost always rely on to work is the individuals chart, or XmR chart.

Here’s how you make it for a time series, like Alice’s weekly call numbers. Start by plotting the values in a run chart, as before.

spc-prac-01.svg

Then, draw a line at the mean (average) value of that data: \(\overline{\mathrm{X}} = 84\).

spc-prac-02.svg

Now, compute the absolute differences between consecutive data points:

X    86   96   65  101   90   70   85   75
mR        10   31   36   11   20   15   10

The upper row contains Alice’s call numbers. The lower row has the differences between adjacent numbers. (These differences are sometimes called moving ranges, hence the name XmR chart: it is based on the X values themselves and their moving ranges.) The difference between the first two numbers (86 and 96) is 10, the difference between the next two (96 and 65) is 31, and so on.

Compute the mean of the consecutive differences: \(\overline{\mathrm{mR}} = 19\).
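The two steps so far can be reproduced in a few lines of Python. This is only a sketch using Alice’s numbers from the table above; the variable names are illustrative and not part of any library.

```python
# Alice's weekly call counts, as listed in the table above.
calls = [86, 96, 65, 101, 90, 70, 85, 75]

# Moving ranges: absolute difference between each value and the one before it.
moving_ranges = [abs(b - a) for a, b in zip(calls, calls[1:])]

# Mean of the values and mean of the moving ranges.
x_bar = sum(calls) / len(calls)
mr_bar = sum(moving_ranges) / len(moving_ranges)

print(moving_ranges)  # [10, 31, 36, 11, 20, 15, 10]
print(x_bar, mr_bar)  # 83.5 19.0
```

The unrounded mean is 83.5, which the text rounds to 84.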

Here comes the magic. Compute the lower and upper natural process behaviour limits (also known as lower and upper control limits) as

\[\mathrm{LPL} = \overline{\mathrm{X}} - 2.66 \times \overline{\mathrm{mR}},\] \[\mathrm{UPL} = \overline{\mathrm{X}} + 2.66 \times \overline{\mathrm{mR}}.\]

Concretely, in the case of Alice, we get

\[\mathrm{LPL} = 84 - 2.66 \times 19 = 33,\] \[\mathrm{UPL} = 84 + 2.66 \times 19 = 135.\]
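As a sketch, the whole computation fits in one small Python function. The function name `xmr_limits` and the constant name are made up for this illustration. Note that the code does not round the mean before applying the formula, so its upper limit comes out as 134 rather than the 135 obtained above from the rounded mean of 84.

```python
XMR_CONSTANT = 2.66  # the fixed multiplier from the text; not meant to be tuned


def xmr_limits(values):
    """Return (mean, lower limit, upper limit) for an individuals (XmR) chart."""
    if len(values) < 2:
        raise ValueError("need at least two values to form a moving range")
    mean = sum(values) / len(values)
    # Absolute differences between consecutive values.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return mean, mean - XMR_CONSTANT * mr_bar, mean + XMR_CONSTANT * mr_bar


calls = [86, 96, 65, 101, 90, 70, 85, 75]
mean, lpl, upl = xmr_limits(calls)
print(f"mean={mean:.1f} LPL={lpl:.1f} UPL={upl:.1f}")
# mean=83.5 LPL=33.0 UPL=134.0
```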

The number 2.66 seems like a magic constant, and it is. There’s a little information in the appendix on where it comes from. The important thing is that you use specifically 2.66, no more, and no less. It’s very important, because people will ask you to use something different, and you must not oblige. (See below about The Voice of the Process.)

Add to your chart lines for these process limits.

spc-prac-03.svg

These natural process limits indicate the range of sales call numbers we can expect from Alice, assuming she doesn’t change anything fundamental about how she works. The process limits indicate the amount of week-to-week variation in her work process. They are a measure of the amount of common cause variation, the amount of noise.

In other words, you should not be surprised if Alice makes 35 calls one week, because that’s within the process limits. It’s just like rolling snake eyes when throwing dice. Rare, sure, but it does happen. It’s fully within the range of outcomes you should expect of the process. Similarly, some weeks she might get lucky and make 128 calls – nothing extraordinary has changed that week; Alice just got lucky within her regular process.

Alice’s call numbers are an easy case to deal with: they make up a stable process. We can predict her future performance based on the past. The name stable process is a bit confusing because a stable process is one where the outcome is completely random. A stable process is one where we cannot predict any individual value, but we know the range into which almost all values will fall.

In traditional spc literature, stable processes are known as in (statistical) control or controlled. What is meant by this is that we have already controlled for all the significant external factors, and there is nothing left we can control to determine the outcome. We just have to let the process run on its own and produce what it is tuned for.

When you have a stable process such as this, you don’t have to re-compute the process limits each week. One of the defining features of a stable process is that any given week, statistically, looks like any other week. Because of this, you can just extend the process limits you have already computed indefinitely into the future.

The Voice of the Process

You can probably guess what Alice’s manager would say upon seeing these process limits.

What? Those limits are too wide! I can’t have a sales person that does anywhere between 30 and 130 calls per week. That’s unmanageable. Besides, 30 is just so bad. Will you re-compute those limits with a lower multiplier than 2.66? This is just not usable.

The harsh truth for Alice’s manager – and a lot of people – is that you don’t get to pick the process limits. The process limits are the system trying to communicate to you what it is actually capable of. You have to accept these limits because anything else is delusion. You can either listen to the voice of the process and get wiser, or ignore it and look like a fool.

Importantly, the system couldn’t care less about what you wish it was capable of. The voice of the process will only ever tell you what it is capable of, no more, no less. This is a deeper point than I have time to expand on here. When in doubt, find out what happens in practice and listen to it. Don’t get blinded by wishful thinking.

Improving Stable Systems

Let’s say you’re Alice’s manager and you’re unhappy with these limits. Instead of rejecting them, you accept them. But you also want to improve on Alice’s performance. What do you do?

There are two ways we can improve a stable system:

  • improve the mean, or
  • reduce the variation.

Your instinct will probably be to improve the mean. After all, that’s the most direct indicator of overall performance level.

I would urge you to focus on reducing variation first. There are two reasons for this that stand out. First of all, variation has a direct cost in and of itself that people underestimate. (It makes planning harder, obviously. If other people depend on your work, which in Alice’s case might be a fulfillment department, then bursty progress makes it harder for them to keep up too. The further down the chain we go, the worse the variation gets. In the end, delivery might be late for some customers because the shipping department has capacity planned for roughly even load, and what they got was very bursty.) Second, when variation is high, it is difficult to recognise improvements to the mean. With tight variation, it is easier to see even small shifts in performance.

Since the outcomes of a stable system are like throws of dice, you cannot improve a stable system by taking action on individual outcomes. If you try to get better at throwing dice by doing something different when you get a 2 and trying to keep doing what you’re doing when you get a 5, you will be disappointed. In the real world, you will actually make things worse. (See the funnel experiment for a very primitive illustration of this.)

Any improvement to a stable system requires following statistical trends, asking experts for advice, trying things out, and verifying underlying shifts in process behaviour.

To be abundantly clear: you cannot improve a stable system by tampering with individual outcomes. The only thing tampering accomplishes is destabilising the system. A stable system must be improved as a whole.
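A small simulation can make the cost of tampering concrete. The sketch below is in the spirit of the funnel experiment, under assumptions that are mine and not from the data in this article: the noise is drawn from a normal distribution purely for convenience, the target of 84 and spread of 15 are made-up numbers, and “tampering” is modelled as the manager pushing next week’s effort in the opposite direction by exactly as much as this week missed the target.

```python
import random
import statistics


def simulate(n_weeks, tamper, target=84.0, noise_sd=15.0, seed=1):
    """Simulate weekly outcomes of a stable process, optionally with tampering.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    adjustment = 0.0  # the manager's accumulated "corrections"
    outcomes = []
    for _ in range(n_weeks):
        outcome = target + adjustment + rng.gauss(0.0, noise_sd)
        outcomes.append(outcome)
        if tamper:
            # React to the single outcome: compensate next week by exactly
            # as much as this week deviated from the target.
            adjustment -= outcome - target
    return outcomes


hands_off = simulate(10_000, tamper=False)
tampered = simulate(10_000, tamper=True)

print("spread when left alone:", round(statistics.stdev(hands_off), 1))
print("spread when tampered:  ", round(statistics.stdev(tampered), 1))
```

Under this particular compensation rule, each week’s outcome carries both its own noise and the negated noise of the week before, so the spread grows by a factor of roughly the square root of two. Other tampering rules behave differently, but the direction of the effect is the point.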

Here’s an example of what might happen if Alice’s manager starts trying to react to individual weeks:

spc-prac-04.svg

It looks like an improvement at first, and Alice’s manager will be proud. But then we see the effect was not a consistent improvement, it was just that the variation increased – the system is destabilised somewhat. Tampering with individual results often causes variation to increase. (Note that here it even increased to the point where the lower limit is 0: whereas we previously could be pretty sure Alice would make at least 30 calls, with the new micromanaging style of her supervisor, we should no longer be surprised at no calls being made at all. Don’t worry, she will have a good excuse.)

If you can’t tamper (the favourite activity of managers all over the world), what can you do? Improvements to a stable system must be driven by theory. You cannot just react to events – you must build a thorough understanding of what goes into the system and adjust things on as fundamental a level as possible. You need to approach your process as a scientist would a foreign life form: with curiosity and a burning desire to learn everything about it.

As you can guess, improving a stable system is difficult. Most people don’t bother. But there are massive rewards to be had here, even though it takes a little intellectual work.

Let’s say Alice’s manager takes our advice and thinks long and hard about how to improve the sales process. Maybe they find out that the salespeople aren’t really learning from their own experiences – they get stuck in their ruts. So the manager institutes a knowledge sharing programme, where the salespeople get together once a week and share what they’ve learned with each other, pass on tips about prospects, and so on. Alice’s numbers might instead turn out to be

spc-prac-05.svg

It’s too early to tell, but this could be a start of an improved level of performance for Alice. The rest of the salespeople may have a similar improvement as well, since by improving the system you improve the results for everyone.

Unstable Systems

So far, we have only looked at stable systems. What characterises stable systems is that their outcome is determined by noise, or common cause variation. This means we can’t know exactly how a particular measurement will work out, but we can be reasonably certain that it will fall within a predictable range.

Sometimes that’s not all we’re looking at, though.

Variation From Assignable Causes

We go back to the example with cloud costs, but again zoom out a little to get more context.

spc-prac-06.svg

As with Alice’s sales calls, we see that our cloud costs are higher some months, and lower some months. There’s noise in there. We can also see that the noise is smoother (technically, autocorrelated). This makes sense: if our cloud costs are high one month, it’s likely they’ll also be a little high next month, and vice versa.

Is there any cause for alarm here? The first 12 months look like a somewhat stable process, so let’s compute process limits from those, and slap those limits onto the chart, to turn it into a process behaviour chart.

spc-prac-07.svg

The process behaviour chart has determined that there are two months where our costs exceeded the natural process limits. How do we interpret that?

The process limits indicate the range of numbers which we can expect to see just from common cause variation alone. The peaks in months 13 and 18, in other words, are so extreme that it’s unlikely they come from common cause variation alone. The stable process that generated the costs in all other months is probably not the same process as that which generated the costs in months 13 and 18.

In other words, we suspect there are similar but slightly different processes that generated the costs for months 13 and 18 (and maybe even 12 and 14). These different processes are trying to masquerade as the regular one, but the process behaviour chart revealed it.

If we ask the infrastructure team about months 12–14, they will quickly tell us that one of their biggest users had an event around that time, which required more cloud resources than usual. So now we know what process (unusual event) generated the costs of months 12–14, and it was indeed slightly different from the ordinary one.

We ask about month 18 too. They investigate a little, and find out that someone forgot to turn off a bunch of test instances that were no longer needed. Again, a slightly different system than the regular one generated the costs for month 18, and the process behaviour chart revealed it. (This is one of the superpowers of statistics in general, and spc in particular: you can tell if something is off with data you’re looking at, without knowing anything about the underlying processes.)

This has illustrated one of the signal detection tests you can do with process behaviour charts: if any measurement is outside the natural process limits, it’s likely to belong to a category known as assignable causes of variation, i.e. there is a specific thing you can point at and say that “this made it happen”.
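The outside-the-limits test can be sketched in Python as follows. The monthly costs below are hypothetical numbers invented for this example (they are not the data behind the chart), and the function names are made up. As in the text, the limits are computed from the first 12 months only and then applied to every month.

```python
XMR_CONSTANT = 2.66


def xmr_limits(values):
    """Return (mean, lower limit, upper limit) for an individuals (XmR) chart."""
    mean = sum(values) / len(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return mean, mean - XMR_CONSTANT * mr_bar, mean + XMR_CONSTANT * mr_bar


def points_outside_limits(values, baseline_length):
    """Return 1-based positions of values outside limits set by the baseline."""
    _, lpl, upl = xmr_limits(values[:baseline_length])
    return [i for i, v in enumerate(values, start=1) if v < lpl or v > upl]


# Hypothetical monthly cloud costs in dollars, made up for illustration.
monthly_costs = [
    29100, 29900, 30400, 29600, 28900, 29500,
    30200, 30800, 30100, 29400, 29000, 29700,  # months 1-12: baseline
    32400, 31000, 29800, 29300, 30000, 31900,  # months 13-18
]

print(points_outside_limits(monthly_costs, baseline_length=12))  # [13, 18]
```

The function only says where to look. Finding out what actually happened in those months still takes a conversation with the people who run the process.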

What’s the point of this? We’ll get to that after we’ve talked about baseline shifts.

Baseline Shifts

What about Bob and his four-day bug fix? Again, we zoom out and look at it in context.

spc-prac-08.svg

First of all, check out how often we break that arbitrary three-day target! About 15 % of the time. And if this is a stable system, then we can expect that we will always break that three-day target about 15 % of the time. Does a good manager set a target that they can tell ahead of time the team will not meet 15 % of the time? Probably not.

Is it a stable system? Get them natural process limits computed.

spc-prac-09.svg

It is almost a stable system. Bug #10 seems to be an outlier. If we ask the development team about it, we will probably learn that it was special in some way we could have foreseen if we had known to look for it. If we control for that, the system is stable.

Another signal detection test we can do with process behaviour charts is see if there are 9 points in a row on the same side of the average. And indeed, this has happened: starting at bug #34, there are nine bugs in a row that took longer than average. This usually indicates a smaller shift in baseline performance. Could something have happened there?

We look at the detailed record of bugs fixed, and we notice that the development team started working on more bugs in parallel around bug #34 – and this, of course, makes each bug take longer to fix. If we ask about it, we learn that there’s management pressure to take bugs off of the open queue faster.

Again, the process behaviour chart allowed us to detect a signal that we might not otherwise have noticed. Incredibly, the tight specification limit of three days also did not reveal this shift.
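The nine-in-a-row test described above can be sketched like this. The resolution times and the centre line of 2.0 days are hypothetical values chosen for the example, and the function name is made up. A value exactly on the centre line is treated as breaking the run, which is an assumption on my part.

```python
def find_long_runs(values, centre, run_length=9):
    """Return 1-based start positions of runs of `run_length` consecutive
    values that all fall on the same side of `centre`."""
    starts = []
    side, count = 0, 0
    for position, value in enumerate(values, start=1):
        # +1 if above the centre line, -1 if below, 0 if exactly on it.
        current = (value > centre) - (value < centre)
        if current != 0 and current == side:
            count += 1
        else:
            side, count = current, (1 if current != 0 else 0)
        if count == run_length:
            # Report each qualifying run once, at the point it reaches nine.
            starts.append(position - run_length + 1)
    return starts


# Hypothetical bug resolution times in days, made up for illustration.
resolution_days = [
    1, 3, 2.5, 1, 1.5, 3, 1,                  # bugs 1-7: both sides of centre
    2.5, 3, 2.2, 4, 2.1, 3, 2.6, 2.4, 3.5,    # bugs 8-16: nine in a row above
    1,                                        # bug 17: back below
]

print(find_long_runs(resolution_days, centre=2.0))  # [8]
```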

Improving Unstable Systems

Now we know how to

  • recognise stable systems (variation is within natural process limits);
  • predict future outcomes of stable systems (individual outcomes impossible to predict, but we can be reasonably confident of a range of possible outcomes);
  • improve stable systems (it’s complicated and takes understanding); and
  • recognise unstable systems (variation exceeds natural process limits, or has excessive runs that are higher or lower than the average).

Can we predict future outcomes of unstable systems?

No.

In an unstable system, anything can happen. Predictions are meaningless at best, and outright dangerous in some cases. There are no statistical regularities you can count on. Thus, the first order of business when you have an unstable system is to stabilise it. This is the only improvement you can make with an unstable system.

Fortunately, improving (stabilising) unstable systems is conceptually simple (though not always easy!). Find the assignable causes of variation and baseline shifts. These are almost always caused by something specific. Figure out what that specific thing is, and either remove it if possible, or control for it, by e.g. detecting it ahead of time and accounting for it so that everyone is prepared when it happens.

Once you’ve removed or controlled for the assignable causes of variation, you have a stable process. Now you can extrapolate performance into the future, or start working on systematic improvement of variation and baseline. But if you have variation from assignable causes, or baseline shifts, the first thing to do is get rid of those so you get a reproducible process. Only then can you start talking about systematic change in a sensible way.

Other Types of Instability

In addition to the two signal tests above, it can also pay off to informally look for any pattern or predictability in the measurements. For example, strictly alternating high and low values means there are probably assignable causes of variation there. Other types of cyclic behaviour sometimes also stand out, such as every sixth measurement being higher than the others. Steadily increasing or decreasing measurements will usually trigger an out-of-limits test eventually, but sometimes you can catch it quicker by looking directly for it.

There is also a large set of formal signal detection tests available for process behaviour charts. If you look it up, you will likely come across references to the Western Electric rules (Statistical Quality Control Handbook; Western Electric Co.; 1958).

Using many rules is often missing the point: the idea is not to detect all signals – if you try to do that, you will treat a lot of routine, common cause variation as signal, and you will need to investigate way more events than you have time for. The idea is to balance false positives with false negatives. Using the two rules I’ve stated above, you hit a good balance.

Use as few rules as you can get away with while still having something to investigate.

Other Topics

There are some other things that are worth mentioning, to prevent confusion or misunderstanding. Here they are, in no particular order.

Common mistakes

The absolute most common mistake people make is acting on noise as if it was a signal. Trying to fix a stable process by tampering with individual outcomes does not improve things, and long-term likely makes them worse. It destabilises the process, increases variation, and makes the outcome less predictable, not more.

The opposite mistake also sometimes happens: some people pretend their system is stable, even though they have obvious, unaccounted-for variation from assignable causes. If there is assignable cause variation, you don’t have a process to improve – you have multiple interleaved processes you need to disentangle first and handle separately.

When people learn about process behaviour charts, one of their instincts is often to transform the data before plotting it. This is a mistake. The process behaviour charts I’ve explained here are robust for most kinds of data you will find in real life. (There are exceptions! Especially in the software industry, because software processes are notoriously unstable (multi-modal) to the point where process behaviour charts become difficult to interpret.) Transforming the data before plotting it is likely to hide signals.

Similarly, you should avoid thinking of your data as coming from a theoretical probability distribution. If you start with that assumption, you’re likely to miss important signals. After all, signals (in the spc sense) are your process trying to tell you that it’s really multiple processes interleaved. If you’re assuming a single theoretical probability distribution, you’re assuming away the very thing you’re looking for!

Not Only Timeseries

Process behaviour charts are not useful only for time series. They’re very common for time series, but you can also apply them to other things. Let’s say that the average monthly travel expenses as reported by seven people are

Person    Average monthly travel expenses
Steve     $532
Gloria    $424
Celine    $329
Robert    $475
Kim       $190
Charlie   $490
Fred      $539

If we want to know whether one person spends significantly more or less than all the others, we have too few data points to just compute the limits across all seven people. What we can do is create one process behaviour chart for each person, where that person is excluded from the limit computation.

If we do this for the travel expenses, most plots will be unremarkable – except the one for Kim, shown below.

[Figure: spc-prac-10.svg]

In other words, creating process behaviour charts for non-timeseries is not a problem. The one thing to look out for is how you have ordered your data. For example, data that are sorted from smallest to largest will not work, because we need the differences between consecutive data points to reflect the point-to-point variation of the process. A random order always works. (When dealing with data points attached to people, a common trick is to order the data points alphabetically by name. This is in practice the same as ordering them at random, since we can think of names as randomly assigned to people, but it looks nicer in a report.)
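The leave-one-out computation described above can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: it assumes the limits are the mean plus and minus 2.66 times the mean moving range, it keeps the people in the order they are listed in the table, and the function names `limits` and `leave_one_out_signals` are made up for this example.

```python
from statistics import mean

# Average monthly travel expenses, in the order given in the table above.
expenses = {
    "Steve": 532, "Gloria": 424, "Celine": 329, "Robert": 475,
    "Kim": 190, "Charlie": 490, "Fred": 539,
}

def limits(values):
    """Process limits, assumed here to be mean +/- 2.66 * mean moving range."""
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    centre = mean(values)
    spread = 2.66 * mean(moving_ranges)
    return centre - spread, centre + spread

def leave_one_out_signals(data):
    """Names whose value falls outside limits computed from everyone else."""
    signals = []
    for name, value in data.items():
        others = [v for n, v in data.items() if n != name]
        lower, upper = limits(others)
        if not lower <= value <= upper:
            signals.append(name)
    return signals

print(leave_one_out_signals(expenses))
```

With this ordering, Kim is the only person who lands outside the limits computed from the other six. A different ordering gives different moving ranges, and so somewhat different limits.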

Examples of Stable Systems

You find stable systems everywhere.

  • The change in revenue in a company is usually fairly stable from quarter to quarter. There is nothing you can learn from what it is this quarter in particular, and improving the quarterly revenue increase cannot be done by incentivising individual quarters. It takes whole-system improvement.
  • The monthly availability of a software service is often fairly stable. Just patching over symptoms from individual outages will not improve things. Improvement takes thorough understanding of the systems involved.
  • The daily number of bus departures that get cancelled.
  • The yearly amount of rain in a location (stable, but at very different levels and with very different levels of variability depending on where you are on the planet!)
  • The number of first-time criminals every month in a society is usually fairly stable. Effectively, we are tuning society to produce a certain number of criminals per month. We shouldn’t be surprised by fluctuations in this number. Improving involves deeper understanding and whole-system changes. Punishing individual criminals leads to no long-term improvement.

Appendix A: Theoretical Notes

Most of everything above is a simplification and lacking in theoretical rationale. I won’t go into all details because I don’t want to write a book, but there are some points that I think are more important.

Why Are Stable Processes Random?

Early in the article, we said that a process cannot be judged by individual outcomes, and that once we know what the process is like, individual outcomes teach us nothing. This is of course not strictly true: there’s just a diminishing amount of information we get from further observations. Once we have 20–30 observations, yet another observation from the same process is incredibly unlikely to deviate much from the observations we already have. This depends a little on what type of process we’re looking at, as well as what the time scales are, and whether there is any aggregation involved.

It might sound confusing that the outcomes of a process in control are determined by chance. As a reminder, in control means that we have extracted all of the powerful predictors from the process. What remains is a myriad of small factors affecting the outcome this way and that way. Since they are numerous and small, they all sort of cancel out and leave only a small residual effect, which is the noise we’re finding.

More concretely, in the case of Alice, we can imagine any number of factors that affect how many sales calls she manages to do:

  • which colleagues she runs into at the coffee machine,
  • the number and duration of bathroom breaks she needs during the day,
  • what type of lunch she has,
  • whether or not she needs to take a personal call during the day,
  • bad sleep the night before,
  • time for exercise in the morning,
  • a spat with her partner,
  • sick children,
  • a flat tyre on her way to work,
  • a hangover,
  • seeing a very good movie the night before,
  • the first call is an annoying prospect,
  • her ip telephony bugs out,
  • annoying noises outside her office,
  • a doctor’s visit she needs to sneak in during the day,

and so on. All of these, and many more, affect her call rate in small ways. Some positively and some negatively. We cannot account for all of them, in part because we can’t even imagine what some of them might be, or in which direction their effect goes. We simply lump all of them together into one category, and call it common cause variation. We account for it by assuming it will always be there, and everything taken together, it will be roughly constant unless Alice changes something fundamental about her process.

The above also leads to something known as the report card effect: if you try to aggregate too many physical processes into one summary metric, that metric will always be a stable process, meaning it loses its power as an indicator of when something goes wrong. You must look into processes in reasonable detail in order to have meaningful metrics. If you summarise too many things into one number, you average out all the useful signals into noise.

Why Not Standard Deviation?

We measure the common-cause variation in a very particular way: we take the average of the consecutive differences between outcomes. In other words, we measure the point-to-point variation quite literally as the difference between consecutive points in our data. Not only is this an intuitive way to quantify it; it is also very robust against patterns in the data, such as cyclic data.

A common response when learning about the consecutive differences is to suggest another common measure of dispersion: the global standard deviation. The problem with the standard deviation is that it measures how spread out the entire data set is. By looking at the dispersion of the entire data set, we are looking at variation that includes assignable cause variation – we are overestimating the point-to-point variation.

We are interested in knowing the component of dispersion that comes from point-to-point variation of common causes. We capture this more accurately by looking specifically at differences between consecutive points, rather than the difference between each point and the global mean.
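To make the difference concrete, here is a small sketch with made-up data: a process that sits around 10 and then shifts to around 20 halfway through. The numbers are invented for illustration. The moving ranges are divided by 1.128, the conversion factor discussed in the next section, so the two measures are on the same scale.

```python
from statistics import mean, stdev

# Made-up data: the level shifts from about 10 to about 20 halfway through.
before = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10]
after = [20, 21, 19, 20, 22, 20, 19, 21, 20, 20]
values = before + after

def sigma_from_moving_range(xs):
    """Estimate sigma from the mean absolute consecutive difference."""
    moving_ranges = [abs(b - a) for a, b in zip(xs, xs[1:])]
    return mean(moving_ranges) / 1.128

global_sd = stdev(values)                      # spread of the whole data set
local_sigma = sigma_from_moving_range(values)  # point-to-point variation

print(f"global standard deviation: {global_sd:.2f}")
print(f"sigma from moving ranges:  {local_sigma:.2f}")
```

The global standard deviation comes out at roughly 5.2, because it absorbs the shift between the two levels. The moving-range estimate stays at roughly 1.6: only one of the 19 consecutive differences crosses the shift, so the assignable cause barely affects it.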

Magic Constant 2.66

This leads nicely into where the magic constant 2.66 comes from.

While we don’t want to measure dispersion as the standard deviation of the observed values, because the values we’ve observed may come from different interleaved processes, the general idea of the standard deviation is still useful. It’s useful because of Chebyshev’s inequality, which tells you the largest fraction of samples that can theoretically fall outside a given multiple of the standard deviation – regardless of the underlying distribution.

The way the computation for the standard deviation is constructed, Chebyshev’s inequality guarantees that no more than 11 % of samples fall outside of three standard deviations – in the absolute worst case. The closer the distribution is to normal, the fewer the observations that will fall outside of three standard deviations. When the distribution is normal, this will be just 0.3 %.
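Both percentages are easy to check. The sketch below computes the Chebyshev bound, which is 1/k² for k standard deviations and holds for any distribution with finite variance, and the two-sided tail of the normal distribution using the complementary error function from the Python standard library.

```python
import math

k = 3  # number of standard deviations

# Chebyshev: at most 1/k^2 of the samples lie more than k sigma from the mean.
chebyshev_bound = 1 / k**2

# Normal distribution: two-sided tail probability beyond k sigma.
normal_tail = math.erfc(k / math.sqrt(2))

print(f"worst case (Chebyshev): {chebyshev_bound:.1%}")
print(f"normal distribution:    {normal_tail:.2%}")
```

This gives about 11.1 % for the worst case and about 0.27 % for the normal distribution, which rounds to the 0.3 % quoted above.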

In the long history of spc, most practitioners and statisticians have found three standard deviations to be a good balance between false positives and false negatives, for most kinds of data.

The problem is that we can’t measure the standard deviation of our observations directly, because that would involve assuming that all observations came from the same process, which is the question we’re trying to answer. We do have the mean of the absolute value of the consecutive differences, but that isn’t the standard deviation.

However, the mean absolute value of the consecutive differences is roughly 1.128 times the standard deviation! This depends on the process, and as you can see from the following draws from theoretical processes, that’s not always the value it converges to. But it’s the case often enough that we use it anyway.

[Figure: spc-prac-11.svg]

In particular, 1.128 overestimates the convergence point for heavy-tailed distributions. (You can see it begins already at the exponential distribution, which by definition converges at 1.000, and then for subexponential distributions like Lognormal(1,1) it just gets worse.) This means for heavy-tailed distributions, we will set our limits too close to the mean. However, with heavy-tailed distributions, we are also more prone to see values closer to the mean (this is indeed why we end up underestimating the variation), which, very informally, cancels out part of the problems of overestimating the convergence point.

So we have the following relationship:

\[\sigma = \frac{\overline{\mathrm{mR}}}{1.128}\]

Thus, the limits we’re after are

\[3\sigma = \frac{3}{1.128} \overline{\mathrm{mR}} = 2.66 \overline{\mathrm{mR}}.\]
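The ratio between the mean moving range and the standard deviation can be checked with a rough simulation. This is a sketch under my own assumptions: a sample size of 50,000, a fixed seed, and a made-up helper name `mr_to_sigma_ratio`. It draws from a standard normal and from an exponential distribution with rate 1, and it also prints the resulting constant.

```python
import random
from statistics import fmean, pstdev

def mr_to_sigma_ratio(draws):
    """Mean absolute consecutive difference divided by the standard deviation."""
    moving_ranges = [abs(b - a) for a, b in zip(draws, draws[1:])]
    return fmean(moving_ranges) / pstdev(draws)

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 50_000

normal_ratio = mr_to_sigma_ratio([rng.gauss(0, 1) for _ in range(n)])
exponential_ratio = mr_to_sigma_ratio([rng.expovariate(1) for _ in range(n)])

print(f"normal:      {normal_ratio:.3f}")
print(f"exponential: {exponential_ratio:.3f}")
print(f"3 / 1.128 =  {3 / 1.128:.2f}")
```

The normal draws should land close to 1.128 and the exponential draws close to 1.000, with some simulation noise in the last digits. The final line prints 2.66.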

Appendix B: What is an average?

A reader objected to this article claiming that the average is in the middle of the data, because that’s not a mathematically true statement. They raised some good points which go beyond the basic criticism, so I feel it worthwhile to address them here along with some background context.

First, what is an average, more generally? The word average is not well defined. It’s commonly taken to be a synonym for arithmetic mean, sometimes even in error, where other measures of averageness would be more appropriate – commonly the median. If we take the broadest view I can think of, then average is a synonym for associative mean. The idea of an associative mean is the result of any function \(m = M(x_1, x_2, x_3, ...)\) where we can replace any value \(x_i\) with \(m\) and it doesn’t change the equation. This would result in \(m\) being, in some sense, representative of all data.

But! Notably, this property holds also when \(M\) is the maximum function! If we replace any \(x_i\) with the maximum of all \(x_i\) the maximum does not change. So the maximum value is a kind of associative mean. So “an average”, in the most general, associative sense, is not at all guaranteed to be near the middle of the data – it could just as well be an extremal value.

And even if we limit ourselves to the arithmetic mean, there’s no guarantee it is close to the middle of the data either (whatever we mean by “middle of the data”). The same reader brought up an example of a distribution which is just barely stable in the spc sense, but still has a mean that is four times larger than the median. (I would expect a stable distribution to have around 99-point-something percent of its points fall within the limits of an XmR chart, but for this distribution it was only 98 %.) Apparently that type of data exists, but I think it is rare in practice.

All of that to say: in this article we’re not aiming for mathematical correctness, we are going for intuition. When we hear about some data in relation to its average, e.g. on the news, or in a company-internal status report, the average (whichever average is used) will very often be near the middle of the historical data. This is not a mathematical statement, it is an empirical one. It has been true in my life, it has been true in the life of Deming, and I would be surprised if it was not true also in your life.

The reason I claim that the average is near the middle of the data is that many people think of the average as prescriptive. They imagine that the average is a value that expresses what things ought to be like. By instead stressing that the average is descriptive, I’m trying to force a change of perspective that is beneficial for reasoning about variation, or noise.

In other words, it was never meant as a mathematical statement, but rather a suggestion of a behavioural and cognitive improvement. As a writer for practitioners, I believe a little slop in rigour makes the point more effectively.

Referencing This Article