Two Wrongs

Extreme Value Statistical Process Control

This is not an article that will teach you something. This is an article about something we – as humans – don’t know. We don’t know how to apply statistical process control to extreme value processes.

This is particularly relevant to software development, because in software we see a lot of extreme value processes.

Extreme value processes can be recognised visually

The difference between thin-tailed and extreme value processes is probably most easily illustrated by run charts. This first chart shows the number of weekly deployments to production in an organisation. This is a stable process affected only by routine variation, and it is a thin-tailed distribution.

ev-spc-01.svg

Once the product is deployed, we perform a smoke test by issuing 50 requests to it and recording the response times. We get the following measurements.

ev-spc-02.svg

This is also a stable process affected only by routine variation, but there’s something different about it. In many ways it is similar to the deployment frequency (similar means, a similar number of measurements below 15, etc.) but there is one important difference: in the deployment counts the largest single measurement accounts for about 5 % of the total (and hence of the mean), whereas in the response times it accounts for about 20 %.
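To make the comparison concrete, here is a minimal sketch in Python, run on simulated stand-ins rather than the measurements behind the charts above: a Poisson-ish series for the deployment counts and a log-normal series for the response times (both distributions and their parameters are assumptions chosen only for illustration). It computes how much of the total the single largest value accounts for in each case.

    import numpy as np

    rng = np.random.default_rng(42)

    # Hypothetical stand-ins: weekly deployment counts (thin-tailed) and
    # smoke-test response times in milliseconds (heavy-tailed).
    deployments = rng.poisson(lam=25, size=50)
    response_times = rng.lognormal(mean=3.0, sigma=1.0, size=50)

    for name, xs in [("deployments", deployments), ("response times", response_times)]:
        total, biggest = float(xs.sum()), float(xs.max())
        share = biggest / total          # fraction of the total contributed by the max
        print(f"{name:>14}: max={biggest:.1f}, mean={float(xs.mean()):.1f}, "
              f"max share of total={share:.0%}")

The exact percentages will differ from the charts above, but the gap between the two kinds of process shows up clearly.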

Statistical process control assumes thin tails

In extreme value processes – even when they are stable – the maximum value observed can be a significant chunk of the total value, or mean. This poses a problem for traditional spc tools, because they assume thin tails. Look at what happens if we create an XmR chart from the response times:

ev-spc-03.svg

Even though there were no changes to the system when these measurements were taken, the XmR chart signals assignable causes of variation. This happens with other types of control charts, like xbar-and-R charts, c-charts, and u-charts.
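For reference, the X chart in an XmR pair places its natural process limits at the mean plus or minus 2.66 times the mean moving range. The sketch below applies that recipe to simulated heavy-tailed data (not the measurements charted above) and lists the points that fall outside the limits even though the process is stable.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.lognormal(mean=3.0, sigma=1.0, size=50)   # stable but heavy-tailed, in ms

    moving_ranges = np.abs(np.diff(x))                # |x[i] - x[i-1]|
    unpl = x.mean() + 2.66 * moving_ranges.mean()     # upper natural process limit
    lnpl = x.mean() - 2.66 * moving_ranges.mean()     # lower natural process limit

    signals = np.where((x > unpl) | (x < lnpl))[0]
    print(f"limits: [{lnpl:.1f}, {unpl:.1f}] ms, signals at points {signals.tolist()}")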

Computers, and software, give rise to several extreme value processes like this. We can’t apply traditional spc techniques to study those.

Potential responses

I have collected some responses to this problem, none of which satisfy me so far.

What’s the problem?

Some people will argue that any type of control chart with any data will exhibit some false positives, and that this isn’t a problem. We investigate, we find no cause of assignable variation, and we move on.

The difference in the software world is twofold: one luxury problem, and one characteristic of extreme value processes.

  • The luxury problem is that software generates a lot of measurements. If we have a 1 % false positive rate on response time, and 400,000 requests per day, that’s 4,000 potential signals to investigate every day. Not feasible.
  • The characteristic of extreme value processes that complicates things is that they generate a large number of values near the mean, which narrows the process behaviour limits to a smaller range than we would expect, which in turn means the proportion of false positives also increases.

A potential solution to both of these problems is to take only a small sample from the data: plug twenty of the 400,000 daily response times into a control chart. This reduces the total number of false positives we have to investigate, and if an extreme value shows up among only 20 measurements rather than 100, it will widen the process behaviour limits a little, which also reduces the proportion of false positives.
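Here is a rough sketch of the sampling idea. The 400,000 daily response times and the sample of 20 come from the text above; the log-normal distribution, its parameters, and the hypothetical helper function are assumptions made only to have something to run.

    import numpy as np

    def xmr_signals(x):
        """Indices of points outside the XmR natural process limits (mean ± 2.66·mR̄)."""
        mr_bar = np.abs(np.diff(x)).mean()
        lo, hi = x.mean() - 2.66 * mr_bar, x.mean() + 2.66 * mr_bar
        return np.where((x < lo) | (x > hi))[0]

    rng = np.random.default_rng(1)
    day = rng.lognormal(mean=3.0, sigma=1.2, size=400_000)    # one day of response times
    sample = rng.choice(day, size=20, replace=False)          # the small sample

    print("signals in the full day:", len(xmr_signals(day)))
    print("signals in the sample  :", len(xmr_signals(sample)))
    print(f"max of the day: {day.max():.0f} ms, max in the sample: {sample.max():.0f} ms")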

The big problem with this solution is that small samples are just not representative of extreme value processes, precisely because they throw away information about extreme events. I don’t know what it is we would be controlling with this approach, but it would not be the extreme value process we are interested in.

But there are assignable causes of variation!

The most traditional response from spc practitioners (including Donald Wheeler when I asked him about this) is that the response time chart does show assignable causes of variation. The software doesn’t just suddenly go slower on its own. There’s something else happening in the system that slows down some responses.

And that’s actually true. There are garbage collections going on which freeze the system momentarily, and this makes some requests much slower than others. If we separated out “response time without gc” from “response time with gc”, we would get two processes that are both in control.

From this perspective, the XmR chart did exactly what it was supposed to: it showed us that there are really two interleaved processes masquerading as one.

The problem is that it’s not practical to separate the non-gc responses from the gc responses. From our bird’s-eye view, we need to be able to treat the application as unchanged from deployment to deployment. It’s a meaningless operation to control for gc behaviour, because it’s part of the application as it runs.

In other words, this is a case in which we need to treat a mixture of two distributions as one, and this is precisely the thing control charts were designed to refuse to do!
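To see the mixture argument in miniature, here is a simulated sketch: a hypothetical stream in which roughly 5 % of requests hit a gc pause. The pause sizes and proportions are made up for illustration, but the pattern is the point: each sub-process should look reasonably well behaved on its own, while the combined stream produces signals.

    import numpy as np

    def xmr_signal_count(x):
        """Number of points outside the XmR natural process limits."""
        mr_bar = np.abs(np.diff(x)).mean()
        lo, hi = x.mean() - 2.66 * mr_bar, x.mean() + 2.66 * mr_bar
        return int(((x < lo) | (x > hi)).sum())

    rng = np.random.default_rng(2)
    n = 500
    hit_gc = rng.random(n) < 0.05                     # ~5 % of requests hit a gc pause
    base = rng.normal(20, 3, size=n)                  # fast responses, in ms
    pause = rng.normal(200, 30, size=n)               # size of a gc pause, in ms
    observed = np.where(hit_gc, base + pause, base)   # what we actually measure

    print("signals, responses without gc:", xmr_signal_count(observed[~hit_gc]))
    print("signals, responses with gc   :", xmr_signal_count(observed[hit_gc]))
    print("signals, the mixed stream    :", xmr_signal_count(observed))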

We could transform the data

Many of the extreme value processes in software can be fairly well fitted to a log-normal distribution. If we squint a little with our brains, we can pretend that this means they are still composed of routine variation, only the routine variation is multiplicative instead of additive. This may even feel intuitive under the perspective that software has a fractal nature.

If this is true, then we should be able to take the logarithm of our measurements and construct the control chart from that. (This was Donald Wheeler’s second suggestion when I asked, although he couched it in all sorts of warnings! and rightly so.)
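As a sketch of that suggestion: take logarithms, compute the usual XmR limits on the log scale, and, if we like, map the limits back to the original scale for reporting. The data here are simulated log-normal values, not the article’s measurements.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.lognormal(mean=3.0, sigma=1.0, size=50)   # simulated response times, in ms
    y = np.log(x)                                     # multiplicative -> additive variation

    mr_bar = np.abs(np.diff(y)).mean()
    lo, hi = y.mean() - 2.66 * mr_bar, y.mean() + 2.66 * mr_bar
    signals = np.where((y < lo) | (y > hi))[0]

    # The limits can be mapped back to the original scale for reporting.
    print(f"limits on the original scale: [{np.exp(lo):.1f}, {np.exp(hi):.1f}] ms")
    print("signals at points:", signals.tolist())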

I’m also not happy with this solution, primarily because it doesn’t actually solve the problem in the case of a lot of data with a small tail exponent (i.e. data that has a very fat tail, in layman’s terms).

Measure the time between extreme events

What sets apart extreme value processes is not so much the frequency of extreme events (in some sense the frequency may even be considered lower than for thin-tailed processes) as their size. If the size is the problem, we could ignore the size of extreme events and instead focus on the time between them, having defined some threshold for what counts as an extreme event.

If we choose a sufficiently high threshold, the events are Poisson, i.e. the time between them should be roughly exponentially distributed. This is again something traditional spc tools can handle.
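A sketch of what that might look like: pick a high threshold (the 99th percentile here, which is my assumption rather than a recommendation), record the gaps between exceedances, and chart the gaps instead of the raw values. The data are again simulated.

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.lognormal(mean=3.0, sigma=1.2, size=10_000)   # simulated response times, in ms

    threshold = np.quantile(x, 0.99)       # "extreme" = above the 99th percentile
    exceedances = np.where(x > threshold)[0]
    gaps = np.diff(exceedances)            # requests between successive extreme events

    # If exceedances are roughly Poisson, the gaps are roughly exponential, and
    # a chart of the gaps (or their logs) is back on more familiar ground.
    mr_bar = np.abs(np.diff(gaps)).mean()
    lo, hi = gaps.mean() - 2.66 * mr_bar, gaps.mean() + 2.66 * mr_bar
    print(f"threshold = {threshold:.0f} ms, {len(gaps)} gaps, limits = [{lo:.0f}, {hi:.0f}]")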

I haven’t evaluated this approach myself yet. My main concern is that it, again, throws away valuable data about tail behaviour. (Recall that dichotomising data is equivalent to throwing a third of it in the rubbish bin, in terms of statistical power.) But it’s certainly the least bad we have seen so far.

Create a new type of process behaviour chart

I feel like it should be possible to design a new type of process behaviour chart that can handle extreme value processes. There is a branch of statistics called extreme value theory, or evt, which among other things dabbles in extreme value processes. For a good technical introduction, I recommend Modelling Extremal Events (Modelling Extremal Events: For Insurance and Finance; Embrechts et al.; Springer; 1997) – to no little degree because it is also recommended by Nassim Nicholas Taleb (think what you want of him, but he knows his extreme value processes).

Maybe one can borrow knowledge from there to create a type of control chart that is usable with large quantities of extreme value data?

Here is, for example, a run chart over 50,000 measurements of response time.

ev-spc-04.svg

If we want a traditional control chart to be useful, we’d have to either sample or take the mean of large groups of response times. We’d get something like this.

ev-spc-05.svg

Despite averaging out large numbers of requests, this still indicates a signal at point 8, which investigation would reveal was not due to an assignable cause.

What if we instead fit a generalised Pareto distribution to large blocks of response times, and plugged the estimated tail exponent into an XmR chart?

ev-spc-06.svg

Now we’re talking! There’s a big baseline shift at around t=10, which eventually exceeds the process behaviour limits. And indeed, around halfway through the day there was a change in the behaviour of the response time! At this time a stabilising change was deployed. (Higher tail exponent means better-behaved distribution.)

We could notice this change because we charted a value that’s commonly of interest to extreme value theorists: the tail exponent. But really, a control chart of an estimated parameter of a distribution fit to blocks of data? It feels a little complicated, and it’s not very creative. But my stats-fu is also too weak to do better.
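For concreteness, here is a rough sketch of the block-fit idea on simulated data. It fits a generalised Pareto distribution to the excesses over a per-block threshold and charts the fitted shape parameter ξ; the tail exponent discussed above is commonly taken to be 1/ξ, so charting either carries much the same information. The block size, threshold, and distribution parameters are all assumptions.

    import numpy as np
    from scipy.stats import genpareto

    rng = np.random.default_rng(5)
    x = rng.lognormal(mean=3.0, sigma=1.2, size=50_000)   # simulated response times, in ms

    block_size = 2_000
    shapes = []
    for block in x.reshape(-1, block_size):
        u = np.quantile(block, 0.95)                   # per-block threshold (an assumption)
        excesses = block[block > u] - u
        shape, _, _ = genpareto.fit(excesses, floc=0)  # shape ξ of the fitted GPD
        shapes.append(shape)

    s = np.array(shapes)
    mr_bar = np.abs(np.diff(s)).mean()
    lo, hi = s.mean() - 2.66 * mr_bar, s.mean() + 2.66 * mr_bar
    print("fitted shape per block:", np.round(s, 2))
    print(f"XmR limits on the shape: [{lo:.2f}, {hi:.2f}]")

A real change in tail behaviour, like the stabilising deployment in the chart above, would show up as a shift in the charted parameter from one block to the next.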

I guess if this article has a point, that would be it: go out there and figure out how to do statistical process control on extreme value processes. Then report back. I would really like to know, and in 50 years, humanity may thank you.

Appendix A: More examples of extreme value processes

Here are some other examples of situations where we might encounter this problem:

  • The sizes of downloaded files for a user. Most are small, but there are usually a few that are really big and dominate total bytes downloaded.
  • The financial impact of feature ideas: some ideas are responsible for an outsize amount of product revenue, and most are near-meaningless at best.
  • The size of monthly payments from different customers. Most customers pay us a fairly trifling amount in the grand scheme of things, but a few customers bring in a large share of the revenue.
  • The time between successive peaks in memory usage of our application. One peak is often followed shortly by another, but then suddenly there’s a huge gap until the next peak.