
It Takes Long to Become Gaussian


There’s a popular LessWrong post about how quickly the central limit theorem overpowers the original shape of a distribution. In some cases, that is true – but the conclusion of the article concerns me:

This helps explain how carefree people can be in assuming the clt applies, sometimes even when they haven’t looked at the distributions: convergence really doesn’t take much.

This claim carries some risk, because assuming an inappropriate theoretical distribution shields your reasoning from reality. Conditioning on a hypothesis based on a theoretical distribution is a way to simplify calculations, but it also establishes a worldview where plausible outcomes are practically impossible. Both of these characteristics of theoretical distributions are especially true when the distribution in question is the Gaussian.

One would hope that the claim in the post would then come only after a thorough look at the evidence. Unfortunately, the post relies heavily on eyeball statistics and uses a cherry-picked set of initial distributions that are well-behaved under repetition.

Summary

A lot of real-world data does not converge quickly under the central limit theorem. Be careful about applying it blindly.

Now, let’s look at some examples and some alternatives to the clt.

Real-world data can converge slowly

The LessWrong post was filed under forecasting, so let’s put on our forecasting hats and examine its claim using real-world data. Since the LessWrong post indicates that summing 30 of something is enough to get a normal distribution, I will talk about sums of 30 things.

Will a set of files in my documents directory be larger than 16 MB?

Imagine for a moment that we need to transfer 30 files from my documents directory on a medium that’s limited to 16 MB in size (maybe an old USB stick). Will we be able to fit them in, if we don’t yet know which 30 files will be selected?

The mean size of all files is 160 kB (roughly 0.2 MB), and the standard deviation is 0.9 MB. Since we are talking about 30 files, if the LessWrong post is to be trusted, their sum will be normally distributed as

\[N(\mu = 30 \times 0.2,\;\;\;\; \sigma = \sqrt{30} \times 0.9).\]

With these parameters, 16 MB should have a z-score of about 2, and my computer (using slightly more precise numbers) helpfully tells me it’s 2.38, which under the normal distribution corresponds to a probability of 0.8 %.

So that’s it, there’s only a 0.8 % chance that the 30 files don’t fit. Great!
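For those who want to follow along, the calculation looks roughly like this in R (the inputs below are the rounded figures quoted above, so the result won’t match the 2.38 exactly):

    ## CLT estimate: probability that 30 files exceed 16 MB, assuming their
    ## total size is normally distributed. Inputs are the rounded mean and
    ## standard deviation quoted in the text.
    mu    <- 0.16   # mean file size in MB (160 kB)
    sigma <- 0.9    # standard deviation of file size in MB
    n     <- 30     # number of files
    limit <- 16     # capacity of the medium in MB

    z <- (limit - n * mu) / (sqrt(n) * sigma)
    z                             # about 2.3 with these rounded inputs
    pnorm(z, lower.tail = FALSE)  # CLT estimate of the risk of not fitting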

Except …

[Figure: clt-convergence-01.svg]

In this plot, the normal density suggested by the central limit theorem is drawn as a black line, whereas the shaded bars show the actual total sizes of random groups of 30 files in my documents folder (using find and perl to get the sizes into R).

There are two things of note:

  1. The plot extends out to 50 MB because when we pick 30 files from my documents directory, one plausible outcome is that they total 50 MB. That happens. If you trust the normal distribution, you would consider that to be impossible!
  2. If we count the area of the bars beyond 16 MB, we’ll find that the actual chance of 30 files not fitting into a 16 MB medium is 4.8 %.

The last point is the most important one for the question we set out with. The actual risk of the files not fitting is 6× higher than the central limit theorem would have us believe.

If we adhere to the Gaussian hypothesis, we will think we can pick 30 files every day and only three times per year will they not fit. If we actually try to do it, we will find that every three weeks we get a bundle of files that don’t fit.

Will the S&P 500 ever drop more than $1000 in a month?

Using the same procedure but with data from the S&P 500 index, the central limit theorem suggests that a drop of more than $1000 over a 30-day horizon happens about 0.9 % of the time.

[Figure: clt-convergence-02.svg]

This fit might even look good, but it’s easy to be fooled by eyeball statistics. If we count the proportion of the area under the bars, we discover the real probability is 1.9 %, i.e. more than twice as high. Also note that in the real data (downloaded from Yahoo Finance), a month with a $3000 drop is possible, contrary to what the normal distribution would suggest.

All of this said, I have to admit that the normal distribution fits this data more closely than I would have expected. Random walks in finance are notoriously jumpy and rarely converge quickly under the central limit theorem. (See Taleb, Mandelbrot, and others for more on this.) I think there are two reasons our S&P 500 data is well behaved: it’s a composite of many stocks, meaning we’re not really looking at 30 things but more like 15000 things, and the data comes from a relatively stable period (year to date). Note the consequence: fairly well-behaved data that optically looks like a decent fit can still result in a 2× error in the tail!

Do we need to study more than 7000 km² of Finnish lakes?

Maybe a bit convoluted, but imagine we’re studying the wildlife in Finnish lakes. There are a lot of Finnish lakes, so we’ll study a random sample of just 30 of them. Our budget only covers 7000 km² of lake area – what’s the risk we blow our budget?

Going by the clt, the risk is a measly 0.4 %. That we can live with.

[Figure: clt-convergence-03.svg]

But according to the real data (sampled from Wikipedia), the probability is over 3× as high, at 1.4 %.

Much like in the previous example, this might look like a good fit if we just eyeball it. But if there’s something we should eyeball, it’s the tails. The fit is not good in the tails. That’s what will cause us problems down the road.

Will the next 30 armed conflicts cost more than 250,000 lives yearly?

According to the clt, the probability of the next 30 armed conflicts costing more than 250,000 lives yearly is 2.5 %, but according to the real data (also from Wikipedia), there’s a 4.8 % risk.

[Figure: clt-convergence-04.svg]

Again, we underestimate the risk by almost 2× when we blindly use the central limit theorem. In this case, the Gaussian is also visually a bad fit.

Use the real data if you can

So here’s my radical proposal: instead of mindlessly fitting a theoretical distribution onto your data, use the real data. We have computers now, and they are really good at performing the same operation over and over – exploit that. Resample from the data you have! To find out the risk of 30 files not fitting onto a 16 MB medium, draw many random sets of 30 files and record how their total sizes are distributed. That way you’ll have a pretty good idea of how many of those sets are larger than 16 MB.
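As a rough sketch of what that looks like in R – sizes_mb is a stand-in for however you load your own data, and the filename is made up:

    ## Resampling sketch: estimate the probability that 30 randomly chosen
    ## files total more than 16 MB, straight from the observed sizes.
    ## "file-sizes-mb.txt" is a hypothetical file with one size (in MB) per line.
    sizes_mb <- scan("file-sizes-mb.txt")

    ## Draw many random sets of 30 files and record their total sizes.
    totals <- replicate(100000, sum(sample(sizes_mb, 30)))

    ## Proportion of sets that don't fit on the 16 MB medium.
    mean(totals > 16)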

Blindly slapping a normal distribution on things is a holdover from the time when we didn’t have fast computers and relied on the normal distribution’s nice theoretical properties to make pen-and-paper analysis tractable.

Keep Cantelli’s inequality in mind

Sometimes we’re not able to resample. Maybe we need to mentally estimate something on the spot, or our phones are low on battery so we can’t use them to resample.

If we must draw conclusions only from the first two moments (mean and variance), there are more robust ways of doing it than assuming a normal distribution. One alternative is Cantelli’s inequality, which says that the probability mass more than k standard deviations above the mean is at most

\[\frac{1}{1 + k^2}.\]

For example, at most \((1 + 1^2)^{-1} =\) 50 % of the data can be higher than one standard deviation above the mean, at most \((1 + 2^2)^{-1} =\) 20 % can be higher than two standard deviations, and so on. This bound is absolute – it comes out of the definition of the standard deviation.

There’s also a two-sided version of Cantelli’s inequality, called Chebyshev’s inequality. If you’re familiar with the 68–95–99.7 rule, which assumes a normal distribution, you might want to also learn the Chebyshev version of it: 0–75–89, which holds under any distribution. Perhaps surprisingly, Chebyshev’s inequality tells us there can be distributions where none of the data are inside one standard deviation!

9-sigma movements can happen

In connection with the 2007–2008 financial crisis, some executives complained that they were seeing “price movements that only happen once every hundred thousand years” – something like that; I don’t remember the details. This makes it seem like they were modeling price movements with a normal distribution, and saw 9-sigma movements.

If only they had known Cantelli’s inequality. Since we do, we can tell that a 9-sigma movement can happen up to \((1 + 9^2)^{-1} \approx\) 1.2 % of the time – with daily movements, that’s as often as four times a year. They usually don’t, but it would be imprudent not to consider the possibility.

Here’s a handy table:

Sigma   Pr. exceeding (normal)   Pr. exceeding (Cantelli bound)
1       16 %                     50 %
2       2.3 %                    20 %
3       0.14 %                   10 %
4       0.003 %                  6 %
5       0.00003 %                4 %
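The table is easy to reproduce – and extend to the 9-sigma case from above – with a few lines of R, using only the normal tail and the Cantelli formula:

    ## Tail probability beyond k standard deviations: normal assumption
    ## versus the distribution-free Cantelli bound 1 / (1 + k^2).
    k <- c(1, 2, 3, 4, 5, 9)
    data.frame(
      sigma    = k,
      normal   = pnorm(k, lower.tail = FALSE),
      cantelli = 1 / (1 + k^2)
    )
    ## At k = 9 the Cantelli bound is about 1.2 %, which with daily
    ## movements is as often as four times a year.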

Practical statistics often uses three standard deviations as the limit for what’s normally possible. That is reasonable under a normal distribution, where only about 0.14 % of data points fall more than three standard deviations above the mean. It’s not true in general, though: if we’re unlucky, as many as one in ten data points can fall outside three standard deviations.

If you need to, blend Cantelli with Gaussian

Since the Cantelli inequality is true even for the most extreme data, it’s necessarily quite conservative. In most real-world cases, we won’t see 4 % of values outside of five standard deviations. A mental heuristic I sometimes apply is to assume a value somewhere between the Gaussian probability and the Cantelli bound. The more reason I have to believe the distribution is Gaussian, the closer to the Gaussian probability I pick. This process is just as unscientific as it sounds, and it looks like it could be possible to extract better heuristics from the Wikipedia page for Chebyshev’s inequality.
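Written down as code – which makes it look more rigorous than it is – the heuristic is something like a weighted average, where the weight w is entirely subjective:

    ## Rough sketch of the blending heuristic: interpolate between the
    ## Gaussian tail and the worst-case Cantelli bound. w = 1 means
    ## "I fully trust the Gaussian", w = 0 means "assume the worst case".
    ## The linear blend is my own arbitrary choice, nothing principled.
    blended_tail <- function(k, w) {
      w * pnorm(k, lower.tail = FALSE) + (1 - w) / (1 + k^2)
    }

    blended_tail(3, w = 0.5)   # halfway between 0.14 % and 10 %, about 5 %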

This data was deliberately picked on suspicion of heavy tails

As you may have guessed, I deliberately selected this data because it was a type of data I was fairly certain would converge slowly. Things like file sizes, stock movements, sizes of bodies of water, and severity of armed conflicts exhibit heavy-tailed behaviour, meaning it’s very common to find central values, but it’s also surprisingly common to find values really far out in the tail.

We could go into much more detail on the consequences and properties of heavy-tailed data, but that’s beside the point for this article. Be aware that heavy-tailed data exists, and if there’s any chance you are dealing with it, be careful about assuming the central limit theorem converges quickly.

Heavy tails converge slower than you think

As an illustration of how slowly heavy-tailed data converge, here is the file size data convolved 1000 times – that is, sums of 1000 files instead of 30.

[Figure: clt-convergence-01-2.svg]

It starts to look better, but the tails are still not a great fit – and this is quite a ways away from the 30 summands suggested by the LessWrong article! Keep in mind this is from a fixed set of 560 files; if we were dealing with an actual theoretical distribution able to generate an infinite number of different values, it would converge even more slowly.
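If you want to check the fit at some other number of summands yourself, the resampling approach from before extends naturally. Here is a sketch that compares the 99th percentile the CLT predicts with a resampled one (sizes_mb is the same placeholder vector of file sizes as before):

    ## Sketch: how well does the CLT describe the tail of n-fold sums of the
    ## file sizes? Sampling is done with replacement, since n can exceed the
    ## number of files we actually have.
    tail_check <- function(sizes_mb, n, reps = 10000) {
      sums <- replicate(reps, sum(sample(sizes_mb, n, replace = TRUE)))
      c(clt_p99       = qnorm(0.99, mean = n * mean(sizes_mb),
                              sd = sqrt(n) * sd(sizes_mb)),
        empirical_p99 = unname(quantile(sums, 0.99)))
    }

    tail_check(sizes_mb, n = 30)
    tail_check(sizes_mb, n = 1000)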

Don’t assume it is Gaussian just because it looks like it

When hearing “the more reason I have to believe the distribution is Gaussian”, it can be tempting to eyeball the data in front of us and see if it sort of looks like a bell curve. That is a mistake. Compared to the normal distribution, heavy-tailed data disproportionately generate two things:

  1. Values far out in the tail, and
  2. Central values.

In other words, a heavy-tailed distribution will generate a lot of central values and look like a perfect bell curve for a long time … until it doesn’t anymore. Annoyingly, the worse the tail events, the longer it will keep looking like a friendly bell curve. Intuitively, heavier tails mean probability mass is shifted from the shoulders of the distribution into the center and the tails, but more into the center than the tails, on aggregate. The tails still do expand, however. This is the sort of thing that might take an animation to explain properly, but I don’t know how to make those.

The best way I know to figure out whether data is heavy-tailed or Gaussian is to be careful and apply theoretical knowledge judiciously. Defer to real-world data when possible.

Thin-tailed distributions

Many of the theoretical distributions we know, including every single one used in the LessWrong article, are classified as thin-tailed. This means their tails decay at least as fast as that of the exponential distribution, which is the conventional threshold between thin- and heavy-tailed distributions.

This explains why the author found the distributions converged so quickly under the central limit theorem – thin-tailed distributions are very well behaved under repetition: you get the central values and maybe something shallow into the tails, but never a deep tail event. Those deep tail events are what make heavy-tailed distributions converge so slowly.
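To see the difference in miniature, compare 30-fold sums of a thin-tailed distribution with those of a heavy-tailed one. The exponential and log-normal below are my choice of stand-ins – they are not distributions from the LessWrong article:

    ## Sketch: the 99.9th percentile of a 30-fold sum, as predicted by the
    ## CLT versus estimated by simulation. mu and sigma are the theoretical
    ## mean and standard deviation of a single draw.
    clt_vs_simulated <- function(draw, mu, sigma, n = 30, reps = 100000) {
      sums <- replicate(reps, sum(draw(n)))
      c(clt       = qnorm(0.999, mean = n * mu, sd = sqrt(n) * sigma),
        simulated = unname(quantile(sums, 0.999)))
    }

    set.seed(1)
    ## Thin tail: the CLT quantile lands close to the simulated one.
    clt_vs_simulated(function(n) rexp(n), mu = 1, sigma = 1)
    ## Heavy tail: the simulated quantile ends up far above the CLT one.
    clt_vs_simulated(function(n) rlnorm(n, meanlog = 0, sdlog = 2),
                     mu = exp(2), sigma = sqrt((exp(4) - 1) * exp(4)))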
