Two Wrongs

Generating Almost Normally Distributed Values

Every now and then I want to generate normally distributed values in something like Perl, which does not have built-in functions for it. Unfortunately, there’s no easily remembered way to generate numbers from the normal distribution. I just learned the best way to do it: don’t. Generate logistically distributed values instead.

Seriously! Look at it: the bars are logistically distributed values, and the curve is the normal distribution. That’s absolutely close enough for any practical purpose I can think of. (The normal distribution occurs frequently in proofs of exact mathematical relationships. In the real world, uncertainties around which specific distribution is appropriate are often large enough that an approximation is just as good as the real deal.)

gen-norm-01.svg

The reason the logistic distribution is so much better is that it’s easy to remember how to draw from it. Here’s the algorithm:

  1. Generate a uniformly distributed value between 0 and 1. Almost all standard libraries have functions for this. Call this value \(p\). It will be the quantile we select from the logistic distribution.
  2. Compute the log-odds corresponding to that quantile as such:

    \[\ell = \log{\frac{p}{1-p}}.\]

  3. Give it the correct standard deviation by multiplying with a magic number, giving us \(x = 0.55 \ell\). (Well, 0.55 is an approximation. The real magic number is \(\frac{\sqrt{3}}{\pi}\). But I don’t know how I’d remember that, so 0.55 is easier.)
  4. Done! Now \(x\) is a draw from almost the standardised normal distribution.
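
For the curious, here is a short check of where the magic number comes from. It leans on one standard fact not stated above: the log-odds \(\ell\) of a uniformly distributed \(p\) follows the standard logistic distribution, whose variance is \(\pi^2/3\). The notation \(\operatorname{sd}\) for standard deviation is just shorthand introduced for this check.

\[\operatorname{sd}(\ell) = \frac{\pi}{\sqrt{3}} \approx 1.814, \qquad \operatorname{sd}\!\left(\frac{\sqrt{3}}{\pi}\,\ell\right) = 1, \qquad \operatorname{sd}(0.55\,\ell) = 0.55 \cdot \frac{\pi}{\sqrt{3}} \approx 0.998.\]

So rounding the multiplier to 0.55 leaves the standard deviation roughly 0.2 % short of 1, which is far below anything that matters here.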

If we want a different mean and/or standard deviation, we shift and scale the same way we would if it were a real normal draw, i.e. \(\mu + \sigma x\). In other words, given a uniformly distributed \(p\), the complete process is

\[x = \mu + 0.55 \sigma \log{\frac{p}{1-p}}.\]

That’s it.
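
As a minimal sketch, the complete process might look like this in Python (picked only for illustration; any language with a uniform random number function will do). The function name `almost_normal` and its `rng` parameter are made up for this example, and the redraw on an exact zero is a precaution added here rather than part of the recipe above.

```python
import math
import random

def almost_normal(mu=0.0, sigma=1.0, rng=random):
    """Draw one logistically distributed value that approximates N(mu, sigma^2)."""
    # Step 1: a uniform value in [0, 1). Redraw on an exact 0.0,
    # because log(0) is undefined (assumption: rng.random() never returns 1.0).
    p = rng.random()
    while p == 0.0:
        p = rng.random()
    # Step 2: the log-odds corresponding to that quantile.
    log_odds = math.log(p / (1.0 - p))
    # Step 3: rescale with the easy-to-remember 0.55, then shift and scale.
    return mu + 0.55 * sigma * log_odds

if __name__ == "__main__":
    # Hypothetical usage: five draws with mean 100 and standard deviation 15.
    print([round(almost_normal(100, 15), 1) for _ in range(5)])
```

Passing a seeded `random.Random(...)` instance as `rng` should make the draws reproducible, which is handy in tests.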

Logistic distribution has fatter tails

As we can see from the plot above, the logistic distribution has slightly fatter tails than the normal distribution. Here’s the same plot zoomed in to the region beyond two standard deviations. Grey is still the logistic distribution, black is the normal.

gen-norm-02.svg

This means that when drawing from the logistic distribution we will get slightly more central values and slightly more extreme values, at the expense of the values in between. This is usually a good thing, because few real-world processes are as well behaved as the normal distribution.
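
To put rough numbers on "slightly fatter", here is a small sketch that compares upper-tail probabilities of the two distributions when both have mean 0 and standard deviation 1. It assumes the exact scale \(\frac{\sqrt{3}}{\pi}\) rather than 0.55, and the function names are invented for this example.

```python
import math

# Logistic scale that gives a standard deviation of exactly 1.
SCALE = math.sqrt(3) / math.pi

def logistic_upper_tail(x):
    """P(X > x) for a logistic distribution with mean 0 and standard deviation 1."""
    return 1.0 / (1.0 + math.exp(x / SCALE))

def normal_upper_tail(x):
    """P(X > x) for the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2))

if __name__ == "__main__":
    for x in (2.0, 3.0, 4.0):
        print(f"beyond {x} sd: logistic {logistic_upper_tail(x):.5f}, "
              f"normal {normal_upper_tail(x):.5f}")
```

By this calculation, about 2.6 % of logistic draws land above two standard deviations, against about 2.3 % for the normal. At three standard deviations the gap is wider: about 0.43 % against 0.13 %.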

Drawing from the actual normal distribution

The problem with drawing from the actual normal distribution is that we don’t have a closed-form way of doing it the same way: its quantile function cannot be written in terms of elementary functions, so it needs to be done numerically or with a different trick altogether. For details, see the Wikipedia page, section Computational methods. It’s not that these methods are complicated, it’s just that I have no chance of remembering how to do it (well, except perhaps the Irwin–Hall approximation) and I hate having to look up basic algorithms.

I can easily remember

\[x = \mu + 0.55 \sigma \log{\frac{p}{1-p}}\]

for the rest of my life – especially now, having written this article!
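
For comparison, the Irwin–Hall approximation mentioned above could be sketched roughly like this. It sums 12 uniform values, which is the common textbook choice because the variance then comes out as exactly 1. The function name `irwin_hall_normal` and the `rng` parameter are hypothetical.

```python
import random

def irwin_hall_normal(mu=0.0, sigma=1.0, rng=random):
    """Approximate a normal draw by summing 12 uniform values (Irwin-Hall, n = 12)."""
    # Each uniform value has mean 1/2 and variance 1/12,
    # so the sum of 12 of them has mean 6 and variance 1.
    z = sum(rng.random() for _ in range(12)) - 6.0
    return mu + sigma * z
```

Note that this one errs in the opposite direction from the logistic trick: the result can never be more than six standard deviations from the mean, so its tails are cut off rather than fattened.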

Spot the true normal

In case you still don’t believe this is a good idea, here’s your chance to prove me wrong. Half of these plots are drawn from the real normal distribution, whereas half come from a logistic distribution. Guess which one is which! You get a generous 80 samples from each.

gen-norm-03.svg

I’m happy to receive guesses. Why don’t we set up a scoreboard?

Guesser     p-value
Nick        0.17
jocke-l     0.17
kqr         0.50
modeless    0.83

The p-value is the probability of guessing that many correct (or more) if one truly has no idea which is which. In other words, I got a few right, but this result or better would have happened by chance 50 % of the time even if I had no way of telling them apart. Not a very impressive result.