# Beta, Gamma, and Pareto Conjugate Priors

I am very grateful to David Robinson for writing his article on Understanding the Beta Distribution (Using Baseball Statistics). That sort of intuition for Bernoulli trials has been useful for mentally modeling things.

Since reading that (a while ago) I have also been able to capture similar intuition for the Gamma and Pareto distributions. I want to share this with you!

# Beta Distribution

I suggest you read the (very brief) article by David Robinson referenced in the introduction, but I’ll present another take on the same basic idea.

Imagine you flip a coin and you have been seeing the following pattern: Heads, tails, tails, tails, tails, heads, heads, tails, heads, tails.

The coin has ended up on tails 6 times out of 10 – you could conclude that “since a completely fair coin should end up on heads exactly 5 times, and tails exactly 5 times, when thrown 10 times, this coin must be biased.” But of course, that would be stupid. That’s not how randomness works. Even a fair coin can end up tails 6 times out of 10.

However, if we keep throwing it, and it keeps landing on tails more than heads, then we might be onto something. In other words, how strongly we believe the coin is biased depends on two pieces of information:

- How often does it land on tails? (In this case, 60 percent.)
- How many times have we thrown it? (In this case, 10 times.)

Tossing a coin is, mathematically speaking, a *Bernoulli* process:
each toss has some probability \(p\) of coming up 1 (heads), and otherwise is 0 (tails).
In the case of a fair coin,
we know that the probability \(p = 0.5\).

However, in this case we *don’t yet know anything about \(p\)*!
As far as we know,
\(p\) could be anything from 0 (coin always lands tails) to 1 (coin always lands heads.)

We could imagine that
the probability \(p\) *itself* is a random variable following a *Beta* distribution.
What’s so nice about that is that every time we toss our coin,
we can easily plug in the result to update our belief about \(p\).[^1]

[^1]: Technically, we call the Beta distribution a *conjugate prior* to the Bernoulli distribution, because when computing the posterior distribution of the parameter \(p\), the resulting expression simplifies to a Beta distribution again, but with different parameters.

This gets a bit abstract, so let’s work through a concrete example.

The Beta distribution has two parameters: \(\alpha\) and \(\beta\). When we start out knowing nothing about \(p\), we set \(\alpha = 1, \beta = 1\). In other words, we say that \(p\) is drawn from \(\mathrm{Beta}(\alpha = 1, \beta = 1)\).

Then we bring out our powerful computer and ask for 1000 guesses at what \(p\) could be.
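As a sketch of what “asking the computer” might look like, assuming NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Knowing nothing, alpha = beta = 1: Beta(1, 1) is just the uniform
# distribution on (0, 1), so every guess for p is equally likely.
guesses = rng.beta(a=1.0, b=1.0, size=1000)
```

A histogram of these guesses comes out flat: no value of \(p\) is favoured over any other.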

As you can see, the reasonable guesses for the probability \(p\) are all over the place. Any guess is just as valid as another.

But then something cool happens.
When we toss the coin the first time, we will get an *observation*.
We can use this observation to update our belief about the probability \(p\).

This is the algorithm: our current understanding of \(p\) is encoded in the parameters \(\alpha, \beta\). If we flip the coin and it lands heads, then we add one to \(\alpha\). Otherwise, if the coin lands tails, we add one to \(\beta\) instead.
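A minimal sketch of that update rule (the function name is my own; the sequence of tosses is the one from the introduction):

```python
def update(alpha, beta, heads):
    """One Bayesian update of a Beta prior after a single coin toss."""
    if heads:
        return alpha + 1, beta
    return alpha, beta + 1

# The ten tosses from the introduction: H T T T T H H T H T
tosses = [True, False, False, False, False, True, True, False, True, False]

alpha, beta = 1, 1          # start from the know-nothing prior Beta(1, 1)
for heads in tosses:
    alpha, beta = update(alpha, beta, heads)

print(alpha, beta)          # 5 7 -- i.e. Beta(5, 7) after 4 heads and 6 tails
```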

So let’s say we toss the coin and get heads. Our updated belief about \(p\) is then that it is distributed according to \(\mathrm{Beta}(\alpha = 2, \beta=1)\).

What does this mean for \(p\)?

Aha!
Most guesses are still valid,
but the low end of the diagram is looking somewhat empty.[^2]

[^2]: This makes sense if you think about what that end of the diagram means: if \(p = 0\), the coin always lands on tails. Since the only outcome we’ve seen so far is heads, it’s unlikely the coin is strongly biased towards tails.

We can continue tossing the coin, and update our belief about \(p\) after every toss. (You probably see a convenient interpretation for the two parameters \(\alpha, \beta\) now – \(\alpha - 1\) represents the number of heads, and \(\beta - 1\) the number of tails.) Given the 10 tosses we have seen so far, we ask the computer for a guess of what \(p\) might be:
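One way to ask, sketched with NumPy (the 95% interval is computed from samples rather than read off a figure):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# After 4 heads and 6 tails on top of Beta(1, 1), the posterior is Beta(5, 7).
samples = rng.beta(a=5.0, b=7.0, size=100_000)

lo, hi = np.percentile(samples, [2.5, 97.5])
# Most of the probability mass sits inside (0.1, 0.8) -- an interval
# that still comfortably contains a perfectly fair coin at p = 0.5.
```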

Given the sequence of tosses we saw and a Beta prior, we can say with relative confidence that \(p\) is probably between 0.1 and 0.8. That does not help us decide if the coin is biased, because you know what’s between 0.1 and 0.8? A perfectly fair coin.

To make a more definitive ruling, we would have to flip the coin more times. Let’s flip the coin another 490 times, and update our belief about \(p\) each time.

When we’re done, we’ve flipped the coin 500 times, 268 of which landed on heads.

Generating values for \(p\) with the computer reveals that it’s highly likely \(p\) is between 0.45 and 0.6.

But so is a fair coin.
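The same check for the 500-toss posterior, as a sketch:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# 268 heads and 232 tails on top of the Beta(1, 1) prior.
samples = rng.beta(a=269.0, b=233.0, size=100_000)

lo, hi = np.percentile(samples, [2.5, 97.5])
# The 95% interval is now much narrower -- but it still
# contains the fair-coin value p = 0.5.
```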

If you really want to prove that the coin is unfair,
you could toss the coin another 500 times.
But to spare you the trouble,
I’ll reveal that all you’d do
is more strongly demonstrate that the coin is fair,
because I used a fair coin to get these numbers.[^3]

[^3]: If you truly, truly want to prove the coin is unfair, despite it being fair, what you have to do is reset the counters. Set \(\alpha\) and \(\beta\) back to 1. Start the experiment over with a lower number of tosses, like 100. You want enough tosses to sound convincing, but few enough to get a streak by accident. Then if you fail to prove the coin is unfair, try again. Reset the counters. Flip 100 times again. This is how you cheat at science: repeat the experiment a good number of times and publish only the one time when, by accident, you got the result you wanted.

# Gamma Distribution

The Gamma distribution plays the same role for an unknown *rate* \(\lambda\) that the Beta distribution plays for an unknown probability \(p\): it is the conjugate prior for the rate parameter of both the Exponential and the Poisson distributions, so each observation updates the parameters \(\alpha, \beta\) directly.

If \(x \sim \mathrm{Exponential}(\lambda)\) where \(\lambda \sim \mathrm{Gamma}(\alpha, \beta)\), then the posterior for \(\lambda\) is \(\mathrm{Gamma}(\alpha + 1, \beta + x)\).

If \(x \sim \mathrm{Poisson}(\lambda)\) where \(\lambda \sim \mathrm{Gamma}(\alpha, \beta)\), then the posterior for \(\lambda\) is \(\mathrm{Gamma}(\alpha + x, \beta + 1)\).
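Sketched as code, with \(\beta\) read as a rate parameter (note: some libraries parametrise Gamma by scale \(= 1/\beta\) instead):

```python
def update_exponential(alpha, beta, x):
    """Gamma posterior for the rate after observing one waiting time x."""
    return alpha + 1, beta + x

def update_poisson(alpha, beta, x):
    """Gamma posterior for the rate after observing one count x."""
    return alpha + x, beta + 1

# Illustrative example: starting from Gamma(1, 1), observe a waiting
# time of 2.5 in the Exponential case, and a count of 3 in the Poisson case.
print(update_exponential(1, 1, 2.5))  # (2, 3.5)
print(update_poisson(1, 1, 3))        # (4, 2)
```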

# Pareto Distribution

Say you’re walking down to the bus stop every day, and the bus comes often enough that you can just go down to the stop without looking at the timetable.

Then your waiting times are uniformly distributed: you’ll wait either 0 minutes (if you happen to arrive at the same time as the bus) or \(k\) minutes, where \(k\) is how often the bus arrives (corresponding to you just missing a bus and having to wait for the next one), or any number of minutes in between.

What can we say about how often buses come based on how long you’ve had to wait?

A convenient prior for \(k\) (time between buses) is the *Pareto* distribution.
The Pareto distribution is parametrised by two numbers: \(m\) and \(a\).
When we don’t know anything, we set both of these to 1,
meaning we think buses probably come relatively often,
but they could just as well come only very rarely.
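Asking the computer for guesses at \(k\) under this prior might look like the sketch below (NumPy’s `pareto` draws the zero-shifted “Lomax” form, so we shift and scale it back):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

m, a = 1.0, 1.0
# numpy's pareto() samples start at 0; adding 1 and multiplying
# by m gives samples from Pareto(m, a), which all lie above m.
guesses = m * (1.0 + rng.pareto(a, size=1000))
```

With \(a = 1\) the tail is so heavy that the occasional guess is enormous: buses might come every few minutes, or almost never.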

With a Pareto prior, things can change very drastically with each observation.

The formula to update \(m\) and \(a\) is as follows: record how many minutes you have to wait for the bus, and call this number \(x\).

Then if \(x\) is larger than \(m\), use \(x\) as the new \(m\). If \(x\) is less than \(m\), keep the old \(m\).

Whichever is the case above, always add one to \(a\).

In other words, \(m\) represents the longest you’ve ever had to wait, and \(a\) represents the number of observations you have made.
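The update rule is tiny; as a sketch (the function name is my own):

```python
def update_pareto(m, a, x):
    """Update a Pareto(m, a) prior after waiting x minutes for the bus."""
    return max(m, x), a + 1

# Day one of the experiment: a 9-minute wait.
print(update_pareto(1, 1, 9))  # (9, 2)
```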

Let’s say on the first day of the experiment you wait for 9 minutes.

I told you it would change quickly.

Given the wait time of 9 minutes,
the updated prior states that
the bus *cannot* arrive more frequently than every 9 minutes,
because if it did,
you wouldn’t have waited for 9 minutes.[^4]

[^4]: Obviously, this model disregards things such as delays and random variability in the arrival times of buses. It assumes buses arrive exactly when they are supposed to.

The following days, you have to wait 6, 9, 2, 7, 8, 12, 1, 3 and 4 minutes.

We set \(m\) to be the greatest between the previous \(m\) (which was 9) and the numbers you recorded, i.e. 12. We set \(a\) to be one more than the total number of recordings.
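Folding the first ten observations into the prior, as a sketch:

```python
# The day-one wait plus the nine waits from the following days.
waits = [9, 6, 9, 2, 7, 8, 12, 1, 3, 4]

m, a = 1, 1                  # the know-nothing prior Pareto(1, 1)
for x in waits:
    m, a = max(m, x), a + 1

print(m, a)  # 12 11 -- longest wait so far, and one more than the ten waits
```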

Already, we can see that it is likely the bus comes at least every 18 minutes, but probably more often.

We record nine more waiting times: 8, 3, 5, 0, 12, 10, 10, 7, 4.

Note that when updating our belief, any number lower than \(m\) does not change \(m\). It might feel like we’re throwing away data by doing that, but that’s not quite the case, and you might see why in the updated diagram.

Since no number so far has been greater than 12,
the probable time between buses hasn’t moved in the diagram,
but the distribution has *narrowed*.

Intuitively,
the fact that we see additional numbers below \(m\)
is weak evidence in *support* of \(m\) being the true maximum wait time.
This support is carried through the \(a\) parameter,
which narrows the distribution.

Just before ending the experiment, you record five more wait times: 7, 13, 2, 14, 12, and update the prior.

The mean value of these samples is 15 minutes.[^5]

[^5]: There is a neat intuitive view for why the mean value is slightly greater than the greatest observed value: if the \(n\) observations are truly uniformly distributed, they are expected to split the range from 0 to \(k\) into \(n + 1\) roughly equal gaps, which means the greatest value \(m\) will on average be \(k/(n+1)\) less than the true \(k\).
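That intuition is easy to check by simulation (the values of \(k\) and \(n\) here are illustrative assumptions, not the article’s data):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

k, n = 15.0, 24   # hypothetical true interval between buses, and sample count
samples = rng.uniform(0.0, k, size=(100_000, n))
mean_max = samples.max(axis=1).mean()

# The maximum of n uniform draws on (0, k) averages k * n / (n + 1),
# i.e. it falls short of the true k by about k / (n + 1).
print(round(mean_max, 1))  # close to 14.4
```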

There is an actual, real-life ferry that I sometimes ride. The numbers in this article are actual waiting times I measured. Do you want to know how often the ferry arrives? According to the timetable, every 15 minutes.