Intuition and Spaced Repetition

Can you force intuition for difficult concepts by beating an idea into your head repeatedly? I’m not sure, but sometimes it feels that way.

I’ve written previously about how my reading patterns have changed since I started with spaced repetition. Since then, I’ve experimented with a tweak to the approach: instead of spending a long time with something to understand it firmly, I make flashcards, move on fairly quickly, and then go back and re-read the parts of the material that seem important but that I still don’t get. Yes, this is how people have recommended reading technical texts since the beginning of technical writing, but I’ve been slow to pick it up. At some point, it’s like a light switch flips on.

Example with a property of subexponential distributions

Here’s a property of subexponential distributions given in a book I’m currently reading (Modelling Extremal Events; Embrechts et al.; Springer; 1997):

\[\overline{F^{n\ast}}(y) \sim n \overline{F}(y)\;\; \mathrm{as}\;\; y \rightarrow \infty\]

This seemed important to me, so I bookmarked the page it was on, even though it made no sense the first time I read it. I kept reading and at one point I thought I understood it, but I absolutely did not. Around the fifth time I came back to it and thought about it, suddenly it was clear!

In this post, I will explain how I came to that intuition. The point is that every single fact I list is one that I have read somewhere and made a flashcard from. Together, they add up to an intuitive sense of something theoretically dense.

Seeing self-convolutions graphically

To start somewhere, let’s imagine a random walk generated from coin flips. Here’s a realisation of that. At every point in time, we go forward one, and then if a coin flip lands heads, we also go up one, otherwise we go down one.

int-spa-rep-01.svg
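In case you want to play along at home, here’s a minimal sketch of how such a walk can be generated. This is my own reconstruction with numpy – the original plots may well have been produced differently:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def coin_flip_walk(n_steps=40):
    """A random walk where each step goes up one (heads) or down one (tails)."""
    steps = rng.choice([-1, 1], size=n_steps)  # one coin flip per step
    return np.cumsum(steps)                    # y coordinate after each step

walk = coin_flip_walk()
print(walk[-1])  # the y coordinate of the final point
```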

As we go through examples of random walks, I want you to keep one question in your head: where does this random walk end up? What are the coordinates of the final point?

The first coordinate (x value) we know: it’s the number of steps – in this case 40. The second coordinate (y value) is a little more complicated, because it’s a sum of (in this case) 40 random values. In other words, the second coordinate is going to itself be a random value. Keep that in mind!

We are not limited to walks based on coin flips. We can draw step sizes from any other distribution, like the normal distribution. (In fact, this is the limit of the coin flip walk: if we keep adding steps and zooming out, it becomes harder and harder to distinguish the coin flip walk from one with normal step sizes.)

int-spa-rep-03.svg

What are the final coordinates of this random walk? The general idea is still the same: the first coordinate is 40, the second is the sum of 40 random values, though this time they are drawn from the normal distribution.

To get a better understanding of where this random walk can end up, let’s generate multiple random walks of the same type.

int-spa-rep-05.svg

Perhaps unsurprisingly, it is more common for the random walk to end up on central points, and then it thins out toward the extreme values.

Each y coordinate is associated with a probability that the random walk ends up on that coordinate, and this function is itself a probability distribution. For a random walk with \(n\) steps, each drawn from the distribution \(F\), we notate the distribution of the final coordinate (in technical terminology, the n-fold convolution of \(F\) with itself) as

\[F^{n\ast}(y).\]

In other words, when we say \(F^{n\ast}\) we mean, intuitively speaking, the distribution of the height of the final point of a random walk that has taken \(n\) steps, each step drawn from the distribution \(F\).

For some step sizes (e.g. normal) we can compute an exact distribution for the final point of the random walk, but the distribution exists even if we cannot compute it exactly.
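Here’s a rough sketch of how one could approximate this distribution by simulation – the parameters (40 steps, standard normal step sizes) are my assumption:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def final_points(n_steps=40, n_walks=10_000):
    """Sample the final y coordinate of many random walks with standard
    normal step sizes; their histogram approximates F^{n*}."""
    steps = rng.normal(size=(n_walks, n_steps))
    return steps.sum(axis=1)  # one final coordinate per walk

finals = final_points()
# For normal steps the exact answer is known: a sum of 40 standard normals
# is itself normal with mean 0 and variance 40.
print(finals.mean(), finals.var())  # should be close to 0 and 40
```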

Seeing a random walk survive a barrier

Let’s assume there’s something bad that happens below \(y=10\). Maybe there will be a great flood, and all random walks that stop below that level will die. In this plot, the random walks that survive the flood are drawn in bold black. (Incidentally, this is a great illustration of Darwinian evolution. When we look at the world around us, we see only the end product of natural selection – we see only the bold black paths. Imagine you couldn’t see the translucent grey ones. If you told me “these random walks have a positive bias” I would have believed you. But they don’t! It’s all selection bias. People before Darwin thought evolution had a bias toward the more complex. It does not; it’s all selection bias, Darwin said.)

int-spa-rep-06.svg

In this instance, around 8 of the 100 random walks survived the barrier at \(y=10\), so the surviving proportion is approximately 8 %. This proportion depends on the height of the barrier, and viewed as a function of that height it’s called the survival function of the distribution. When the distribution is that of the final point of a random walk, the survival function is notated mathematically as

\[\overline{F^{n\ast}}(y).\]

For any barrier at \(y\), the probability that a random walk of \(n\) steps survives (i.e. ends up above it) is given by \(\overline{F^{n\ast}}(y)\).
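A quick sketch of how that survival probability can be estimated empirically – again assuming standard normal step sizes, which is my guess at what the plots use:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def empirical_survival(y, n_steps=40, n_walks=100):
    """Estimate the survival function of the walk's final point: the
    fraction of walks whose final y coordinate ends up above the barrier."""
    finals = rng.normal(size=(n_walks, n_steps)).sum(axis=1)
    return np.mean(finals > y)

print(empirical_survival(10))  # roughly 0.05-0.08 with these parameters
```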

Look where we are now. The relationship we are trying to understand is, repeated for convenience,

\[\overline{F^{n\ast}}(y) \sim n \overline{F}(y)\;\; \mathrm{as}\;\; y \rightarrow \infty\]

Now we know what the first part of that means! It’s the probability that a random walk of \(n\) steps stops higher than some \(y\).

Single step survival is low

Now imagine a random walk of just a single step. What do you think is the probability that it survives the barrier at \(y=10\)?

int-spa-rep-07.svg

Haha, nope! The step is random, but it is practically guaranteed to be less than, say, 3. The probability that we survive the barrier at \(y=10\) given just a single step is virtually 0 %. This probability is also a survival function, but of a single step size rather than of a full walk. You may be able to guess how it’s written, mathematically:

\[\overline{F}(y).\]
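To put a number on “virtually 0 %”: assuming the steps are standard normal (an assumption of mine), the single-step survival probability at \(y=10\) can be computed directly with scipy:

```python
from scipy.stats import norm

# Survival function of a single standard normal step at the barrier y = 10:
# the probability that one step, on its own, exceeds 10.
print(norm.sf(10))  # about 7.6e-24, i.e. virtually 0 %
```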

But what if we had multiple shots at it? What if we could launch 40 random walks – err, random steps? What is the probability that at least one of them survives? Under relevant assumptions – i.e. that \(y\) is chosen so that \(\overline{F}(y)\) is small, \(n\) is kept reasonably low, and the steps are independent – we can approximate the probability that at least one of \(n\) random steps survives the barrier with

\[n \overline{F}(y)\]
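To see where this approximation comes from: with independent steps, the probability that none of the \(n\) steps exceeds \(y\) is \((1 - \overline{F}(y))^n\), so

\[P(\text{at least one of } n \text{ steps exceeds } y) = 1 - \left(1 - \overline{F}(y)\right)^n \approx n \overline{F}(y),\]

where the last step holds because the higher-order terms of the expansion are negligible when \(n \overline{F}(y)\) is small.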

For the barrier at \(y=10\), the result is disappointing:

int-spa-rep-08.svg

It doesn’t matter how many chances we are given – the step is always so small that practically none of the steps will exceed 10.

So now we know what the second part of the relationship means too!

\[n \overline{F}(y)\]

is an approximation for the probability that at least one out of \(n\) independent random steps exceeds \(y\).

So the relationship,

\[\overline{F^{n\ast}}(y) \sim n \overline{F}(y)\;\; \mathrm{as}\;\; y \rightarrow \infty\]

can be read as: the probability that a random walk of length \(n\) exceeds \(y\) approaches the probability that at least one out of \(n\) independent random steps exceeds \(y\), as we raise the barrier \(y\) arbitrarily high.

Reality check for normally distributed steps

Does that make sense, though? We saw before that a random walk of 40 steps easily exceeded the barrier at \(y=10\), but a random step has no chance of breaking that barrier – not even many of them.

Now imagine we go from \(n=40\) to \(n=400\) – does that make the relationship more true? No. The 400 single steps still have no chance of surviving the \(y=10\) barrier, but the random walk will be increasingly able to. It doesn’t help if we pick a higher barrier (that \(y \rightarrow \infty\) condition) – the general idea is still the same: a step is too small, too localised, to ever have a chance of surviving a barrier the way a random walk does.

We are missing one thing: the relationship isn’t supposed to hold for the normal distribution, because the normal distribution is not subexponential. The relationship holds specifically for subexponential distributions.

Random walks with subexponential steps

To start off our journey with subexponential distributions, let’s look at a random walk again, this time with Cauchy distributed steps – the Cauchy distribution belongs to the subexponential class.

int-spa-rep-09.svg

It’s a little more wiggly than the random walk with normally distributed steps. Let’s zoom out a little.

int-spa-rep-10.svg

To be clear, there’s nothing fundamentally different about this random walk. It works according to the exact same principle as the one with coin tosses we saw first. The only difference is that when a step is about to happen, the step size is drawn from a Cauchy distribution instead. The result of this is that most of the steps it takes are very small, but sometimes it makes huge leaps as well. (What it doesn’t do very often is take moderately sized steps: the Cauchy distribution results in more small steps and big steps, and fewer medium-size steps.)
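Here’s a sketch of how such a walk can be generated – standard Cauchy steps, which is an assumed parametrisation on my part:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def cauchy_walk(n_steps=1000):
    """A random walk with standard Cauchy step sizes: mostly tiny steps,
    with the occasional enormous leap."""
    steps = rng.standard_cauchy(size=n_steps)
    return np.cumsum(steps)

walk = cauchy_walk()
print(np.abs(np.diff(walk)).max())  # the biggest leap between consecutive points
```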

Now, let’s get a sense of how the distribution

\[F^{n\ast}\]

differs between the cases where \(F\) is the normal distribution and \(F\) is the Cauchy distribution.

int-spa-rep-11.svg

The normally distributed steps give a random walk that looks like we would expect.

int-spa-rep-12.svg

The Cauchy distributed steps give a significantly more … interesting random walk. A lot of the time it doesn’t really go anywhere, but every once in a while it makes a huge leap.

At this point, you may suspect where this is going. Here are 20 random walks against a barrier at \(y=150\).

int-spa-rep-13.svg

Look at which of those random walks end up over the barrier. Notice something about them? It usually involves a big leap. Now look at this:

\[\overline{F^{n\ast}}(y) \sim n \overline{F}(y)\;\; \mathrm{as}\;\; y \rightarrow \infty\]

Or, in English: the probability that a random walk of length \(n\) exceeds \(y\) approaches the probability that at least one out of \(n\) independent random steps exceeds \(y\), as we raise the barrier \(y\) arbitrarily high.

It makes sense! The ability of a short random walk with subexponential steps to clear a very high barrier rests almost entirely on its ability to generate a step roughly of barrier-clearing size at some point during that walk. This does not happen for random walks with exponential and superexponential step sizes, because they need to rely on the cumulative effects of random chance and small steps to survive barriers.
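As a rough sanity check, here’s a sketch of a Monte Carlo comparison of the two sides of the relationship. The parameters (40 steps, standard Cauchy vs. standard normal step sizes, and the particular barriers) are choices of mine, not from the book:

```python
import numpy as np
from scipy.stats import cauchy, norm

rng = np.random.default_rng(seed=1)

def compare(step_sampler, single_step_sf, y, n_steps=40, n_walks=200_000):
    """Compare the walk's survival probability with n * (single-step sf)."""
    finals = step_sampler(size=(n_walks, n_steps)).sum(axis=1)
    walk_sf = np.mean(finals > y)                # estimate of survival of F^{n*}
    return walk_sf, n_steps * single_step_sf(y)  # n * F-bar(y)

# Cauchy steps: both sides come out around 0.013 at a barrier of 1000,
# and they get closer the higher the barrier.
print(compare(rng.standard_cauchy, cauchy.sf, y=1_000))
# Normal steps: the walk clears y=15 roughly 1 % of the time, but
# n * F-bar(15) is astronomically small: the approximation fails.
print(compare(rng.normal, norm.sf, y=15))
```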

Thus, it’s a property of subexponential distributions.

Intuition and Spaced Repetition

Looking back at what was required to understand the relationship, it dawns on me that spaced repetition does not help us intuit things by hammering in that one concept until it makes sense. Rather, it helps us intuit things by hammering in everything around it – all of the little prerequisites that one needs to know by heart.

It’s a little like building with Lego bricks: spaced repetition helps ensure all the tiny pieces are in the right place, so that the big castle can be built without structural integrity issues.