# Bootstrapping Kaplan–Meier Survival Curves

# Table of Contents

In the previous article on survival analysis for customer retention, we learned how to compute a Kaplan–Meier estimation that tells us how long customers stay with us.

We started with data from our nine customers, past and current. They had stayed with us for this many months:

8 | 12 | 12 | 13 | 17 | 29 | 40 | 59 | 70 |

— | — | — | X | — | X | — | X | X |

Where X indicates that the customer in question cancelled after that many
months, and a dash indicates that the customer is still with us, and has been
for that many months *so far*.

We ended up with the following estimation of survival probabilities:

Data set | Minimum | 25 % | 50 % | 75 % | Maximum |
---|---|---|---|---|---|

Original data | 12 | 21 | 47 | 61 | 70 |

This says that based on historical experience, customers stay with us at least 12 months and at most 70 months. Half of them cancel after 47 months. Now we want to know how confident we can be in these results.

Just by intuition alone, we can probably say “not very confident” because we have only 9 data points, of which the event of interest has only occurred in 4.

However, to make a good decision we probably want to know in numbers what “not very confident” means. There are some theoretical ways to estimate confidence, but in general, if I have access to a computer, I find the bootstrap to be much easier.

# The Bootstrap

There’s a beautiful idea behind the bootstrap, that’s unfortunately a bit
complicated to explain without more statistical background.^{1}^{1} It relates to
how the observations we have are the best information we have about the
underlying truth, but it’s a bit more complicated than it sounds. So we’ll jump
straight into practise.

In more practical terms, there are two steps to using the bootstrap:

- Resample (with replacement) from the data collected. Perform whatever analysis we want on this data. The outcome of this is a bootstrap replication of the analysis.
- Repeat step one many times to create many bootstrap replications. Treat the bootstrap replications as the distribution of sampling error in our analysis.

## Step 1: Replicate

The data we have collected is, to reiterate,

8 | 12 | 12 | 13 | 17 | 29 | 40 | 59 | 70 |

— | — | — | X | — | X | — | X | X |

To create a bootstrap replication of the Kaplan–Meier survival curve, we draw a new sample from the original data with replacement. In our case, this means we generate a random number between 1 and 9, nine times. That number indicates which data points we choose from our original data.

The result of this might be that we take data points 1, 1, 2, 2, 4, 5, 5, 8, 9. We compute the Kaplan–Meier estimation of those data:

Index | Time | Status | Cum. survival |
---|---|---|---|

1 | 8 | — | 100 % |

1 | 8 | — | 100 % |

2 | 12 | — | 100 % |

2 | 12 | — | 100 % |

4 | 13 | X | 80 % |

5 | 17 | — | 80 % |

5 | 17 | — | 80 % |

8 | 59 | X | 40 % |

9 | 70 | X | 0 % |

And then the summary statistics:

Data set | Minimum | 25 % | 50 % | 75 % | Maximum |
---|---|---|---|---|---|

Original data | 12 | 21 | 47 | 61 | 70 |

Bootstrap replication 1 | 12 | 22 | 49 | 63 | 70 |

In this case, the replication ended up being fairly close to our original data. It will often be, but not always.

## Step 2: Repeat

We create a new replication. This time our random number generator gave us 4, 5, 6, 6, 7, 8, 9, 9, 9.

Index | Time | Status | Cum. survival |
---|---|---|---|

4 | 13 | X | 89 % |

5 | 17 | — | 89 % |

6 | 29 | X | 63 % |

6 | 29 | X | 63 % |

7 | 40 | — | 63 % |

8 | 59 | X | 48 % |

9 | 70 | X | 0 % |

9 | 70 | X | 0 % |

9 | 70 | X | 0 % |

In this case, our random number generator selected almost only customers that canceled! This gets us a new replication of the summary statistics:

Data set | Minimum | 25 % | 50 % | 75 % | Maximum |
---|---|---|---|---|---|

Original data | 12 | 21 | 47 | 61 | 70 |

Bootstrap replication 1 | 12 | 22 | 49 | 63 | 70 |

Bootstrap replication 2 | 0 | 23 | 56 | 64 | 70 |

We do it again:

Data set | Minimum | 25 % | 50 % | 75 % | Maximum |
---|---|---|---|---|---|

Original data | 12 | 21 | 47 | 61 | 70 |

Bootstrap replication 1 | 12 | 22 | 49 | 63 | 70 |

Bootstrap replication 2 | 0 | 23 | 56 | 64 | 70 |

Bootstrap replication 3 | 12 | 13 | 49 | 60 | 70 |

And again:

Data set | Minimum | 25 % | 50 % | 75 % | Maximum |
---|---|---|---|---|---|

Original data | 12 | 21 | 47 | 61 | 70 |

Bootstrap replication 1 | 12 | 22 | 49 | 63 | 70 |

Bootstrap replication 2 | 0 | 23 | 56 | 64 | 70 |

Bootstrap replication 3 | 12 | 13 | 49 | 60 | 70 |

Bootstrap replication 4 | 12 | 12 | 13 | 13 | 29 |

If we keep doing it very many times, we will end up with a large number of potential values for our percentile estimations.

Let’s focus on the 50 % column: by what time will half our customers have canceled? If we take the 50 % column after many such bootstrap replications, it might look like

47 | 49 | 56 | 49 | 13 | 42 | 59 | 48 | 24 |

21 | 48 | 49 | 66 | 62 | 29 | 46 | 64 | 29 |

49 | 59 | ∞ | 64 | 29 | 47 | 44 | 47 | 64 |

40 | 51 | 47 | 59 | 66 | 59 | 50 | 51 | 29 |

70 | 50 | 29 | 62 | 23 | 49 | 28 | 40 | 53 |

We see that the first five numbers are the ones we’ve seen before: the very first one is what we got with our original data, and the next four are the first four bootstrap replications.

The infinity symbol might surprise you: this was from a replication where almost no customers cancelled, so in that replication, we estimate that half of our customers will practically never cancel. We represent “practically never” as infnity in our calculations to make things easier.

We can create a stem-and-leaf plot^{2}^{2} A stem-and-leaf plot is basically a
lightweight text-based histogram, popularised by John Tukey. For example, the
last measurement we had, 53, becomes a 3 in the row marked 5. Similarly, 28
becomes an 8 in the row marked 2, and so on. of these numbers to get a better
sense of how they are distributed:

0 |
||||||||||||||||

1 |
3 | |||||||||||||||

2 |
1 | 3 | 4 | 8 | 9 | 9 | 9 | 9 | 9 | |||||||

3 |
||||||||||||||||

4 |
0 | 0 | 2 | 4 | 6 | 7 | 7 | 7 | 7 | 8 | 8 | 9 | 9 | 9 | 9 | 9 |

5 |
0 | 0 | 1 | 1 | 3 | 6 | 9 | 9 | 9 | 9 | ||||||

6 |
2 | 2 | 4 | 4 | 4 | 6 | 6 | |||||||||

7 |
0 | |||||||||||||||

8 |
||||||||||||||||

9 |
||||||||||||||||

… |
∞ |

This distribution gives us an indication of the sampling error of the Kaplan–Meier estimator at the half point. Simple arithmetic tells us that the mass of the data (the central 90 % of it) lies between 23 and 66. Another way to put it is that we estimate that half of our customers cancel after 47 months, with 23–66 months being our 90 % confidence interval.

# Graphic Survival Curves

It was easy to derive this confidence interval with simulation, but for additional intuition, we can look at what happened graphically.

We can plot our original Kaplan–Meier estimation the way it’s customary to: as a descending stepped line, with little cross-marks for censored observations.

We can then draw in the first bootstrap replication, too (as a lighter gray line):

Then we add one more:

And we keep going. In this case I’ve changed two things to make the result more clear: I’ve added some time jitter to each replication to more clearly distinguish the curves, and I’ve chosen to linearly interpolate between events. The latter is a big no-no, but it helps make the mass of bootstrap replications more visually clear.

This is not as easy to read^{3}^{3} and the actual result depends a bit on how the
randomisation works out each time I re-generate this article from its source
code but we can still sort of see where the bulk of the data builds up: the 0.5
horizontal (which indicates when half of our customers have cancelled) sees
survival curves roughly in the 20–60 month range.

With more data, it would look less stepped and nicer.

# Hazard Rate Estimation

We have looked at how long it takes half of our customers to cancel, but we can do something else, too! We go back to the plot of the original data.

If we try to fit a line through this, we get something like

The slope of this line is -0.012, which means that each month, we lose on
average 1.2 % of our customers. Phrased differently, the every month, there’s a
1.2 % risk that a customer cancels. This is known as the *hazard rate* in the
field.^{4}^{4} You’ll find, if you fit a parametric exponential accelerated
failure time model to this data, that it estimates a hazard rate of 1.5 %. This
may well be down to the choice of model, and the Kaplan–Meier estimator is free
of model assumptions.

As before, we can draw new bootstrap replications and compute these lines for each of them:

The central 90 % of these slopes lie between 0.88 % and 1.76 %, so we can say that we estimate the hazard rate to be 1.12 %, with a 90 % confidence interval of 0.88 %–1.76 %.

This is a *very* useful metric to know for your business.