Bootstrapping Kaplan–Meier Survival Curves

kqr

, published 2022-07-28

Tags:

The Bootstrap
- Step 1: Replicate
- Step 2: Repeat
Graphic Survival Curves
Hazard Rate Estimation
Referencing This Article

In the previous article on survival analysis for customer retention, we learned how to compute a Kaplan–Meier estimation that tells us how long customers stay with us.

We started with data from our nine customers, past and current. They had stayed with us for this many months:

8	12	12	13	17	29	40	59	70
—	—	—	X	—	X	—	X	X

Where X indicates that the customer in question cancelled after that many months, and a dash indicates that the customer is still with us, and has been for that many months so far.

We ended up with the following estimation of survival probabilities:

Data set	Minimum	25 %	50 %	75 %	Maximum
Original data	12	21	47	61	70

This says that based on historical experience, customers stay with us at least 12 months and at most 70 months. Half of them cancel after 47 months. Now we want to know how confident we can be in these results.

Just by intuition alone, we can probably say “not very confident” because we have only 9 data points, of which the event of interest has only occurred in 4.

However, to make a good decision we probably want to know in numbers what “not very confident” means. There are some theoretical ways to estimate confidence, but in general, if I have access to a computer, I find the bootstrap to be much easier.

The Bootstrap

There’s a beautiful idea behind the bootstrap, that’s unfortunately a bit complicated to explain without more statistical background.1¹ It relates to how the observations we have are the best information we have about the underlying truth, but it’s a bit more complicated than it sounds. So we’ll jump straight into practise.

In more practical terms, there are two steps to using the bootstrap:

Resample (with replacement) from the data collected. Perform whatever analysis we want on this data. The outcome of this is a bootstrap replication of the analysis.
Repeat step one many times to create many bootstrap replications. Treat the bootstrap replications as the distribution of sampling error in our analysis.

Step 1: Replicate

The data we have collected is, to reiterate,

8	12	12	13	17	29	40	59	70
—	—	—	X	—	X	—	X	X

To create a bootstrap replication of the Kaplan–Meier survival curve, we draw a new sample from the original data with replacement. In our case, this means we generate a random number between 1 and 9, nine times. That number indicates which data points we choose from our original data.

The result of this might be that we take data points 1, 1, 2, 2, 4, 5, 5, 8, 9. We compute the Kaplan–Meier estimation of those data:

Index	Time	Status	Cum. survival
1	8	—	100 %
1	8	—	100 %
2	12	—	100 %
2	12	—	100 %
4	13	X	80 %
5	17	—	80 %
5	17	—	80 %
8	59	X	40 %
9	70	X	0 %

And then the summary statistics:

Data set	Minimum	25 %	50 %	75 %	Maximum
Original data	12	21	47	61	70
Bootstrap replication 1	12	22	49	63	70

In this case, the replication ended up being fairly close to our original data. It will often be, but not always.

Step 2: Repeat

We create a new replication. This time our random number generator gave us 4, 5, 6, 6, 7, 8, 9, 9, 9.

Index	Time	Status	Cum. survival
4	13	X	89 %
5	17	—	89 %
6	29	X	63 %
6	29	X	63 %
7	40	—	63 %
8	59	X	48 %
9	70	X	0 %
9	70	X	0 %
9	70	X	0 %

In this case, our random number generator selected almost only customers that canceled! This gets us a new replication of the summary statistics:

Data set	Minimum	25 %	50 %	75 %	Maximum
Original data	12	21	47	61	70
Bootstrap replication 1	12	22	49	63	70
Bootstrap replication 2	0	23	56	64	70

We do it again:

Data set	Minimum	25 %	50 %	75 %	Maximum
Original data	12	21	47	61	70
Bootstrap replication 1	12	22	49	63	70
Bootstrap replication 2	0	23	56	64	70
Bootstrap replication 3	12	13	49	60	70

And again:

Data set	Minimum	25 %	50 %	75 %	Maximum
Original data	12	21	47	61	70
Bootstrap replication 1	12	22	49	63	70
Bootstrap replication 2	0	23	56	64	70
Bootstrap replication 3	12	13	49	60	70
Bootstrap replication 4	12	12	13	13	29

If we keep doing it very many times, we will end up with a large number of potential values for our percentile estimations.

Let’s focus on the 50 % column: by what time will half our customers have canceled? If we take the 50 % column after many such bootstrap replications, it might look like

47	49	56	49	13	42	59	48	24
21	48	49	66	62	29	46	64	29
49	59	∞	64	29	47	44	47	64
40	51	47	59	66	59	50	51	29
70	50	29	62	23	49	28	40	53

We see that the first five numbers are the ones we’ve seen before: the very first one is what we got with our original data, and the next four are the first four bootstrap replications.

The infinity symbol might surprise you: this was from a replication where almost no customers cancelled, so in that replication, we estimate that half of our customers will practically never cancel. We represent “practically never” as infnity in our calculations to make things easier.

We can create a stem-and-leaf plot2² A stem-and-leaf plot is basically a lightweight text-based histogram, popularised by John Tukey. For example, the last measurement we had, 53, becomes a 3 in the row marked 5. Similarly, 28 becomes an 8 in the row marked 2, and so on. of these numbers to get a better sense of how they are distributed:

0
1	3
2	1	3	4	8	9	9	9	9	9
3
4	0	0	2	4	6	7	7	7	7	8	8	9	9	9	9	9
5	0	0	1	1	3	6	9	9	9	9
6	2	2	4	4	4	6	6
7	0
8
9
…	∞

This distribution gives us an indication of the sampling error of the Kaplan–Meier estimator at the half point. Simple arithmetic tells us that the mass of the data (the central 90 % of it) lies between 23 and 66. Another way to put it is that we estimate that half of our customers cancel after 47 months, with 23–66 months being our 90 % confidence interval.

Graphic Survival Curves

It was easy to derive this confidence interval with simulation, but for additional intuition, we can look at what happened graphically.

We can plot our original Kaplan–Meier estimation the way it’s customary to: as a descending stepped line, with little cross-marks for censored observations.

We can then draw in the first bootstrap replication, too (as a lighter gray line):

Then we add one more:

And we keep going. In this case I’ve changed two things to make the result more clear: I’ve added some time jitter to each replication to more clearly distinguish the curves, and I’ve chosen to linearly interpolate between events. The latter is a big no-no, but it helps make the mass of bootstrap replications more visually clear.

This is not as easy to read3³ and the actual result depends a bit on how the randomisation works out each time I re-generate this article from its source code but we can still sort of see where the bulk of the data builds up: the 0.5 horizontal (which indicates when half of our customers have cancelled) sees survival curves roughly in the 20–60 month range.

With more data, it would look less stepped and nicer.

Hazard Rate Estimation

We have looked at how long it takes half of our customers to cancel, but we can do something else, too! We go back to the plot of the original data.

If we try to fit a line through this, we get something like

The slope of this line is -0.012, which means that each month, we lose on average 1.2 % of our customers. Phrased differently, the every month, there’s a 1.2 % risk that a customer cancels. This is known as the hazard rate in the field.4⁴ You’ll find, if you fit a parametric exponential accelerated failure time model to this data, that it estimates a hazard rate of 1.5 %. This may well be down to the choice of model, and the Kaplan–Meier estimator is free of model assumptions.

As before, we can draw new bootstrap replications and compute these lines for each of them:

The central 90 % of these slopes lie between 0.88 % and 1.76 %, so we can say that we estimate the hazard rate to be 1.12 %, with a 90 % confidence interval of 0.88 %–1.76 %.

This is a very useful metric to know for your business.

Referencing This Article

Computing Poker Win Rate From Sessions

Two Wrongs