Two Wrongs

Bootstrapping Kaplan–Meier Survival Curves

Bootstrapping Kaplan–Meier Survival Curves

In the previous article on survival analysis for customer retention, we learned how to compute a Kaplan–Meier estimation that tells us how long customers stay with us.

We started with data from our nine customers, past and current. They had stayed with us for this many months:

8 12 12 13 17 29 40 59 70
X X X X

Where X indicates that the customer in question cancelled after that many months, and a dash indicates that the customer is still with us, and has been for that many months so far.

We ended up with the following estimation of survival probabilities:

Data set Minimum 25 % 50 % 75 % Maximum
Original data 12 21 47 61 70

This says that based on historical experience, customers stay with us at least 12 months and at most 70 months. Half of them cancel after 47 months. Now we want to know how confident we can be in these results.

Just by intuition alone, we can probably say “not very confident” because we have only 9 data points, of which the event of interest has only occurred in 4.

However, to make a good decision we probably want to know in numbers what “not very confident” means. There are some theoretical ways to estimate confidence, but in general, if I have access to a computer, I find the bootstrap to be much easier.

The Bootstrap

There’s a beautiful idea behind the bootstrap, that’s unfortunately a bit complicated to explain without more statistical background.11 It relates to how the observations we have are the best information we have about the underlying truth, but it’s a bit more complicated than it sounds. So we’ll jump straight into practise.

In more practical terms, there are two steps to using the bootstrap:

  1. Resample (with replacement) from the data collected. Perform whatever analysis we want on this data. The outcome of this is a bootstrap replication of the analysis.
  2. Repeat step one many times to create many bootstrap replications. Treat the bootstrap replications as the distribution of sampling error in our analysis.

Step 1: Replicate

The data we have collected is, to reiterate,

8 12 12 13 17 29 40 59 70
X X X X

To create a bootstrap replication of the Kaplan–Meier survival curve, we draw a new sample from the original data with replacement. In our case, this means we generate a random number between 1 and 9, nine times. That number indicates which data points we choose from our original data.

The result of this might be that we take data points 1, 1, 2, 2, 4, 5, 5, 8, 9. We compute the Kaplan–Meier estimation of those data:

Index Time Status Cum. survival
1 8 100 %
1 8 100 %
2 12 100 %
2 12 100 %
4 13 X 80 %
5 17 80 %
5 17 80 %
8 59 X 40 %
9 70 X 0 %

And then the summary statistics:

Data set Minimum 25 % 50 % 75 % Maximum
Original data 12 21 47 61 70
Bootstrap replication 1 12 22 49 63 70

In this case, the replication ended up being fairly close to our original data. It will often be, but not always.

Step 2: Repeat

We create a new replication. This time our random number generator gave us 4, 5, 6, 6, 7, 8, 9, 9, 9.

Index Time Status Cum. survival
4 13 X 89 %
5 17 89 %
6 29 X 63 %
6 29 X 63 %
7 40 63 %
8 59 X 48 %
9 70 X 0 %
9 70 X 0 %
9 70 X 0 %

In this case, our random number generator selected almost only customers that canceled! This gets us a new replication of the summary statistics:

Data set Minimum 25 % 50 % 75 % Maximum
Original data 12 21 47 61 70
Bootstrap replication 1 12 22 49 63 70
Bootstrap replication 2 0 23 56 64 70

We do it again:

Data set Minimum 25 % 50 % 75 % Maximum
Original data 12 21 47 61 70
Bootstrap replication 1 12 22 49 63 70
Bootstrap replication 2 0 23 56 64 70
Bootstrap replication 3 12 13 49 60 70

And again:

Data set Minimum 25 % 50 % 75 % Maximum
Original data 12 21 47 61 70
Bootstrap replication 1 12 22 49 63 70
Bootstrap replication 2 0 23 56 64 70
Bootstrap replication 3 12 13 49 60 70
Bootstrap replication 4 12 12 13 13 29

If we keep doing it very many times, we will end up with a large number of potential values for our percentile estimations.

Let’s focus on the 50 % column: by what time will half our customers have canceled? If we take the 50 % column after many such bootstrap replications, it might look like

47 49 56 49 13 42 59 48 24
21 48 49 66 62 29 46 64 29
49 59 64 29 47 44 47 64
40 51 47 59 66 59 50 51 29
70 50 29 62 23 49 28 40 53

We see that the first five numbers are the ones we’ve seen before: the very first one is what we got with our original data, and the next four are the first four bootstrap replications.

The infinity symbol might surprise you: this was from a replication where almost no customers cancelled, so in that replication, we estimate that half of our customers will practically never cancel. We represent “practically never” as infnity in our calculations to make things easier.

We can create a stem-and-leaf plot22 A stem-and-leaf plot is basically a lightweight text-based histogram, popularised by John Tukey. For example, the last measurement we had, 53, becomes a 3 in the row marked 5. Similarly, 28 becomes an 8 in the row marked 2, and so on. of these numbers to get a better sense of how they are distributed:

0                                
1 3                              
2 1 3 4 8 9 9 9 9 9              
3                                
4 0 0 2 4 6 7 7 7 7 8 8 9 9 9 9 9
5 0 0 1 1 3 6 9 9 9 9            
6 2 2 4 4 4 6 6                  
7 0                              
8                                
9                                
                             

This distribution gives us an indication of the sampling error of the Kaplan–Meier estimator at the half point. Simple arithmetic tells us that the mass of the data (the central 90 % of it) lies between 23 and 66. Another way to put it is that we estimate that half of our customers cancel after 47 months, with 23–66 months being our 90 % confidence interval.

Graphic Survival Curves

It was easy to derive this confidence interval with simulation, but for additional intuition, we can look at what happened graphically.

We can plot our original Kaplan–Meier estimation the way it’s customary to: as a descending stepped line, with little cross-marks for censored observations.

Sorry, your browser does not support SVG.

We can then draw in the first bootstrap replication, too (as a lighter gray line):

Sorry, your browser does not support SVG.

Then we add one more:

Sorry, your browser does not support SVG.

And we keep going. In this case I’ve changed two things to make the result more clear: I’ve added some time jitter to each replication to more clearly distinguish the curves, and I’ve chosen to linearly interpolate between events. The latter is a big no-no, but it helps make the mass of bootstrap replications more visually clear.

Sorry, your browser does not support SVG.

This is not as easy to read33 and the actual result depends a bit on how the randomisation works out each time I re-generate this article from its source code but we can still sort of see where the bulk of the data builds up: the 0.5 horizontal (which indicates when half of our customers have cancelled) sees survival curves roughly in the 20–60 month range.

With more data, it would look less stepped and nicer.

Hazard Rate Estimation

We have looked at how long it takes half of our customers to cancel, but we can do something else, too! We go back to the plot of the original data.

Sorry, your browser does not support SVG.

If we try to fit a line through this, we get something like

Sorry, your browser does not support SVG.

The slope of this line is -0.012, which means that each month, we lose on average 1.2 % of our customers. Phrased differently, the every month, there’s a 1.2 % risk that a customer cancels. This is known as the hazard rate in the field.44 You’ll find, if you fit a parametric exponential accelerated failure time model to this data, that it estimates a hazard rate of 1.5 %. This may well be down to the choice of model, and the Kaplan–Meier estimator is free of model assumptions.

As before, we can draw new bootstrap replications and compute these lines for each of them:

Sorry, your browser does not support SVG.

The central 90 % of these slopes lie between 0.88 % and 1.76 %, so we can say that we estimate the hazard rate to be 1.12 %, with a 90 % confidence interval of 0.88 %–1.76 %.

This is a very useful metric to know for your business.