# Survival Analysis For Customer Retention

Customer retention matters. Acquiring new customers costs money, usually
more money than keeping the customers we already have. Thus, it’s interesting
to know how long our customers stick around.

For this, it’s tempting to use retention data from all customers. Let’s say we have had 9 customers, past and current, and they have been with us for this many months:

8 | 12 | 12 | 13 | 17 | 29 | 40 | 59 | 70 |

They are ordered by size to make it easier to find some summary statistics. We have the following percentile estimations:

Minimum | 25 % | 50 % | 75 % | Maximum |
---|---|---|---|---|
8 | 12 | 17 | 40 | 70 |

We can read this as saying “a customer will stay with us for at least 8 months, and at most 70 months. Half of the customers we get will cancel within 17 months.”
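As a minimal sketch, the summary statistics above can be reproduced with Python's standard library. The variable names are only illustrative, and the choice of `method="inclusive"` (linear interpolation between neighbouring observations) is an assumption about how the percentiles were computed; it agrees with the table.

```python
from statistics import quantiles

# Months with us for all nine customers, past and current.
months = [8, 12, 12, 13, 17, 29, 40, 59, 70]

# n=4 asks for quartiles; "inclusive" interpolates linearly between
# neighbouring observations and never goes outside the data range.
q25, q50, q75 = quantiles(months, n=4, method="inclusive")

summary = [min(months), q25, q50, q75, max(months)]
print(summary)  # [8, 12.0, 17.0, 40.0, 70]
```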

We could use these numbers in our business calculations. So are we done? Not quite.

# Bias Both Ways

The problem is that any analysis done based on these numbers assumes that all
our current customers will cancel their contracts by the end of this month. Why?
Because it treats an active customer that has been with us 8 months *so far* the
same as one that canceled their contract after 8 months.

Let’s look at the data again, now with indications for which customers have canceled (X) and which are active (—).

Months | 8 | 12 | 12 | 13 | 17 | 29 | 40 | 59 | 70 |
---|---|---|---|---|---|---|---|---|---|
Status | — | — | — | X | — | X | — | X | X |

Things look a bit different now:

- Whereas previously we thought the shortest time someone had been a customer
  was 8 months, now all we really know is that it’s *at least* 8 months.
- Whereas we previously thought 25 % of our customers cancel within 12 months, we
  now know that the *first* actual cancellation happened at 13 months!

This is a general theme: we don’t know when our current customers will cancel,
but we know they will stay on as customers for *at least* as long as they have
currently been customers – maybe longer. In other words, ignoring which
customers have cancelled leads to retention data that is biased downward. It
leads to underestimating customer retention.

Confronted with this, we might be tempted to make the opposite mistake: only use data from customers who have actually cancelled. This seriously reduces the amount of data we can analyse (in our example, down to just four data points). It can also suffer from another problem: if we have a new batch of recently acquired customers that have not cancelled yet, we fail to account for the fact that they might not stay on as long as our earlier customers.

If we used only the data from cancelled customers anyway, we would get the following summary statistics:

Minimum | 25 % | 50 % | 75 % | Maximum |
---|---|---|---|---|
13 | 25 | 44 | 62 | 70 |
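A minimal sketch of the same calculation in Python, this time keeping track of who has cancelled. The `(months, cancelled)` pair layout and the names are illustrative choices, and `method="inclusive"` is again an assumption about the interpolation used; the 75 % value comes out as 61.75, which the table rounds to 62.

```python
from statistics import quantiles

# (months with us, has cancelled) for each of the nine customers.
customers = [
    (8, False), (12, False), (12, False), (13, True), (17, False),
    (29, True), (40, False), (59, True), (70, True),
]

# Keep only the customers whose cancellation we actually observed.
cancelled_months = sorted(m for m, cancelled in customers if cancelled)

q25, q50, q75 = quantiles(cancelled_months, n=4, method="inclusive")
summary = [min(cancelled_months), q25, q50, q75, max(cancelled_months)]
print(summary)  # [13, 25.0, 44.0, 61.75, 70]
```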

# Survival Analysis

This is similar to a problem in medical research: patients die, but not all of them.

Medical (and epidemiological) research is sometimes done by recruiting patients who are at risk for something (maybe the ones who have been hospitalised for some serious disease), and then following up on them every month. Some of them die, and then we know in which month they died. At some point, the study ends, and then some patients are still alive. At that point, we know they have been alive for so-and-so many months, but we don’t know when they will die.

Do you see how that leads to the same sort of data analysis problem we’re having?

Formally, we say that the survival times (or customer retention times) are
*right-censored*: we know the true value for the patients who have actually
died, but for the patients who are still alive we only know that their
survival time is greater than the time they have been with us so far.

There’s a branch of statistics called *survival analysis* that deals with data of
this kind. It’s named after the patients dying, but sometimes it’s also called
*time-to-event analysis* because death isn’t the only significant event. In
particular, it’s the right tool to analyse customer retention data.

# The Kaplan–Meier Estimator

One of the most easily understandable tools of survival analysis is the Kaplan–Meier estimator. Given data on how long a customer has been with us, and whether or not they have canceled, it estimates exactly the thing we wanted to know: how long are customers with us?

To figure that out, we start by estimating the probability of customers staying with us at various points in time. Let’s do two quick examples, to make it more concrete. This is the same table as the one we saw before, repeated for convenience.

Months | 8 | 12 | 12 | 13 | 17 | 29 | 40 | 59 | 70 |
---|---|---|---|---|---|---|---|---|---|
Status | — | — | — | X | — | X | — | X | X |

- We have 4 customers that have made it to 29 months, of which one cancelled at that time. Therefore, we estimate that of the customers that make it to 29 months, 75 % will make it further, and 25 % will cancel – based on historic experience.
- We have 3 customers that have stayed with us until 40 months (namely those at 40, 59, and 70 months). We know two of them stayed with us longer than 40 months, and we assume the third one will also stay longer, since they haven’t cancelled yet. Thus, we estimate that of the customers that make it to 40 months, 100 % will make it further, and 0 % will cancel.

We have made this computation for all relevant points in time in the following table.

Time | Status | At risk | Survived | Cond. survival |
---|---|---|---|---|
8 | — | 9 | 9 | 100 % |
12 | — | 8 | 8 | 100 % |
12 | — | 8 | 8 | 100 % |
13 | X | 6 | 5 | 83 % |
17 | — | 5 | 5 | 100 % |
29 | X | 4 | 3 | 75 % |
40 | — | 3 | 3 | 100 % |
59 | X | 2 | 1 | 50 % |
70 | X | 1 | 0 | 0 % |

As you can probably figure out, the *at risk* column tells us how many
customers made it to that time, and then *survived* tells us how many we
estimate will continue past that time. The *conditional survival* column is the
proportion that survived out of the number that were at risk.
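The three columns can be computed mechanically. The following is a sketch under the definitions just given; the `(months, cancelled)` pair layout and the variable names are illustrative choices.

```python
# (months with us, has cancelled) for each of the nine customers.
customers = [
    (8, False), (12, False), (12, False), (13, True), (17, False),
    (29, True), (40, False), (59, True), (70, True),
]

rows = []
for months, cancelled in customers:
    # At risk: everyone who has been with us at least this long.
    at_risk = sum(1 for m, _ in customers if m >= months)
    # Cancellations observed at exactly this time.
    events = sum(1 for m, c in customers if m == months and c)
    survived = at_risk - events
    rows.append((months, cancelled, at_risk, survived, survived / at_risk))

for months, cancelled, at_risk, survived, cond in rows:
    status = "X" if cancelled else "-"
    print(f"{months:>2} | {status} | {at_risk} | {survived} | {cond:.0%}")
```

Active customers contribute no cancellation at their own time, which is why their rows always show 100 %.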

This is already useful, but the Kaplan–Meier estimation requires one more step:
we multiply the conditional survival probabilities with each other, giving us
the *cumulative survival*. The final table looks like this:

Time | Status | Cond. survival | Cum. survival |
---|---|---|---|
8 | — | 100 % | 100 % |
12 | — | 100 % | 100 % |
12 | — | 100 % | 100 % |
13 | X | 83 % | 83 % |
17 | — | 100 % | 83 % |
29 | X | 75 % | 62 % |
40 | — | 100 % | 62 % |
59 | X | 50 % | 31 % |
70 | X | 0 % | 0 % |

To see how the cumulative survival is computed, we use 29 months as an example: the previous row in the table is 17 months, where the cumulative survival was 83 %. Of the customers that reach 29 months, we estimate that 75 % continue, so the cumulative survival at 29 months is 75 % of 83 %, which is 62 %.
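The same computation can be written as a formula. The notation is introduced here for illustration and does not appear in the tables: $t_i$ are the times at which a cancellation was observed, $n_i$ is the number of customers at risk at $t_i$, and $d_i$ is the number of cancellations at $t_i$. Rows without a cancellation contribute a factor of 1 and drop out of the product.

```latex
\hat{S}(t) = \prod_{i \,:\, t_i \le t} \frac{n_i - d_i}{n_i}
\qquad\text{so that}\qquad
\hat{S}(29) = \frac{6 - 1}{6} \cdot \frac{4 - 1}{4}
            = \frac{5}{6} \cdot \frac{3}{4}
            = \frac{5}{8} = 62.5\,\%
```

The table shows this value as 62 %.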

Now we have a proper estimation of customer retention! We can translate this into percentile estimations by linear interpolation, and compare to the naïve methods we tried first.

Method | Minimum | 25 % | 50 % | 75 % | Maximum |
---|---|---|---|---|---|
All data | 8 | 12 | 17 | 40 | 70 |
Only cancelled | 13 | 25 | 44 | 62 | 70 |
Kaplan–Meier | 12 | 21 | 47 | 61 | 70 |

Note how this is different from our previous attempts: one of our naïve methods underestimated customer retention, and the other overestimated it.
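The Kaplan–Meier percentiles can be sketched in Python as follows. The exact interpolation procedure is not spelled out above, so this is one reading of it, chosen because it agrees with the table: interpolate linearly between consecutive recorded times (including times where nobody cancelled) and truncate to whole months. The function name `months_at_survival` and the data layout are illustrative.

```python
from fractions import Fraction

# (months with us, has cancelled) for each of the nine customers.
customers = [
    (8, False), (12, False), (12, False), (13, True), (17, False),
    (29, True), (40, False), (59, True), (70, True),
]

# Cumulative survival at each distinct recorded time, in exact fractions.
# One factor per distinct time, so tied observations are counted once.
curve = []
cumulative = Fraction(1)
for t in sorted({m for m, _ in customers}):
    at_risk = sum(1 for m, _ in customers if m >= t)
    events = sum(1 for m, c in customers if m == t and c)
    cumulative *= Fraction(at_risk - events, at_risk)
    curve.append((t, cumulative))

def months_at_survival(target):
    """Linearly interpolate the time at which the curve falls to `target`."""
    for (t0, s0), (t1, s1) in zip(curve, curve[1:]):
        if s0 > target >= s1:
            return t0 + (s0 - target) / (s0 - s1) * (t1 - t0)
    raise ValueError("curve never reaches the target survival")

# "p % of customers have cancelled" means cumulative survival is 1 - p.
percentiles = {p: months_at_survival(1 - Fraction(p, 100)) for p in (25, 50, 75)}
for p, months in percentiles.items():
    print(f"{p} %: {float(months):.1f} months")
# 25 %: 21.8 months
# 50 %: 47.6 months
# 75 %: 61.2 months
```

Truncated to whole months, these are the 21, 47 and 61 in the Kaplan–Meier row.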

## Caveat

In practice, I don’t recommend doing these calculations on your own. The devil’s
in the details; for example, we need to handle tied observations correctly.
Fortunately, there are software packages that can do the heavy lifting for us,
like the *survival* package for R or the *lifelines* library for Python. If
you want to learn more about the details, there are a few books on survival
analysis. I like *Applied Survival Analysis* (Hosmer, Lemeshow, May; Wiley;
2008) for its focus on practice.

# Where To Go Next

This is really just the start – what happens next is where it gets truly interesting. Unfortunately, this is also where this article ends. As inspiration, here are directions where this can be taken:

- We can plot the cumulative survival from the table above. The resulting curve gives a very visual sense of how long customers stay on. (In fact, this is usually how the Kaplan–Meier estimation is presented.)
- We can construct different Kaplan–Meier estimations for different groups of customers, based e.g. on what type of service they are receiving, how much they pay, where they are located, how often we have meetings with them, and so on, to see what effects those things have on customer retention.
- We can use bootstrap resampling to get confidence bounds on the Kaplan–Meier estimations. This is important if we want to make decisions based on differences between groups of customers: are the differences real, or just statistical noise?
- There are other techniques we can borrow from survival analysis, like Cox proportional hazards models that, when appropriate, can give us a numeric estimation on the difference between groups of customers. (Such as “customers with service A are 30 % less likely to cancel than customers with service B.”)
- We have measured customer retention in months, but if we’re a bit creative, we don’t have to measure retention in units of time. If we measure instead how much revenue we get from a customer before they cancel, we have essentially constructed an estimation of customer lifetime value, that accounts for censoring – this is more than most have done.
- We’ve used survival analysis for customer retention here, because customer
  retention is so important. But, as you can guess from the alternative name
  *time-to-event* analysis, it can be used any time you want to know how long it takes for something to happen, when it happens rarely enough that you can’t wait for it to happen to all subjects (i.e. you want to analyse censored data).
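As one illustration of these directions, here is a rough sketch of the bootstrap idea applied to the nine customers from earlier. Everything in it is an assumption made for the example: the 24-month horizon, the 90 % level, the 2000 resamples, the fixed seed, and the use of the simple percentile bootstrap, which is only one of several ways to build such an interval.

```python
import random

# (months with us, has cancelled) for each of the nine customers.
customers = [
    (8, False), (12, False), (12, False), (13, True), (17, False),
    (29, True), (40, False), (59, True), (70, True),
]

def km_survival_at(sample, t):
    """Kaplan–Meier estimate of the probability of staying beyond t months."""
    survival = 1.0
    event_times = sorted({m for m, cancelled in sample if cancelled and m <= t})
    for time in event_times:
        at_risk = sum(1 for m, _ in sample if m >= time)
        events = sum(1 for m, c in sample if m == time and c)
        survival *= (at_risk - events) / at_risk
    return survival

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Resample customers with replacement and re-estimate each time.
estimates = sorted(
    km_survival_at(rng.choices(customers, k=len(customers)), 24)
    for _ in range(2000)
)

# Percentile bootstrap: the middle 90 % of the resampled estimates.
n = len(estimates)
lower = estimates[n * 5 // 100]
upper = estimates[n * 95 // 100 - 1]
point = km_survival_at(customers, 24)
print(f"24-month survival: {point:.2f}, 90 % interval {lower:.2f} to {upper:.2f}")
```

With only nine customers, expect the interval to be wide; the point of the sketch is the mechanics, not the numbers.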