Survival Analysis For Customer Retention

kqr, published 2021-02-02

Customer retention matters. Acquiring new customers costs money, usually more than keeping the customers we already have. Thus, it’s interesting to know how long our customers stick around.

For this, it’s tempting to use retention data from all customers. Let’s say we have had 9 customers, past and current, and they have been with us for this many months:

8	12	12	13	17	29	40	59	70

They are ordered by size to make it easier to find some summary statistics. We have the following percentile estimations:

Minimum	25 %	50 %	75 %	Maximum
8	12	17	40	70

We can read this as saying “a customer will stay with us for at least 8 months, and at most 70 months. Half of the customers we get will cancel within 17 months.”
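The percentile table can be reproduced in a few lines. This is a minimal sketch using only Python’s standard library; the variable names are illustrative, and method="inclusive" is chosen here on the assumption that the percentiles are meant to be linearly interpolated between the sorted observations, which is what reproduces the figures above.

```python
import statistics

# Months with us for all nine customers, active and cancelled alike.
months = [8, 12, 12, 13, 17, 29, 40, 59, 70]

# The "inclusive" method interpolates linearly between sorted observations.
q25, q50, q75 = statistics.quantiles(months, n=4, method="inclusive")

summary = [min(months), q25, q50, q75, max(months)]
print(summary)
```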

We could use these numbers in our business calculations. So are we done? Not quite.

Bias Both Ways

The problem is that any analysis done based on these numbers assumes that all our current customers will cancel their contracts by the end of this month. Why? Because it treats an active customer that has been with us 8 months so far the same as one that canceled their contract after 8 months.

Let’s look at the data again, now with indications for which customers have canceled (X) and which are active (—).

8	12	12	13	17	29	40	59	70
—	—	—	X	—	X	—	X	X

Things look a bit different now:

Whereas previously we thought the shortest time someone had been a customer was 8 months, now all we really know is that it’s at least 8 months.
Whereas we previously thought 25 % of our customers cancel within 12 months, we now know that the first actual cancellation happened at 13 months!

This is a general theme: we don’t know when our current customers will cancel, but we know they will stay on as customers for at least as long as they have currently been customers – maybe longer. In other words, ignoring which customers have cancelled leads to retention data that is biased downward. It leads to underestimating customer retention.

Confronted with this, we might be tempted to make the opposite mistake: only use data from customers who have actually cancelled. This seriously reduces the amount of data we can analyse (in our example, down to just four data points). It can also suffer from another problem: if we have a new batch of recently acquired customers that have not cancelled yet, we fail to account for the fact that they might not stay on as long as our earlier customers.

If we used only the data from cancelled customers anyway, we would get the following summary statistics:

Minimum	25 %	50 %	75 %	Maximum
13	25	44	62	70
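As a rough sketch of this second naïve approach, we can attach a cancellation flag to each customer and throw away everyone who is still active. The tuple layout and variable names are assumptions made for illustration; the percentile method is the same linear interpolation as before.

```python
import statistics

# (months with us, has cancelled) for each customer, from the table above.
customers = [
    (8, False), (12, False), (12, False), (13, True), (17, False),
    (29, True), (40, False), (59, True), (70, True),
]

# Keep only customers whose cancellation we have actually observed.
cancelled = sorted(months for months, has_cancelled in customers if has_cancelled)

q25, q50, q75 = statistics.quantiles(cancelled, n=4, method="inclusive")
summary = [min(cancelled), q25, q50, q75, max(cancelled)]
print([round(value) for value in summary])
```

Only four of the nine observations survive the filter, which is the loss of data described above.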

Survival Analysis

This is similar to a problem in medical research: patients die, but not all of them.

Medical (and epidemiological) research is sometimes done by recruiting patients who are at risk for something (maybe the ones who have been hospitalised for some serious disease), and then following up on them every month. Some of them die, and then we know in which month they died. At some point, the study ends, and then some patients are still alive. At that point, we know they have been alive for so-and-so many months, but we don’t know when they will die.

Do you see how that leads to the same sort of data analysis problem we’re having?

Formally, we say that the survival times (or customer retention times) are right-censored: we know the true value for the patients who have actually died, but for the patients who are still alive we only know that their survival time is greater than the time they have been with us so far.
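To make right-censoring concrete, here is one way the raw records might be turned into the two values survival analysis needs: an observed duration and a flag saying whether the event was actually seen. The Customer class, its fields, and the observe function are hypothetical names invented for this sketch, and months are assumed to be counted as plain integers.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Customer:
    signup_month: int                    # month index when the contract started
    cancel_month: Optional[int] = None   # None means the customer is still active


def observe(customer: Customer, current_month: int) -> Tuple[int, bool]:
    """Return (months observed, cancellation observed) as of current_month."""
    if customer.cancel_month is not None and customer.cancel_month <= current_month:
        return customer.cancel_month - customer.signup_month, True
    # Right-censored: the true retention time is at least this long.
    return current_month - customer.signup_month, False


print(observe(Customer(signup_month=0, cancel_month=13), current_month=70))  # (13, True)
print(observe(Customer(signup_month=62), current_month=70))                  # (8, False)
```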

There’s a branch of statistics called survival analysis that deals with data of this kind. It’s named after the patients dying, but sometimes it’s also called time-to-event analysis because death isn’t the only significant event. In particular, it’s the right tool to analyse customer retention data.

The Kaplan–Meier Estimator

One of the most easily understandable tools of survival analysis is the Kaplan–Meier estimator. Given data on how long a customer has been with us, and whether or not they have canceled, it estimates exactly the thing we wanted to know: how long are customers with us?

To figure that out, we start by estimating the probability of customers staying with us at various points in time. Let’s do two quick examples, to make it more concrete. This is the same table as the one we saw before, repeated for convenience.

8	12	12	13	17	29	40	59	70
—	—	—	X	—	X	—	X	X

We have 4 customers that have made it to 29 months, of which one cancelled at that time. Therefore, we estimate that of the customers that make it to 29 months, 75 % will make it further, and 25 % will cancel – based on historic experience.
We have 3 customers that have stayed with us until 40 months (namely those at 40, 59, and 70 months). We know two of them stayed with us longer than 40 months, and we assume the third one will also stay longer, since they haven’t cancelled yet. Thus, we estimate that of the customers that make it to 40 months, 100 % will make it further, and 0 % will cancel.

We have made this computation for all relevant points in time in the following table.

Time	Status	At risk	Survived	Cond. survival
8	—	9	9	100 %
12	—	8	8	100 %
12	—	8	8	100 %
13	X	6	5	83 %
17	—	5	5	100 %
29	X	4	3	75 %
40	—	3	3	100 %
59	X	2	1	50 %
70	X	1	0	0 %

As you can probably figure out, the at risk column tells us how many customers made it to that time, and then survived tells us how many we estimate will continue past that time. The conditional survival column is the proportion that survived out of the number that were at risk.
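The columns of that table can be computed mechanically. The following is a minimal sketch in plain Python, with function and variable names chosen for illustration. It produces one row per distinct time, so the two customers tied at 12 months share a single row rather than appearing twice.

```python
# Months with us and whether the cancellation was observed (True = X).
durations = [8, 12, 12, 13, 17, 29, 40, 59, 70]
cancelled = [False, False, False, True, False, True, False, True, True]


def risk_table(durations, cancelled):
    """One row per distinct time: (time, at risk, survived, cond. survival)."""
    rows = []
    for t in sorted(set(durations)):
        # Everyone who has been with us for at least t months is at risk at t.
        at_risk = sum(1 for d in durations if d >= t)
        # Only observed cancellations at exactly t count as events.
        events = sum(1 for d, c in zip(durations, cancelled) if d == t and c)
        survived = at_risk - events
        rows.append((t, at_risk, survived, survived / at_risk))
    return rows


table = risk_table(durations, cancelled)
for t, at_risk, survived, cond in table:
    print(f"{t:>4} {at_risk:>3} {survived:>3} {cond:>5.0%}")
```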

This is already useful, but the Kaplan–Meier estimation requires one more step: we multiply the conditional survival probabilities with each other, giving us the cumulative survival. The final table looks like this:

Time	Status	Cond. survival	Cum. survival
8	—	100 %	100 %
12	—	100 %	100 %
12	—	100 %	100 %
13	X	83 %	83 %
17	—	100 %	83 %
29	X	75 %	62 %
40	—	100 %	62 %
59	X	50 %	31 %
70	X	0 %	0 %

To see how the cumulative survival is computed, we use 29 months as an example: the last recorded time we have information for is 17 months, where 83 % of the original 9 had survived. At 29 months, we know only 75 % survive, so the cumulative survival at 29 months is 75 % of 83 %, which is 62 %.
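The running product is short enough to write out. This is a sketch of the estimator as described in this article, not a replacement for a tested library; the function name kaplan_meier and the list-of-tuples return shape are choices made here for illustration.

```python
# Months with us and whether the cancellation was observed (True = X).
durations = [8, 12, 12, 13, 17, 29, 40, 59, 70]
cancelled = [False, False, False, True, False, True, False, True, True]


def kaplan_meier(durations, cancelled):
    """Return [(time, cumulative survival)] for each distinct observed time."""
    curve = []
    cumulative = 1.0
    for t in sorted(set(durations)):
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, c in zip(durations, cancelled) if d == t and c)
        # Multiply in the conditional survival at this time.
        cumulative *= (at_risk - events) / at_risk
        curve.append((t, cumulative))
    return curve


curve = kaplan_meier(durations, cancelled)
for t, survival in curve:
    print(f"{t:>4} {survival:>5.0%}")
```

Times where nobody cancels contribute a factor of 1, which is why the cumulative survival only changes at the rows marked X.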

Now we have a proper estimation of customer retention! We can translate this into percentile estimations by linear interpolation, and compare to the naïve methods we tried first.

Method	Minimum	25 %	50 %	75 %	Maximum
All data	8	12	17	40	70
Only cancelled	13	25	44	62	70
Kaplan–Meier	12	21	47	61	70

Note how this is different from our previous attempts: one of our naïve methods underestimates customer retention, and the other overestimates it.
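The exact interpolation scheme behind the Kaplan–Meier row is not spelled out here, so the following is only one plausible reading: treat the cumulative survival table as a piecewise linear curve and find the month where it crosses the target level. Under that assumption the 25 %, 50 % and 75 % points come out at 21.8, 47.6 and 61.2 months, which agrees with the table once the fractional part is dropped. The function name and the hard-coded curve are illustrative.

```python
# Kaplan-Meier curve from the table above as (months, cumulative survival),
# using the exact fractions behind the rounded percentages.
curve = [(8, 1.0), (12, 1.0), (13, 5 / 6), (17, 5 / 6),
         (29, 5 / 8), (40, 5 / 8), (59, 5 / 16), (70, 0.0)]


def months_until_cancelled(curve, fraction):
    """Months by which `fraction` of customers are estimated to have cancelled."""
    target = 1.0 - fraction
    previous_time, previous_survival = curve[0]
    for time, survival in curve:
        if survival <= target:
            if survival == previous_survival:
                return float(time)
            # Interpolate linearly between the two surrounding points.
            share = (previous_survival - target) / (previous_survival - survival)
            return previous_time + share * (time - previous_time)
        previous_time, previous_survival = time, survival
    return None  # the curve never drops that far


for fraction in (0.25, 0.50, 0.75):
    print(fraction, round(months_until_cancelled(curve, fraction), 1))
```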

Caveat

In practice, I don’t recommend doing these calculations on your own. The devil’s in the details; for example, we need to handle tied observations correctly. Fortunately, there are software packages that can do the heavy lifting for us, like the survival package for R or the lifelines library for Python. If you want to learn more about the details, there are a few books on survival analysis. I like Applied Survival Analysis (Hosmer, Lemeshow, May; Wiley; 2008) for its focus on practice.

Where To Go Next

This is really just the start – what happens next is where it gets truly interesting. Unfortunately, this is also where this article ends. As inspiration, here are directions where this can be taken:

We can plot the cumulative survival from the table above. The resulting curve gives a very visual sense of how long customers stay on. (In fact, this is usually how the Kaplan–Meier estimation is presented.)
We can construct different Kaplan–Meier estimations for different groups of customers, based e.g. on what type of service they are receiving, how much they pay, where they are located, how often we have meetings with them, and so on, to see what effects those things have on customer retention.
We can use bootstrap resampling to get confidence bounds on the Kaplan–Meier estimations. This is important if we want to make decisions based on differences between groups of customers: are the differences real, or just statistical noise?
There are other techniques we can borrow from survival analysis, like Cox proportional hazards models that, when appropriate, can give us a numeric estimation on the difference between groups of customers. (Such as “customers with service A are 30 % less likely to cancel than customers with service B.”)
We have measured customer retention in months, but if we’re a bit creative, we don’t have to measure retention in units of time. If we measure instead how much revenue we get from a customer before they cancel, we have essentially constructed an estimation of customer lifetime value, that accounts for censoring – this is more than most have done.
We’ve used survival analysis for customer retention here, because customer retention is so important. But, as you can guess from the alternative name time-to-event analysis, it can really be used any time you want to know how long it takes for something to happen, if it happens rarely enough that you can’t wait for it to happen to all subjects (i.e. you want to analyse censored data).
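As a taste of the bootstrap idea from the list above, here is a rough sketch that resamples customers with replacement and re-estimates the survival at 29 months each time. The choice of a percentile bootstrap, the 95 % level, the 2000 resamples, and the fixed seed are all assumptions made for this example, and the function names are illustrative.

```python
import random

# Months with us and whether the cancellation was observed (True = X).
durations = [8, 12, 12, 13, 17, 29, 40, 59, 70]
cancelled = [False, False, False, True, False, True, False, True, True]


def survival_at(durations, cancelled, month):
    """Kaplan-Meier estimate of the share of customers still with us after `month`."""
    cumulative = 1.0
    event_times = sorted({d for d, c in zip(durations, cancelled) if c and d <= month})
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, c in zip(durations, cancelled) if d == t and c)
        cumulative *= (at_risk - events) / at_risk
    return cumulative


def bootstrap_interval(durations, cancelled, month, resamples=2000, seed=0):
    """Percentile bootstrap: resample customers with replacement, re-estimate."""
    rng = random.Random(seed)
    customers = list(zip(durations, cancelled))
    estimates = []
    for _ in range(resamples):
        sample = [rng.choice(customers) for _ in customers]
        sample_durations, sample_cancelled = zip(*sample)
        estimates.append(survival_at(sample_durations, sample_cancelled, month))
    estimates.sort()
    # Cut off 2.5 % of the estimates at each end for a 95 % interval.
    lower = estimates[int(0.025 * resamples)]
    upper = estimates[int(0.975 * resamples) - 1]
    return lower, upper


point = survival_at(durations, cancelled, 29)
lower, upper = bootstrap_interval(durations, cancelled, 29)
print(point, lower, upper)
```

With only nine customers the interval will be wide, which is itself useful to know before acting on a difference between groups.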

Two Wrongs