Verifiable Software Development Estimations
I have long wanted to write this article but I never knew how to structure it. I still don’t, but I’ll give it a shot anyway. Like in the past few articles, there’s some probability theory underlying this article, but I will gloss over it entirely to make it approachable for practically-minded people.
I have heard that I need to practice BLUF (bottom line up front), i.e. summarising my thoughts before I ask readers to embark on a longer text. Here is the main gist of this article:
- When you estimate, estimate a calendar date, not a relative measure.
- Always attach a probability to the calendar date, indicating your confidence that the task will be done earlier than that date.
- Continuously evaluate how accurate your estimations are, in light of the confidence you assigned. Improve your estimation technique to become more accurate over time.
- If your organisation is receptive to it, estimate a range of completion dates instead of just an upper bound.
- Don’t estimate task effort (time to completion) – instead, estimate the value of the task.
I do recommend you read on to understand what each of the above means, though.
Improving at Estimating
In order to improve at anything, we need to know how well we are doing. Software development estimation is really a form of forecasting, and in forecasting, “knowing how well we are doing” is known as verifiability. We need to make verifiable forecasts in order to know how good we are at making them. We need to be able to state, after the fact, whether our forecast was definitively correct or incorrect.
This might sound impossible! You may think of estimation as an activity where we predict that a task will take one week, and then it ends up taking anywhere between two days and five weeks. It will never take exactly one week down to the hour. So when can we say that our forecast was definitively correct? Under this model, we can’t.
The way most people do software estimation is not verifiable!
This means it’s difficult to improve using that method.
There is a way to make verifiable forecasts, and it’s the way meteorologists do it behind the scenes: forecasts must be probabilistic. (You may never have heard of probabilistic weather forecasts, but ECMWF, for example, offers them to the public on their website under the name meteograms; they also have one for precipitation type.) Instead of saying that a task will take one week, we can say that there’s a 50 % chance it will be done later than March 15. We pick a date where we truly feel there’s a 50 % chance the task is done sooner than that, and a 50 % chance it’s done later.
It sounds like a small change, but it contains two huge conceptual shifts: our estimations are now (a) relevant (business cares about calendar dates, not story points or any other relative measure), and (b) verifiable. Now we can say definitively whether the task was done earlier or later than the estimated date.
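One practical way to pick such a date – assuming you keep a log of how long similar tasks actually took – is to read the estimate off as a percentile of past durations. A minimal sketch in Python; the task history and start date below are made up:

```python
import math
from datetime import date, timedelta
from statistics import quantiles

def estimated_date(start: date, past_durations_days: list[int], confidence: float) -> date:
    """The date we are `confidence` sure the task finishes before,
    read off as a percentile of past durations of similar tasks."""
    cuts = quantiles(past_durations_days, n=100)  # 99 percentile cut points
    days = cuts[round(confidence * 100) - 1]
    # Round up, since we promise the task is done *before* this date.
    return start + timedelta(days=math.ceil(days))

# Durations (in days) of ten similar past tasks -- made-up numbers.
history = [4, 6, 7, 8, 9, 11, 13, 16, 22, 35]
print(estimated_date(date(2024, 3, 1), history, 0.50))  # 50 % estimate
print(estimated_date(date(2024, 3, 1), history, 0.90))  # 90 % estimate
```

With more history the percentiles stabilise; with only a handful of data points, treat the output as a starting point for judgment, not a verdict.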
Once we’ve made a few of these forecasts, we can also evaluate how accurate we are. If we look back at our past record of estimations, then by definition we should find that roughly half the tasks were done before the estimated date, and the other half after it. (How roughly? That takes some statistical theory I won’t go through now; rest assured that your gut feeling is probably good enough.) This is objective fact, which is what makes it useful for learning to issue better estimations.
If our tasks finish earlier more often than not, we are making estimations that are too conservative. If our tasks finish later more often than not, then our estimations are too ambitious. When we find ourselves in this situation, we shouldn’t just blindly adjust future estimations by some amount, but instead try to find out why we are making biased estimations. This can happen due to external pressure, of course, but just as likely culprits are internal mental biases, like anchoring and the availability heuristic. There are techniques to counteract these.
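Checking for this bias is a one-liner once estimated and actual completion dates are recorded. A sketch, with a hypothetical track record:

```python
from datetime import date

def calibration_report(estimations: list[tuple[date, date]]) -> float:
    """Given (estimated_date, actual_date) pairs for 50 % estimations,
    return the fraction of tasks finished on or before the estimate.
    For calibrated 50 % estimations this should hover around 0.5."""
    early = sum(1 for est, actual in estimations if actual <= est)
    return early / len(estimations)

# Hypothetical track record of four 50 % estimations.
record = [
    (date(2024, 3, 15), date(2024, 3, 10)),  # early
    (date(2024, 4, 1),  date(2024, 4, 20)),  # late
    (date(2024, 5, 1),  date(2024, 4, 28)),  # early
    (date(2024, 6, 1),  date(2024, 6, 2)),   # late
]
print(calibration_report(record))  # 0.5
```

A result well above 0.5 suggests conservative estimations; well below, ambitious ones.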
When we have compared our estimations to reality and made appropriate adjustments to our technique, so that we are reasonably sure of being (probabilistically) accurate, we say that we are making calibrated estimations.
Even if we make dead-accurate 50 % estimations, people around us will not be happy. I think you can guess why: they will find it hard to know what to make of coin-flip estimations, where the task is just as likely to be late as early. The reason is that the cost of something being late is usually much greater than the benefit of it being early. In planning, we are generally more sensitive to tail risk.
We can deal with this with a small change of procedure: make 90 % estimations instead, i.e. estimate the date by which we are 90 % certain the task will be done. When we have made 50 of these estimations, roughly 45 of the corresponding tasks should have been early, and five late. If not, we need to improve our technique. (If you’re in an industry where 10 % late projects is not acceptable, pick a larger percentile, like 95 % or 98 %. If you’re in an industry where people expect projects to be late, you may even be able to stick to 50 % estimations – just keep verifying that you’re accurate!)
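How many late tasks are “too many”? The binomial distribution gives a quick answer: compute how likely the observed late count would be if the true late rate really were 10 %. A stdlib-only sketch:

```python
from math import comb

def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing k or
    more late tasks out of n if the true late rate really is p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Out of 50 estimations made at 90 % confidence, how surprising are
# various late counts, assuming a true 10 % late rate?
for late in (5, 8, 12):
    print(late, "late:", round(prob_at_least(late, 50, 0.10), 3))
```

Five late tasks out of 50 is entirely unremarkable; twelve would be strong evidence that the estimations are not really at the 90 % level.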
Verifiability helps with learning, but it also helps with telling off our manager. When we make 90 % estimations, these are naturally quite conservative. We are guaranteed to come across someone who will go, “Yeeeah, that doesn’t really work for our customer. Is there nothing you can do to make it faster?”
When our estimations are verifiable and we have adjusted our technique appropriately, it’s much easier to confidently say,
I have a solid track record of calibrated estimations. The date I have given you is the true date for having only a 10 % risk of being late. If you want to take a greater risk of being late, we can discuss that. But if you want to avoid being late, there is nothing I can do. The date I have given you stands.
I can give you a different date if you want – the calendar is full of them, just pick one you like the sound of. That date will be completely meaningless. The date I’ve already given you is usable in spreadsheets and other economic and risk calculations.
So, how would you like it? Do you want the true date or a meaningless one?
The type of wishful thinking where someone requests a sooner date based on nothing meaningful at all is very common, and one we will butt heads with particularly frequently when we make calibrated estimations. (If you’re in a more competitive environment, an alternative response is “Wanna bet?” You can convert a probabilistic estimation into odds you’re willing to offer or accept on the project being late. In the short run you will lose money to people who suggest the project will be done sooner – because your estimation is conservative – but in the long run you’ll earn it back on the big 10× payoffs you get when the project does run late. And sometimes, just mentioning that 10× penalty for a late project will make people realise that they’re not willing to take the bet, because secretly they know you are right.)
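The conversion from probability to odds is simple arithmetic: a 10 % risk of lateness corresponds to fair odds of 9:1 against, which is roughly the 10× payoff mentioned above once the returned stake is counted. A tiny sketch:

```python
def fair_odds(p_late: float) -> float:
    """Convert a probability of being late into the fair payoff
    multiplier (profit per unit staked) for a bet on lateness."""
    return (1 - p_late) / p_late

# A 90 % estimation implies a 10 % chance of being late, so a fair
# bet on lateness pays 9x the stake in profit (10x including stake).
print(round(fair_odds(0.10), 2))  # 9.0
```

Offering slightly worse odds than fair is how you make the bet profitable for yourself in the long run.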
One issue with the estimations we’ve given already is that they don’t convey the uncertainty level of the task.
- If you’re very familiar with a task, having performed variations of it multiple times in the past, you may know that it takes very close to two months, maybe plus or minus a couple of weeks.
- If there’s a completely foreign task you’re being asked about, you might think it may take anywhere between one week and six months, but you’re 90 % confident it will take less than two months.
Both of these tasks have the same 90 % upper-bound estimation – two months – yet in important ways they are very different from each other.
A convenient way to indicate this difference is to estimate a 90 % range – i.e. both a lower and an upper bound, rather than just the 90 % upper bound. So instead of giving the date we are 90 % certain it will be done before, we give two dates: one date that has only a 5 % probability of seeing the task done, and another that has a 95 % probability of seeing the task done. In effect, we have shifted the 90 % we care about from the range of 0–90 % into the range of 5–95 %. (Again, adjust the specific probabilities based on the expectations of your industry.)
The familiar task may be estimated as, “I am 90 % certain it will be done between March 14 and April 7”, whereas the novel one would elicit, “I am 90 % certain it will be done between the start of February and the end of June.” (Once our estimated range is very large, the specific dates matter less than the general time of year. In fact, specific dates for large ranges may invite people to think our estimations are more confident than the 90 % they represent.)
The reason this doesn’t always work is that few organisations are receptive to getting two estimations for one task. They have (mental or technical) processes that deal with just one estimation per task, and they don’t really know what to do with two. (Here’s one thing you can do: use both numbers to construct a log-normal distribution of completion dates, then run simulations that include dependencies between projects – you can learn a heck of a lot about how your parameters affect the outcome.) Feel free to give just the upper bound to your organisation, but secretly keep track of the full range yourself. That way, when you come across an organisation that appreciates the full range, you will already have trained and calibrated yourself in providing it.
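The simulation idea in the aside above can be sketched in a few lines: fit a log-normal distribution so its 5th and 95th percentiles match each task’s estimated range, then sample total durations for tasks that depend on each other by running in sequence. The ranges below are made up:

```python
import math
import random

Z95 = 1.6449  # 95th percentile of the standard normal distribution

def lognormal_params(p5: float, p95: float) -> tuple[float, float]:
    """Fit log-normal mu/sigma so the 5th and 95th percentiles of
    task duration (in days) match the estimated range."""
    mu = (math.log(p5) + math.log(p95)) / 2
    sigma = (math.log(p95) - math.log(p5)) / (2 * Z95)
    return mu, sigma

def simulate_project(task_ranges: list[tuple[float, float]], runs: int = 10_000) -> list[float]:
    """Simulated total durations for tasks that run in sequence."""
    totals = []
    for _ in range(runs):
        total = 0.0
        for p5, p95 in task_ranges:
            mu, sigma = lognormal_params(p5, p95)
            total += random.lognormvariate(mu, sigma)
        totals.append(total)
    return sorted(totals)

# Two sequential tasks: a familiar one (18-35 days) and a novel
# one (7-120 days) -- made-up ranges.
totals = simulate_project([(18, 35), (7, 120)])
print("median total:", round(totals[len(totals) // 2], 1), "days")
print("90th percentile:", round(totals[int(len(totals) * 0.9)], 1), "days")
```

Notice how the novel task’s long right tail dominates the project’s overall risk, even though its median contribution is modest.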
Estimate Value, not Effort
Now that you know how to estimate when a task will be done, I will pull the rug out from under your feet: it’s a waste of time to estimate when a task will be done. Small tasks (e.g. bug fixes) almost always take somewhere between a day and a couple of weeks. Large undertakings (e.g. new features) almost always take somewhere between a week and a few months. (Anything that would take longer than a few months should be approached piecemeal and not committed to fully.) So for a given type of task, the longest possible completion time is about 20× the shortest. This is a small range of variation. For internal planning, you can assume that any task is roughly the same as any other task of the same size: a small task will take a few days, a large task a couple of months, and on average it will work out. Estimating the effort of every single thing only adds unnecessary stress. Only tasks that involve coordination with external collaborators need to be estimated individually.
What’s interesting is that the value a customer (internal or external) gets out of a task can vary hugely. Some small tasks only save the organisation $20, whereas others can easily save $30,000. Some large tasks can earn the organisation $2,000 in their first year, whereas others earn $1,800,000 over the same period. This is a range of variation exceeding 800×.
So what really makes a difference for prioritisation is not how quickly something can be done, but how much value the organisation gets out of it. That is what should be estimated, not how long it will take. It is really important that you estimate the value of all tasks you plan on doing, lest you accidentally get stuck doing $100 tasks when you could be doing $5,000 tasks.
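For a backlog, this can be as mundane as attaching a value range to each task and sorting. A sketch with a hypothetical backlog; the geometric mean is one crude way to collapse a wide range into a point estimate, and suits quantities that span orders of magnitude:

```python
# Hypothetical backlog: (task, low and high value estimates in dollars).
backlog = [
    ("tweak report layout", 20, 500),
    ("automate deployment", 5_000, 60_000),
    ("fix checkout bug", 10_000, 200_000),
    ("rename internal module", 10, 100),
]

def point_value(low: float, high: float) -> float:
    """Geometric mean of the estimated value range."""
    return (low * high) ** 0.5

# Highest estimated value first.
for name, low, high in sorted(backlog, key=lambda t: -point_value(t[1], t[2])):
    print(f"{name}: ~${point_value(low, high):,.0f}")
```

The value estimates themselves should be verified the same way as the date estimates: record them, compare against realised value where you can, and adjust.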
I cannot emphasise this enough. Only once you’re really good at estimating value does it make sense to start estimating effort. There is a large competitive advantage to be had here. (The obvious counter is “I already know what’s valuable!” Well, have you made verifiable estimations? Have you confirmed objectively that your sense of what’s valuable is correct? You might be surprised.)