# Precel: Like Excel but for Uncertain Values

**Task effort estimation**: Imagine you have two programming tasks, and you have
estimated the first will take somewhere between 10 and 40 days to complete (at
90 % confidence), while the other will take between 5 and 25 days. How long will
it take to do both tasks in sequence? For this and the questions below, assume
that your estimations are calibrated and that the uncertain values are
uncorrelated.

If you answered 15–65 days, you made the same mistake almost everyone makes. This mistake is even encoded as a mandatory step in some estimation methodologies – I’m looking at you, PERT. The actual answer is 20–55 days.

**Driving time estimation**: Next, imagine that you are going to move for work,
but you don’t know exactly where yet. However, you know it will be 10–30 km from
the workplace, and that the roads in between will let you drive at 50–90 km/h at
rush hour. How long will it take you to get to work?

If you answered 7–36 minutes, you made the same mistake again.^{1} If you
answered 12–20 minutes, you made another mistake – think about what the numbers
represent! The quickest trip pairs the shortest distance with the *highest*
speed, not the lowest. The actual answer is 9–28 minutes.
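To see where an answer like 9–28 comes from, here is a quick Monte Carlo check. This is a sketch using NumPy; treating each 90 % interval as a lognormal distribution is my assumption (the article does not pin down a shape), so the endpoints come out slightly different from the ones quoted above.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000
Z90 = 1.6449  # standard normal z-score for the 95th percentile

def sample_interval(lo, hi):
    # Interpret lo-hi as the central 90 % interval of a lognormal
    # distribution. The lognormal shape is an assumption; any
    # right-skewed positive distribution behaves similarly.
    mu = (np.log(lo) + np.log(hi)) / 2
    sigma = (np.log(hi) - np.log(lo)) / (2 * Z90)
    return rng.lognormal(mu, sigma, N)

distance_km = sample_interval(10, 30)
speed_kmh = sample_interval(50, 90)
minutes = distance_km / speed_kmh * 60
lo, hi = np.percentile(minutes, [5, 95])
print(f"{lo:.0f}-{hi:.0f} minutes")
```

The simulated interval comes out close to 9–28 minutes, and in particular narrower than the naive 7–36.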

If you still don’t see what’s happened, let’s try another variant:

**Grocery cost estimation**: You estimate your future weekly grocery purchases
will total $80–$150. How much should you estimate your yearly grocery purchases
to total? It’s not $4k–$8k – it is a narrower interval: $5.5k–$6k.
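The narrowing is easy to reproduce by simulation. A sketch with NumPy, where treating the weekly 90 % interval as lognormal is my assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
Z90 = 1.6449  # standard normal z-score for the 95th percentile

# Weekly spend with a 90 % interval of $80-$150, modelled as lognormal
# (an assumed shape). One row per simulated year, one column per week.
mu = (np.log(80) + np.log(150)) / 2
sigma = (np.log(150) - np.log(80)) / (2 * Z90)
yearly = rng.lognormal(mu, sigma, (50_000, 52)).sum(axis=1)
lo, hi = np.percentile(yearly, [5, 95])
print(f"${lo:,.0f}-${hi:,.0f} per year")
```

The 52 weekly draws mostly average out, so the yearly interval lands around $5.5k–$6.1k, far narrower than the naive $4k–$8k.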

In all cases, the naïve combination overestimates the width of the actual interval. Why?

# Uncorrelated values cancel out a little

The reason that the true interval is smaller than we instinctively think is that uncertain values, when uncorrelated, cancel each other out a little.

For the two tasks in the first example, we can think of how long they take as belonging to a very fast scenario, very slow scenario, or average. Then, very crudely speaking, these are the possible outcomes for the time it takes both tasks:

| Task B ↓ \ Task A → | 10 days | 20 days | 40 days |
|---|---|---|---|
| 5 days | 15 days | 25 days | 45 days |
| 15 days | 25 days | 35 days | 55 days |
| 25 days | 35 days | 45 days | 65 days |

To get a completion time of 15 days for the two tasks together, we need to
complete *both* tasks very quickly. Similarly, a completion time of 65 days can
only happen if we complete *both* tasks very slowly. The most likely outcome is
somewhere in the middle, which according to this table is 25–55 days. A more
computationally intensive analysis^{2} (considering more outcomes, weighting
them in favour of those close to the average, and taking the 90 % central
values of the result) gives the answer 20–55.

This is the *law of large numbers* at work: as we add more uncertain values, the
standard deviation of the sum grows slower than the sum itself (square root
versus linear), so the *relative* spread of the sum's distribution narrows.
Eventually the spread relative to the mean converges to zero.
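A quick numerical illustration of that square-root behaviour. This is a NumPy sketch; the lognormal distribution and its parameters are arbitrary choices, picked only to show the trend:

```python
import numpy as np

rng = np.random.default_rng(1)

# Relative 90 % spread of a sum of n iid uncertain values:
# the width shrinks roughly like 1/sqrt(n) relative to the mean.
rels = []
for n in [1, 4, 16, 64]:
    sums = rng.lognormal(0.0, 0.5, (50_000, n)).sum(axis=1)
    lo, hi = np.percentile(sums, [5, 95])
    rels.append((hi - lo) / sums.mean())
    print(f"n={n:2d}: relative spread {rels[-1]:.2f}")
```

Each fourfold increase in the number of terms roughly halves the relative spread.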

# Convergence helps us estimate more precisely

This effect is useful because it allows us to make more precise estimations by breaking down the uncertain value to be estimated. If I had to guess the distance between Stockholm and Gothenburg, I would give the interval of 300–800 km, which has a factor 2.7 between high and low.

However, I happen to know the time it takes to go between the cities by train is just over three hours. I don’t know the speed of the train, but I’m reasonably sure it’s in the range of 130–170 km/h. If I multiply (3.0–3.2) hours by (130–170) km/h, I get a new estimation for the distance: 400–530 km. This has just a factor 1.3 between high and low, meaning the precision of the estimation has been doubled!
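Guessing a lognormal shape for both 90 % intervals happens to reproduce that result closely. A NumPy sketch, where the lognormal shape is my assumption rather than something stated here:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 100_000
Z90 = 1.6449  # standard normal z-score for the 95th percentile

def sample_interval(lo, hi):
    # Map a 90 % interval onto a lognormal distribution (assumed shape).
    mu = (np.log(lo) + np.log(hi)) / 2
    sigma = (np.log(hi) - np.log(lo)) / (2 * Z90)
    return rng.lognormal(mu, sigma, N)

# Travel time (hours) multiplied by train speed (km/h) gives distance (km).
distance_km = sample_interval(3.0, 3.2) * sample_interval(130, 170)
lo, hi = np.percentile(distance_km, [5, 95])
print(f"{lo:.0f}-{hi:.0f} km, high/low ratio {hi / lo:.2f}")
```

The simulated interval lands around 400–530 km with a high/low ratio of roughly 1.3, matching the estimate above.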

Decomposing estimations into components that are easier to estimate is a very powerful tool, but it does require that one is able to re-combine multiple estimations into one later on.

# We need computer assistance

Unfortunately, it is difficult to combine distributions with just mental maths. The easiest way I know is through simulation: draw 5000 values from each distribution, add them pairwise, and check where the resulting distribution ends up. This code is very easy to write, but also very annoying to write fresh every time it’s needed.
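As a sketch, the whole method fits in a few lines of Python. I am assuming NumPy and a lognormal shape for the 90 % intervals (a guess on my part); the sum below is the two-task example from the start:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 5_000  # sample count suggested above
Z90 = 1.6449  # standard normal z-score for the 95th percentile

def sample_interval(lo, hi):
    # Treat lo-hi as the central 90 % interval of a lognormal
    # distribution (an assumed shape).
    mu = (np.log(lo) + np.log(hi)) / 2
    sigma = (np.log(hi) - np.log(lo)) / (2 * Z90)
    return rng.lognormal(mu, sigma, N)

# Two tasks estimated at 10-40 days and 5-25 days, done in sequence.
both_tasks = sample_interval(10, 40) + sample_interval(5, 25)
lo, hi = np.percentile(both_tasks, [5, 95])
print(f"{lo:.0f}-{hi:.0f} days")
```

With this shape assumption the result lands close to the 20–55 days quoted earlier.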

So I started writing a tool for it, called *Precel*. For the distance estimation
we looked at in the last example, run the command

```
$ ./precel '130 170 interval 3 3.2 interval multiply'
130 170 interval 3 3.2 interval multiply =
401.54 [461.41] 528.55
```

The syntax looks funky and is a little clunky, but that’s because I have
invested minimal effort into it so far. It’s postfix, so `130 170 interval`
means “a 90 % interval estimation of 130–170,” and `a b multiply` means
elementwise a × b. Check the readme for further instructions.

The tool is in very rough shape and I don’t plan on actively working on it. At
the moment it satisfies my most immediate demands, and as I need more advanced
features I will add them.^{3} Some things I suspect I will need in the future:
reading distributions from files, writing out correlations and intermediate
steps (useful for interactive use and diagnostics), and interoperation with
itself.

# Excel for uncertain values

One of my dreams is eventually writing a viable competitor to Excel that,
instead of being limited to single numbers in cells, can handle probability
distributions. There are some plugins to Excel that do this,^{4} and some
external tools, but they are either proprietary, or suck, or both. (^{4} You
can actually emulate it by encoding generated random values as a
comma-separated string, splitting and joining it, and using array operations
to calculate with it. I have done this once, and the result was the world’s
slowest spreadsheet – it took minutes to propagate changes!) I think the world
deserves a good, open source implementation. As humanity moves forward, we
will all need to be more probabilistically literate, and that’s hard when the
tools that run the world are created in complete ignorance of probabilistic
principles.

In the meantime, I have *Precel*. With the right features and performance
optimisations, it may even become the engine that powers a probabilistic
spreadsheet. Who knows?