## Quality in Software Development, Part 4: Statistics

# Normality Considered Harmful

- in processes where there exist natural boundaries or limits, such as manufacturing,
things tend to follow normal distributions
- or at least, normal-ish distributions
- what’s great about normal-ish distributions is that averages of samples drawn from them
tend towards the normal distribution (central limit theorem)
- when learning about processes, there is always a silent assumption
that numbers will follow the normal distribution
- this is so ingrained that it is never even mentioned explicitly
- you can safely assume that what you learned only applies to normal-ish numbers
- however, in other fields with no such natural limits, such as economics, things are not normally distributed
- in these fields, things follow heavy-tailed distributions
- this includes software engineering
- almost every number of interest in software engineering is theoretically unbounded: file size, web traffic intensity, latency, what have you
- these follow heavy-tailed distributions
- everything you thought you knew about statistics is worthless when dealing with these numbers
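
One way to see the difference is to ask how much of a sample's total comes from its single largest value. The sketch below is an illustration under assumptions, not a measurement: the uniform and Pareto distributions, the tail index of 1.1, the seed, and the helper names `largest_share` and `draw_samples` are all picked here for the example.

```python
import random


def largest_share(sample):
    """Fraction of the sample total contributed by its single largest value."""
    return max(sample) / sum(sample)


def draw_samples(n=10_000, seed=1):
    """Draw one bounded and one heavy-tailed sample of n values each."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    # Bounded: uniform on [0, 1], so no value can ever exceed 1.
    bounded = [rng.uniform(0.0, 1.0) for _ in range(n)]
    # Heavy-tailed: Pareto with tail index 1.1 (an assumed, illustrative value).
    heavy = [rng.paretovariate(1.1) for _ in range(n)]
    return bounded, heavy


bounded, heavy = draw_samples()
print(f"bounded: largest value is {largest_share(bounded):.4%} of the total")
print(f"heavy:   largest value is {largest_share(heavy):.4%} of the total")
```

In the bounded sample no single observation matters much. In the heavy-tailed one, a single observation can carry a visible share of the total, which is one reason averages of such samples are slow to settle.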

# Heavy-tailed Distributions

- mean? doesn’t work
- standard deviation? whoaah, does NOT work
- sample size? too small
- I wish I had good answers here, but I don’t; this is very. hard.
- if you want to summarise a heavy-tailed sample in one number, use the maximum
- in two numbers, use the maximum and the minimum
- in three numbers, use the maximum, the minimum, and the 99th centile
- in *n* numbers, store it in a high-resolution histogram
- I have been looking for a non-parametric way to compare smaller samples of heavy-tailed distributions
- I have not found one yet; if you do, please tell me
- one candidate I found recently is the ratio of mean absolute deviation to standard deviation (MAD/SD)
- MAD/SD takes a value between 0 (very heavy-tailed) and 1 (maximally thin-tailed)
- I haven’t yet had time to experiment with this, though
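
The summaries suggested above, and the MAD/SD candidate, can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not a tested method: the function names are hypothetical, the 99th centile uses the nearest-rank definition, the histogram uses logarithmic buckets (`steps_per_octave` buckets per doubling) as one possible reading of "high-resolution", and MAD/SD is taken to mean deviation about the mean divided by the population standard deviation.

```python
import math
import statistics


def summarise(sample):
    """Minimum, 99th centile (nearest rank) and maximum of a non-empty sample."""
    ordered = sorted(sample)
    n = len(ordered)
    rank = (99 * n + 99) // 100  # integer form of ceil(0.99 * n)
    return {"min": ordered[0], "p99": ordered[rank - 1], "max": ordered[-1]}


def log_histogram(sample, steps_per_octave=8):
    """Count values per logarithmic bucket; all values must be > 0.

    Bucket k covers [2**(k / steps), 2**((k + 1) / steps)), so relative
    resolution stays the same however far the tail extends.
    """
    counts = {}
    for x in sample:
        k = math.floor(steps_per_octave * math.log2(x))
        counts[k] = counts.get(k, 0) + 1
    return dict(sorted(counts.items()))


def mad_sd_ratio(sample):
    """Mean absolute deviation about the mean, divided by the population SD.

    Raises ZeroDivisionError for a constant sample, where the SD is zero.
    """
    mean = statistics.fmean(sample)
    mad = statistics.fmean([abs(x - mean) for x in sample])
    return mad / statistics.pstdev(sample)


latencies_ms = [0.8, 1.1, 1.3, 1.2, 0.9, 1.0, 1.4, 250.0]  # made-up values
print(summarise(latencies_ms))
print(log_histogram(latencies_ms, steps_per_octave=2))
print(round(mad_sd_ratio(latencies_ms), 3))
```

For orientation, a normal distribution has a MAD/SD ratio of √(2/π), roughly 0.8, and a symmetric two-point sample such as `[-1, 1]` reaches the upper limit of 1. A sample dominated by one extreme value falls well below both.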