Two Wrongs

Quality in Software Development, Part 4: Statistics

Normality Considered Harmful

  • in processes where there exist natural boundaries or limits, such as manufacturing, things tend to follow normal distributions
  • or at least, normal-ish distributions
  • what’s great about normal-ish distributions is that averages of samples drawn from them quickly approach a normal distribution (central limit theorem)
  • when learning about processes, there is always a silent assumption that numbers will follow the normal distribution
  • this is so ingrained that it is never even mentioned explicitly
  • you can safely assume that what you learn only holds for normal-ish numbers
  • however, in other fields with no such natural limits, such as economics, things are not normally distributed
  • in these fields, things follow heavy-tailed distributions
  • this includes software engineering
  • almost every number of interest in software engineering is theoretically unbounded: file size, web traffic intensity, latency, what have you
  • these follow heavy-tailed distributions
  • everything you thought you knew about statistics is worthless when dealing with these numbers
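
To make the contrast concrete, here is a minimal sketch comparing how stable the sample mean is for a bounded input and for a heavy-tailed one. The choices here are assumptions for illustration only: a uniform distribution on [1, 3] stands in for "normal-ish", a Pareto distribution with shape 1.1 stands in for "heavy-tailed", and `sample_means` and `spread_ratio` are hypothetical helper names.

```python
import random
import statistics


def sample_means(draw, n_samples, sample_size, rng):
    """Mean of each of n_samples independent samples of sample_size draws."""
    return [
        statistics.fmean([draw(rng) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]


def spread_ratio(means):
    """Largest sample mean divided by the smallest; close to 1 means stable."""
    return max(means) / min(means)


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Bounded, normal-ish input (assumed stand-in): uniform on [1, 3].
bounded = sample_means(lambda r: r.uniform(1.0, 3.0), 200, 100, rng)
# Heavy-tailed input (assumed stand-in): Pareto with shape 1.1,
# which has a finite mean but infinite variance.
heavy = sample_means(lambda r: r.paretovariate(1.1), 200, 100, rng)

print("bounded spread:", spread_ratio(bounded))
print("heavy-tailed spread:", spread_ratio(heavy))
```

With the bounded input, the 200 sample means should all land close to each other. With the heavy-tailed input, a handful of extreme draws drag individual sample means far from the rest, so the spread is much larger.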

Heavy-tailed Distributions

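  • mean? doesn’t work: a few extreme values dominate it, so it converges very slowly
  • standard deviation? whoaah, does NOT work: the theoretical value may be infinite, so the sample value never settles
  • sample size? too small: whatever you have, the tail is still under-sampled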
  • I wish I had good answers here, but I don’t; this is very. hard.
  • if you want to summarise a heavy-tailed sample in one number, use the maximum
  • in two numbers, use the maximum and the minimum
  • in three numbers, use the maximum, the minimum, and the 99th centile
  • in n numbers, store it in a high-resolution histogram
  • I have been looking for a non-parametric way to compare smaller samples of heavy-tailed distributions
  • not yet found one; if you do, please tell me
  • one candidate I found recently is mean absolute deviation over standard deviation (MAD/SD)
  • MAD/SD takes a value between 0 (very heavy-tailed) and 1 (maximally thin-tailed); a normal distribution lands at about 0.8
  • I haven’t yet had time to experiment with this, though
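
The summaries above can be sketched in a few lines. This is a rough illustration under stated assumptions, not a tested method: `summarise` and `mad_over_sd` are hypothetical names, the 99th centile uses the nearest-rank definition, MAD is taken as the mean absolute deviation around the mean, and SD is the population standard deviation. The normal and Pareto samples are synthetic stand-ins.

```python
import random
import statistics


def summarise(sample):
    """Three-number summary: maximum, minimum, and 99th centile."""
    ordered = sorted(sample)
    n = len(ordered)
    # Nearest-rank centile: ceil(0.99 * n), computed with integers
    # to avoid floating-point rounding surprises.
    rank = (99 * n + 99) // 100
    return {"max": ordered[-1], "min": ordered[0], "p99": ordered[rank - 1]}


def mad_over_sd(sample):
    """Mean absolute deviation divided by population standard deviation.

    Raises ZeroDivisionError if every value in the sample is identical.
    """
    mean = statistics.fmean(sample)
    mad = statistics.fmean([abs(x - mean) for x in sample])
    return mad / statistics.pstdev(sample)


rng = random.Random(1)  # fixed seed so the sketch is repeatable
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
pareto_sample = [rng.paretovariate(1.5) for _ in range(10_000)]

print(summarise(pareto_sample))
print("normal MAD/SD:", mad_over_sd(normal_sample))
print("pareto MAD/SD:", mad_over_sd(pareto_sample))
```

The two-point sample `[-1, 1]` gives a ratio of exactly 1, which is the thin-tailed extreme. The normal sample should come out near 0.8 and the Pareto sample noticeably lower, though how well the ratio separates small samples is exactly the open question.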