# Statistics for Cavemen

Caveman not know much. Caveman not get very many animal. Animal hard predict. Caveman want to get better at life. Caveman know: statistics #1 way to get better at life.

How caveman know that? Caveman observe – caveman see – everything is statistics. Everything is random. Some things much random (bright flash from sky), some things less random (sky ablaze with orange and red pretty often.) Caveman know impossible to predict this, but maybe get better with statistics? Maybe statistics what drives world?

Caveman right.

But caveman alone. Nobody tell caveman how right. Not for another 8000 years.

# Must Know

Caveman very simple man. Caveman know plus and minus and some more things. By
freak occurrence, caveman also know Python for notational convenience. (Cavemen
were not known for their sophistication, so we should not expect our caveman to
use a more suitable language.) Caveman will use this when chisel learnings into
mountain wall, so not forget.

## Example Data

Caveman collect numbers to practise. Caveman very diligent and continuous
measure heart rate during day to arrive at resting heart rate for each day.
(Measured with an optical wrist HR monitor, namely the one built into the
Garmin ForeRunner 235 model GPS smartwatch. This means there’s likely a lot of
measurement noise in the numbers, and I have no idea what it does when
determining the resting HR from the continuous measurements.) Caveman look at
wall of number, representing resting heart rate for last 106 days.

```python
resting_hr = [
    70, 69, 72, 73, 74, 64, 71, 65, 63, 71,
    60, 66, 64, 55, 57, 58, 62, 63, 67, 71,
    66, 67, 69, 70, 73, 71, 67, 65, 63, 69,
    66, 67, 69, 70, 69, 73, 73, 75, 75, 72,
    72, 70, 69, 72, 69, 70, 69, 76, 75, 72,
    64, 60, 60, 65, 66, 64, 72, 70, 72, 72,
    72, 79, 69, 68, 65, 63, 66, 71, 72, 75,
    75, 70, 69, 64, 63, 68, 67, 64, 68, 64,
    66, 69, 60, 65, 62, 68, 70, 71, 73, 70,
    63, 61, 73, 73, 72, 81, 60, 72, 63, 64,
    69, 71, 75, 76, 71, 72,
]
```

Caveman see lot of 69, 72, 68 and number close to 70. Caveman learn about averages.

## Measures of Average

- Average
- *The* average is something people say way too often. There is not a single number known as the average of a data set. There are many ways to measure what’s average, and different techniques are good for different situations. What all measures of average have in common, though, is that they are attempts at summarising the entire dataset into one value.

- Mode
- The mode is probably the simplest form of average: it is computed as the most frequently occurring value in the data set. Mathematically, I’ll write it with a capital \(M\).

Caveman check three most common resting heart rates.

```python
from collections import Counter

print(Counter(resting_hr).most_common(3))
```

[(72, 13), (69, 12), (70, 9)]

Caveman see 72 mode value, follow close by 69, then 70.

Caveman save definition for later.

```python
def M(x):
    "The most common value in x."
    [(value, frequency)] = Counter(x).most_common(1)
    return value

print(M(resting_hr))
```

72
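`most_common(1)` returns a single value even when several values are tied for most frequent, so `M` quietly picks one of them. A minimal sketch of how every tied value could be listed instead; the helper name `modes` and the sample `tied` are made up here for illustration and are not part of caveman's wall.

```python
from collections import Counter

def modes(x):
    "Return every value that shares the highest count in x."
    counts = Counter(x)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

# Made-up sample where 7 and 8 both occur twice.
tied = [7, 8, 7, 8, 9]
print(Counter(tied).most_common(1))  # only one of the tied values
print(modes(tied))                   # all tied values
```

For data with one clear winner, such as the resting heart rates, both approaches agree. The sketch assumes `x` is not empty.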

- Median
- The median is a different type of average, being the middle value of
all samples, when they are ordered by size. If there are two middle
values, the median is defined as the value in the middle of those
two values. (So, if the middle two elements are 9 and 15, then the median is 12, which is equally far from 9 and 15.) Notationally, if the measurement variable is called \(x\), the median is written as \(\textrm{md}(x)\).

Caveman immediately find definition interesting, and carve into mountain wall directly.

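```python
def even(i):
    "Return true if i is even, otherwise false."
    return i % 2 == 0

def md(x):
    "Compute the median for x."
    n = len(x) - 1       # index of the last element
    middle = n // 2      # (lower) middle index
    # Even last index means odd count: exactly one middle element.
    # Otherwise take the mean of the two middle elements.
    return sorted(x)[middle] if even(n) \
        else sum(sorted(x)[middle : middle+2])/2

print(md(resting_hr))
```

69.0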

Caveman look at definition for `md` and caveman not very happy. Messy. (And
in fact, we could pull some interesting integer-division tricks out of our
sleeves to make it more pleasing to look at, but it would likely also be harder
to grok.) But caveman knows, what has once been etched into side of mountain
very hard to remove.
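One possible shape for such an integer-division trick, as a sketch rather than the definitive version: in a sorted list of length `n`, the indices `(n - 1) // 2` and `n // 2` are the two middle positions when `n` is even, and the very same position when `n` is odd, so the branch disappears. The name `md_short` is an assumption chosen here.

```python
def md_short(x):
    "Median of x as the mean of the two middle elements of sorted x."
    ordered = sorted(x)
    n = len(ordered)
    # For odd n both indices point at the same, single middle element.
    return (ordered[(n - 1) // 2] + ordered[n // 2]) / 2

print(md_short([9, 15]))    # two middle values, 9 and 15
print(md_short([3, 1, 2]))  # one middle value, 2
```

This always returns a float and sorts only once, at the price of being a little harder to read.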

Caveman not even convinced 69 is middle value. Caveman must look.

```
print(sorted(resting_hr))
```

[55, 57, 58, 60, 60, 60, 60, 60, 61, 62, 62, 63, 63, 63, 63, 63, 63, 63, 64, 64, 64, 64, 64, 64, 64, 64, 65, 65, 65, 65, 65, 66, 66, 66, 66, 66, 66, 67, 67, 67, 67, 67, 68, 68, 68, 68, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 70, 70, 70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71, 71, 71, 71, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 73, 73, 73, 73, 73, 73, 73, 74, 75, 75, 75, 75, 75, 75, 76, 76, 79, 81]

Caveman not write results down on mountain side because caveman arm tired and long sequence of numbers. Instead, caveman figure out more type of average!

- Arithmetic mean
- The arithmetic mean (which is, obviously, not the same thing as the much less commonly used geometric mean) is probably the most common type of average used in everyday speech. It is computed as the sum of values divided by number of values. The arithmetic mean is so common it has gained a special way of writing it: \(\bar{x}\) represents the arithmetic mean of \(x\).

```python
def mean(x):
    "Compute the sum divided by the length of x."
    return sum(x)/len(x)

print('%.2f' % mean(resting_hr))
```

68.35

Caveman sad. From computer point of view, this code also not good because
need whole `x` in memory: `sum` walk through all of `x`, and `len` only work
when `x` is finished list. Arithmetic mean can be computed with streaming
algorithm, but caveman simple man. Caveman not know streaming. Caveman only
ever deal with data small enough write on mountain side. Caveman happy again.
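For the curious, a streaming mean could look something like the sketch below. It keeps only a running count and a running mean, so it works on values that arrive one at a time. The name `streaming_mean` and the five sample values (the first five resting heart rates) are choices made here for illustration.

```python
def streaming_mean(values):
    "Arithmetic mean of any iterable, visiting each value exactly once."
    count = 0
    mean = 0.0
    for value in values:
        count += 1
        # Nudge the running mean a fraction of the way towards the new value.
        mean += (value - mean) / count
    if count == 0:
        raise ValueError("mean of no values is undefined")
    return mean

# A generator has no len() and can only be read once.
first_days = (hr for hr in [70, 69, 72, 73, 74])
print('%.2f' % streaming_mean(first_days))
```

The result agrees with sum divided by count (358 / 5) up to floating-point rounding.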

Caveman play with average definition, and record amount of sleep nights after hunt, and nights after fight animal.

```python
hunting_sleep = [8, 7, 9, 8, 7, 8, 10, 7]
fighting_sleep = [3, 13, 8, 1, 19, 8, 2, 10]

print('Hunting: M=%d md=%d mean=%.2f' % (
    M(hunting_sleep), md(hunting_sleep), mean(hunting_sleep)
))
print('Fighting: M=%d md=%d mean=%.2f' % (
    M(fighting_sleep), md(fighting_sleep), mean(fighting_sleep)
))
```

```
Hunting: M=8 md=8 mean=8.00
Fighting: M=8 md=8 mean=8.00
```

Caveman scratch head over result.

Caveman know for sure fighting more exhausted day after. But average say both same? Caveman insane?

## Measures of Spread

Caveman discover spread. After fight, caveman sleep *a lot*. Or very
little. Exhausting not know beforehand if much or little. After hunting, caveman
always sleep very similar. Caveman like when things similar.

Caveman find way to measure how much different values are.

- Variation width
- The variation width is the difference between the largest and smallest value of the data set.

Very easy maths. Caveman can do in head! Caveman make note.

```python
def varwidth(x):
    "Difference between the largest and smallest value of x."
    return max(x) - min(x)

print('Hunting spread (variation width): %d' % varwidth(hunting_sleep))
print('Fighting spread (variation width): %d' % varwidth(fighting_sleep))
```

```
Hunting spread (variation width): 3
Fighting spread (variation width): 18
```

Now clear as day! Fighting bad. But maybe exaggerated result. Single very large or small value give big variation width. Must be better way.

Maybe remove some extreme samples? Maybe remove 25% on each side, leaving only the middle half of the samples?

- Interquartile range
- The interquartile range, more conveniently written as
iqr, is the range of values covered by the middle half of the data. It is
called such because you have divided the data into quarters: the smallest
quarter and the largest quarter are removed. The borders between the
quarters are called *quartiles*, and the iqr is measuring the distance between two quartiles. (It is important to understand the implications of this. When you are computing the iqr, you are in fact computing the variation width for a data set only half the size, and the outliers removed have *no* effect on the result at all.)

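```python
def iqr(x):
    "Variation width of (roughly) the middle half of x."
    n = len(x) - 1
    q1 = n // 4       # approximate position of the first quartile
    q3 = 3 * n // 4   # approximate position of the third quartile
    return varwidth(sorted(x)[q1 : q3])

print('Hunting spread (interquartile range): %d' % iqr(hunting_sleep))
print('Fighting spread (interquartile range): %d' % iqr(fighting_sleep))
```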

```
Hunting spread (interquartile range): 1
Fighting spread (interquartile range): 6
```

Caveman think this pretty cool, and probably good measure of spread.
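To see how little the iqr cares about a single extreme value, here is a small sketch that adds one made-up night of 20 hours of sleep to the hunting data. The extra night is hypothetical, and the two helper functions are repeated only so the example runs on its own.

```python
def varwidth(x):
    return max(x) - min(x)

def iqr(x):
    # Same rough quartile positions as the definition carved above.
    n = len(x) - 1
    return varwidth(sorted(x)[n // 4 : 3 * n // 4])

hunting_sleep = [8, 7, 9, 8, 7, 8, 10, 7]
# Hypothetical outlier: caveman oversleep after very big hunt.
with_outlier = hunting_sleep + [20]

print('Variation width: %d -> %d' % (
    varwidth(hunting_sleep), varwidth(with_outlier)))
print('Interquartile range: %d -> %d' % (
    iqr(hunting_sleep), iqr(with_outlier)))
```

The variation width jumps from 3 to 13, while the interquartile range stays at 1.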

But then caveman get flash insight.

Caveman take tone of British aristocrat, perhaps Oxford attendee, and say,

> Oi, chaps, I may have got it! What if we measure spread as the average square distance from the mean? That seems nice! We’ll call it the *variance* of the data.

- Variance
- The variance is what you get if you, for each sample, compute the
square of its distance to the mean, and then divide the sum of
those numbers by \(n - 1\). This is a mean in some sense (it’s a sum
divided by a count, after all), but it’s not the mean of the
samples – it’s the mean of the square distances from the samples
to the mean of the samples. This is expressed by capital letter
\(V\) in maths. (And sometimes, just to confuse, it’s written as \(s^2\) and you will very shortly see why.)

Caveman sketch advanced maths formula.

\[ V = \frac{\sum \left(x - \bar{x}\right)^2}{n - 1} \]

Caveman barely understand own formula. (Something about subtracting the mean
from a sample and squaring the result. Then computing that number for all
samples, and summing those numbers, and dividing by \(n - 1\), because the
\(n\) differences from the mean always sum to zero, so only \(n - 1\) of them
are free to vary.) Caveman invent formula that easier to compute on mountain wall.

\[ V = \frac{\sum x^2 - \frac{\left(\sum x\right)^2}{n}}{n - 1} \]

This formula computed in parts – good for mountain wall!
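For anyone who wants to check that the two formulas agree, the numerator can be expanded step by step. This sketch uses only the definition \(\bar{x} = \frac{\sum x}{n}\) and the symbols already on the wall.

\[
\begin{aligned}
\sum \left(x - \bar{x}\right)^2
  &= \sum x^2 - 2\bar{x}\sum x + n\bar{x}^2 \\
  &= \sum x^2 - 2\frac{\left(\sum x\right)^2}{n} + \frac{\left(\sum x\right)^2}{n} \\
  &= \sum x^2 - \frac{\left(\sum x\right)^2}{n}
\end{aligned}
\]

Dividing both ends by \(n - 1\) gives the two expressions for \(V\). One caveat for computers rather than mountain walls: with floating-point numbers, subtracting two large, nearly equal sums can lose precision, so the shortcut is best kept for small whole numbers like caveman's.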

```python
def V(x):
    "Compute the sample variance of x (dividing by n - 1)."
    n = len(x)
    sum_of_squares = sum(xi**2 for xi in x)
    square_of_sum = sum(x)**2
    square_distance = sum_of_squares - square_of_sum/n
    return square_distance / (n - 1)

print("Hunting spread (variance): %.2f" % V(hunting_sleep))
print("Fighting spread (variance): %.2f" % V(fighting_sleep))
```

```
Hunting spread (variance): 1.14
Fighting spread (variance): 37.14
```

Caveman little confused. This also seem unfair, with huge number for fighting and tiny number for hunting. Caveman decide must be because square distance. Caveman decide to try square root of variance. Caveman great inventor of statistical measures!!

- Standard deviation
- The infamous standard deviation is the square root of
the variance. In other words, the standard deviation is in some sense a
measure of the average distance from the mean. Mathematically, it is often
expressed as the Latin lower-case letter \(s\), or the Greek lower-case letter \(\sigma\). (And this, then, is why we can write \(V = s^2\), because \(s = \sqrt{V}\).)

```python
import math

def s(x):
    "Compute the standard deviation of x."
    return math.sqrt(V(x))

print("Hunting spread (standard deviation): %.2f" % s(hunting_sleep))
print("Fighting spread (standard deviation): %.2f" % s(fighting_sleep))
```

```
Hunting spread (standard deviation): 1.07
Fighting spread (standard deviation): 6.09
```

Caveman think this look fair. Caveman make note that this time, standard deviation and interquartile range were very similar. Caveman think maybe general pattern? But not sure.
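As a sanity check on the hand-carved definitions, the same numbers can be asked of Python's standard `statistics` module (available since Python 3.4). This is only a cross-check sketch. `statistics.variance` and `statistics.stdev` are the sample versions, dividing by \(n - 1\) like the formula above.

```python
import statistics

fighting_sleep = [3, 13, 8, 1, 19, 8, 2, 10]

print('md=%.2f mean=%.2f' % (
    statistics.median(fighting_sleep), statistics.mean(fighting_sleep)))
print('V=%.2f s=%.2f' % (
    statistics.variance(fighting_sleep), statistics.stdev(fighting_sleep)))
```

The output matches caveman's own results for the fighting nights: median and mean 8.00, variance 37.14, standard deviation 6.09.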

Writing last definition, caveman almost scribble over heart rate data on mountain side. Caveman almost give self heart attack. Just for safe, caveman try to summarise resting heart rate data using all new learnings.

```python
print('Mode  Median  Mean')
print('%4d  %6d  %.2f' % (
    M(resting_hr), md(resting_hr), mean(resting_hr)
))
print('')
print('Var.width  IQR  Variance  Std.dev.')
print('%9d  %3d  %8.2f  %8.2f' % (
    varwidth(resting_hr), iqr(resting_hr), V(resting_hr), s(resting_hr)
))
```

```
Mode  Median  Mean
  72      69  68.35

Var.width  IQR  Variance  Std.dev.
       26    7     23.30      4.83
```

Caveman very curious man. Caveman make experiment.

HR | Occurrences | Measure |
---|---|---|
55 | I | |
56 | | |
57 | I | |
58 | I | |
59 | | |
60 | IIIII | |
61 | I | |
62 | II | |
63 | IIIIIII | |
64 | IIIIIIII | 🡄 mean - std. dev. |
65 | IIIII | |
66 | IIIIII | |
67 | IIIII | |
68 | IIII | 🡄 mean |
69 | IIIIIIIIIIII | 🡄 median |
70 | IIIIIIIII | |
71 | IIIIIIII | |
72 | IIIIIIIIIIIII | 🡄 mode |
73 | IIIIIII | 🡄 mean + std. dev. |
74 | I | |
75 | IIIIII | |
76 | II | |
77 | | |
78 | | |
79 | I | |
80 | | |
81 | I | |

Caveman make quick mountain side maths. How many sample lie inside \(\bar{x} \pm s\)?

```python
# Share of samples strictly between 63 and 74, i.e. within mean ± std. dev.
print('%.0f %%' % (100 * sum(
    count for (xi, count) in Counter(resting_hr).items()
    if 63 < xi < 74
) / len(resting_hr)))
```

73 %

Caveman make mental note: many sample closer to mean than average distance to mean.
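The bounds 63 and 74 above were read off the wall by hand. A sketch of the same count with the bounds computed from the data, tried on the two sleep data sets; the name `share_within_one_s` is made up here, and the standard `statistics` module stands in for the hand-carved `mean` and `s`.

```python
import statistics

def share_within_one_s(x):
    "Fraction of samples closer to the mean than one standard deviation."
    centre = statistics.mean(x)
    spread = statistics.stdev(x)  # sample standard deviation (n - 1)
    inside = sum(1 for xi in x if abs(xi - centre) < spread)
    return inside / len(x)

hunting_sleep = [8, 7, 9, 8, 7, 8, 10, 7]
fighting_sleep = [3, 13, 8, 1, 19, 8, 2, 10]
print('Hunting: %.1f %%' % (100 * share_within_one_s(hunting_sleep)))
print('Fighting: %.1f %%' % (100 * share_within_one_s(fighting_sleep)))
```

Here too most samples land inside \(\bar{x} \pm s\): 7 of 8 hunting nights and 6 of 8 fighting nights.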