Two Wrongs

Statistics for Cavemen


Caveman not know much. Caveman not get very many animal. Animal hard predict. Caveman want to get better at life. Caveman know: statistics #1 way to get better at life.

How caveman know that? Caveman observe – caveman see – everything is statistics. Everything is random. Some things much random (bright flash from sky), some things less random (sky ablaze with orange and red pretty often). Caveman know impossible to predict this, but maybe get better with statistics? Maybe statistics what drives world?

Caveman right.

But caveman alone. Nobody tell caveman how right. Not for another 8000 years.

Must Know

Caveman very simple man. Caveman know plus and minus and some more things. By freak occurrence, caveman also know Python for notational convenience. (Cavemen were not known for their sophistication, so we should not expect our caveman to use a more suitable language.) Caveman will use this when chisel learnings into mountain wall, so not forget.

Example Data

Caveman collect numbers to practise. Caveman very diligent and continuous measure heart rate during day to arrive at resting heart rate for each day. (With an optical wrist HR monitor, namely the one built into the Garmin ForeRunner 235 model GPS smartwatch. This means there’s likely a lot of measurement noise in the numbers, and I have no idea what it does when determining the resting HR from the continuous measurements.) Caveman look at wall of number, representing resting heart rate for last 106 days.

resting_hr = [
    70, 69, 72, 73, 74, 64, 71, 65, 63, 71, 60,
    66, 64, 55, 57, 58, 62, 63, 67, 71, 66, 67,
    69, 70, 73, 71, 67, 65, 63, 69, 66, 67, 69,
    70, 69, 73, 73, 75, 75, 72, 72, 70, 69, 72, 
    69, 70, 69, 76, 75, 72, 64, 60, 60, 65, 66, 
    64, 72, 70, 72, 72, 72, 79, 69, 68, 65, 63, 
    66, 71, 72, 75, 75, 70, 69, 64, 63, 68, 67, 
    64, 68, 64, 66, 69, 60, 65, 62, 68, 70, 71, 
    73, 70, 63, 61, 73, 73, 72, 81, 60, 72, 63, 
    64, 69, 71, 75, 76, 71, 72
]

Caveman see lot of 69, 72, 68 and number close to 70. Caveman learn about averages.

Measures of Average

Average
The average is something people say way too often. There is not a single number known as the average of a data set. There are many ways to measure what’s average, and different techniques are good for different situations. What all measures of average have in common, though, is that they are attempts at summarising the entire dataset into one value.
Mode
The mode is probably the simplest form of average: it is computed as the most frequently occurring value in the data set. Mathematically, I’ll write it with a capital \(M\).

Caveman check three most common resting heart rates.

from collections import Counter

print(Counter(resting_hr).most_common(3))
[(72, 13), (69, 12), (70, 9)]

Caveman see 72 mode value, follow close by 69, then 70.

Caveman save definition for later.

def M(x):
    "The most common value in x."
    [(value, frequency)] = Counter(x).most_common(1)
    return value

print(M(resting_hr))
72
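
A data set can have more than one most common value, and `most_common(1)` quietly reports only one of them. As a hedged aside, the sketch below uses `statistics.multimode` from the standard library (Python 3.8 or later) to list every mode. The sample list is made up for illustration.

```python
from collections import Counter
from statistics import multimode

# Made-up sample with two values tied for most common.
sample = [8, 7, 9, 8, 7, 8, 10, 7]

# most_common(1) reports a single (value, count) pair even when there is a tie.
print(Counter(sample).most_common(1))

# multimode returns every value sharing the highest count,
# in the order they first appear in the data.
modes = multimode(sample)
print(modes)
```

When the list of modes has more than one entry, a single "the mode" is a less trustworthy summary.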

Median
The median is a different type of average, being the middle value of all samples, when they are ordered by size. If there are two middle values, the median is defined as the value in the middle of those two values. (So, if the middle two elements are 9 and 15, then the median is 12, which is equally far from 9 and 15.) Notationally, if the measurement variable is called \(x\), the median is written as \(\textrm{md}(x)\).

Caveman immediately find definition interesting, and carve into mountain wall directly.

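def even(i):
    "Return true if i is even, otherwise false."
    return i % 2 == 0

def md(x):
    "Compute the median for x."
    ordered = sorted(x)
    n = len(ordered) - 1   # index of the last element
    middle = n // 2
    # Even last index means odd number of samples: one middle value.
    # Otherwise take the mean of the two middle values.
    return ordered[middle] if even(n) \
        else sum(ordered[middle : middle+2])/2

print(md(resting_hr))
69.0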

Caveman look at definition for md and caveman not very happy. Messy. (And in fact, we could pull some interesting integer-division tricks out of our sleeves to make it more pleasing to look at, but it would likely also be harder to grok.) But caveman know, what has once been etched into side of mountain very hard to remove.
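
One possible version of such an integer-division trick: in a sorted list of length n, the indices (n - 1) // 2 and n // 2 are the same when n is odd and point at the two middle values when n is even, so averaging the two covers both cases. The name `md_trick` is made up here. This is a sketch, not a replacement for what is already on the mountain.

```python
def md_trick(x):
    "Median of x without branching on whether len(x) is odd or even."
    ordered = sorted(x)
    n = len(ordered)
    lower = ordered[(n - 1) // 2]   # the middle value, or the lower of the two
    upper = ordered[n // 2]         # the middle value, or the upper of the two
    return (lower + upper) / 2

print(md_trick([9, 15]))     # two middle values: 9 and 15
print(md_trick([3, 1, 2]))   # single middle value: 2
```

It always returns a float, and like the original it assumes the list is not empty.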

Caveman not even convinced 69 is middle value. Caveman must look.

print(sorted(resting_hr))
[55, 57, 58, 60, 60, 60, 60, 60, 61, 62, 62, 63, 63, 63, 63, 63, 63, 63, 64, 64, 64, 64, 64, 64, 64, 64, 65, 65, 65, 65, 65, 66, 66, 66, 66, 66, 66, 67, 67, 67, 67, 67, 68, 68, 68, 68, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 70, 70, 70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71, 71, 71, 71, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 73, 73, 73, 73, 73, 73, 73, 74, 75, 75, 75, 75, 75, 75, 76, 76, 79, 81]

Caveman not write results down on mountain side because caveman arm tired and long sequence of numbers. Instead, caveman figure out more type of average!

Arithmetic mean
The arithmetic mean (which is, obviously, not the same thing as the much less commonly used geometric mean) is probably the most common type of average used in everyday speech. It is computed as the sum of values divided by the number of values. The arithmetic mean is so common it has gained a special way of writing it: \(\bar{x}\) represents the arithmetic mean of \(x\).
def mean(x):
    "Compute the sum divided by the length of x."
    return sum(x)/len(x)

print('%.2f' % mean(resting_hr))
68.35

Caveman sad. From computer point of view, this code also not good because traverse x twice. Arithmetic mean can be computed streaming algorithm, but caveman simple man. Caveman not know streaming. Caveman only ever deal with data small enough write on mountain side. Caveman happy again.
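
For the curious, a single-pass (streaming) mean might look like the sketch below. It keeps a running mean and nudges it towards each new value by 1/count of the difference, so the data is never stored or traversed twice. The name `streaming_mean` is an invention for this example.

```python
def streaming_mean(values):
    "Arithmetic mean of an iterable, computed in a single pass."
    running = 0.0
    count = 0
    for value in values:
        count += 1
        # Move the running mean 1/count of the way towards the new value.
        running += (value - running) / count
    if count == 0:
        raise ValueError("mean of no values is undefined")
    return running

# Works on a generator, which can only be traversed once.
print('%.2f' % streaming_mean(n for n in [8, 7, 9, 8, 7, 8, 10, 7]))
```

The result can differ from sum(x)/len(x) in the last few decimals, because the floating-point rounding happens in a different order.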

Caveman play with average definition, and record amount of sleep nights after hunt, and nights after fight animal.

hunting_sleep = [8, 7, 9, 8, 7, 8, 10, 7]
fighting_sleep = [3, 13, 8, 1, 19, 8, 2, 10]

print('Hunting:  M=%d  md=%d   mean=%.2f' % (
    M(hunting_sleep),
    md(hunting_sleep),
    mean(hunting_sleep)
))
print('Fighting: M=%d  md=%d   mean=%.2f' % (
    M(fighting_sleep),
    md(fighting_sleep),
    mean(fighting_sleep)
))
Hunting:  M=8  md=8   mean=8.00
Fighting: M=8  md=8   mean=8.00

Caveman scratch head over result.


Caveman know for sure fighting more exhausted day after. But average say both same? Caveman insane?

Measures of Spread

Caveman discover spread. After fight, caveman sleep a lot. Or very little. Exhausting not know beforehand if much or little. After hunting, caveman always sleep very similar. Caveman like when things similar.

Caveman find way to measure how much different values are.

Variation width
The variation width is the difference between the largest and smallest value of the data set.

Very easy maths. Caveman can do in head! Caveman make note.

def varwidth(x):
    return max(x) - min(x)

print('Hunting spread (variation width): %d' %
    varwidth(hunting_sleep))
print('Fighting spread (variation width): %d' %
    varwidth(fighting_sleep))
Hunting spread (variation width): 3
Fighting spread (variation width): 18

Now clear as day! Fighting bad. But maybe exaggerated result. Single very large or small value give big variation width. Must be better way.
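
To see how much a single sample can move the variation width, the sketch below appends one made-up night of 24 hours of sleep to the hunting data. That extra value is hypothetical and not part of caveman's measurements.

```python
def varwidth(x):
    return max(x) - min(x)

hunting_sleep = [8, 7, 9, 8, 7, 8, 10, 7]
# Hypothetical extra night, added only to illustrate the effect.
with_outlier = hunting_sleep + [24]

print(varwidth(hunting_sleep))   # 10 - 7
print(varwidth(with_outlier))    # 24 - 7
```

One sample out of nine is enough to make the width several times larger, even though the other eight nights are unchanged.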

Maybe remove some extreme samples? Maybe remove 25% on each side, leaving only the middle half of the samples?

Interquartile range
The interquartile range, more conveniently written as IQR, is the range of values covered by the middle half of the data. It is called such because you have divided the data into quarters: the smallest quarter and the largest quarter are removed. The borders between the quarters are called quartiles, and the IQR measures the distance between the first and third quartiles. (It is important to understand the implications of this. When you are computing the IQR, you are in fact computing the variation width for a data set half the size, and the outliers removed have no effect on the result at all.)
def iqr(x):
    n = len(x) - 1
    q1 = n // 4
    q3 = 3 * n // 4
    return varwidth(sorted(x)[q1 : q3])

print('Hunting spread (interquartile range): %d' %
    iqr(hunting_sleep))
print('Fighting spread (interquartile range): %d' %
    iqr(fighting_sleep))
Hunting spread (interquartile range): 1
Fighting spread (interquartile range): 6
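
Quartiles can be defined in several slightly different ways, so other tools may report a different interquartile range for the same data. As a rough cross-check, the sketch below uses `statistics.quantiles` from the standard library (Python 3.8 or later), which by default interpolates between neighbouring samples. The name `iqr_stdlib` is made up here, and its numbers need not match the slice-based definition.

```python
from statistics import quantiles

def iqr_stdlib(x):
    "Distance between the first and third quartile, as the stdlib defines them."
    q1, _median, q3 = quantiles(x, n=4)
    return q3 - q1

hunting_sleep = [8, 7, 9, 8, 7, 8, 10, 7]
fighting_sleep = [3, 13, 8, 1, 19, 8, 2, 10]

print('Hunting spread (stdlib quartiles): %.2f' % iqr_stdlib(hunting_sleep))
print('Fighting spread (stdlib quartiles): %.2f' % iqr_stdlib(fighting_sleep))
```

The exact values depend on the quartile convention, but the conclusion is the same: sleep after fighting is far more spread out.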

Caveman think this pretty cool, and probably good measure of spread.

But then caveman get flash insight.

Caveman take tone of British aristocrat, perhaps Oxford attendee, and say,

Oi, chaps, I may have got it! What if we measure spread as the average square distance from the mean? That seems nice! We’ll call it the variance of the data.

Variance
The variance is what you get if you, for each sample, compute the square of its distance to the mean, and then divide the sum of those numbers by \(n - 1\). This is a mean in some sense (it’s a sum divided by a count, after all), but it’s not the mean of the samples – it’s the mean of the square distances from the samples to the mean of the samples. This is expressed by capital letter \(V\) in maths. (And sometimes, just to confuse, it’s written as \(s^2\) and you will very shortly see why.)

Caveman sketch advanced maths formula.

\[ V = \frac{\sum \left(x - \bar{x}\right)^2}{n - 1} \]

Caveman barely understand own formula. (Something about subtracting the mean from a sample and squaring the result. Then computing that number for all samples, and summing those numbers, and dividing by \(n - 1\), because the \(n\) signed distances from the mean always sum to zero, so only \(n - 1\) of them are free to vary.) Caveman invent formula that easier to compute on mountain wall.

\[ V = \frac{\sum x^2 - \frac{\left(\sum x\right)^2}{n}}{n - 1} \]
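
That the two formulas agree is not obvious at first sight. A short derivation of the numerator, using only \(\bar{x} = \frac{\sum x}{n}\) and expanding the square:

\[ \sum \left(x - \bar{x}\right)^2 = \sum x^2 - 2\bar{x}\sum x + n\bar{x}^2 = \sum x^2 - \frac{2\left(\sum x\right)^2}{n} + \frac{\left(\sum x\right)^2}{n} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} \]

Dividing both ends by \(n - 1\) turns the left-hand side into the definition of \(V\) and the right-hand side into the easier formula.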

This formula computed in parts – good for mountain wall!

def V(x):
    "Compute the variance of x, one part of the formula at a time."
    n = len(x)
    sum_of_squares = sum(xi**2 for xi in x)
    square_of_sum = sum(x)**2
    square_distance = sum_of_squares - square_of_sum/n
    return square_distance / (n - 1)

print("Hunting spread (variance): %.2f" %
    V(hunting_sleep))
print("Fighting spread (variance): %.2f" %
    V(fighting_sleep))
Hunting spread (variance): 1.14
Fighting spread (variance): 37.14
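
As a sanity check, the variance can also be computed straight from the definition in two passes: first the mean, then the squared distances. The sketch below (the name `V_definition` is made up here) should give the same answers on the sleep data. With floating-point data whose mean is large compared to its spread, the easier formula can lose precision because it subtracts two nearly equal large numbers, so the two-pass form is often the safer choice there.

```python
def V_definition(x):
    "Sample variance of x, computed directly from the definition."
    n = len(x)
    x_bar = sum(x) / n
    return sum((xi - x_bar)**2 for xi in x) / (n - 1)

hunting_sleep = [8, 7, 9, 8, 7, 8, 10, 7]
fighting_sleep = [3, 13, 8, 1, 19, 8, 2, 10]

print("Hunting spread (variance): %.2f" % V_definition(hunting_sleep))
print("Fighting spread (variance): %.2f" % V_definition(fighting_sleep))
```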

Caveman little confused. This also seem unfair, with huge number for fighting and tiny number for hunting. Caveman decide must be because square distance. Caveman decide to try square root of variance. Caveman great inventor of statistical measures!!

Standard deviation
The infamous standard deviation is the square root of the variance. In other words, the standard deviation is in some sense a measure of the average distance from the mean. Mathematically, it is often expressed as the Latin lower-case letter \(s\), or the Greek lower-case letter \(\sigma\). (And this, then, is why we can write \(V = s^2\), because \(s = \sqrt{V}\).)
import math

def s(x):
    return math.sqrt(V(x))

print("Hunting spread (standard deviation): %.2f" %
    s(hunting_sleep))
print("Fighting spread (standard deviation): %.2f" %
    s(fighting_sleep))
Hunting spread (standard deviation): 1.07
Fighting spread (standard deviation): 6.09
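
Python's standard library has shipped a `statistics` module since version 3.4. Its `variance` and `stdev` functions also divide by \(n - 1\), so they can serve as an independent cross-check on the hand-carved definitions. A minimal sketch on the same sleep data:

```python
import statistics

hunting_sleep = [8, 7, 9, 8, 7, 8, 10, 7]
fighting_sleep = [3, 13, 8, 1, 19, 8, 2, 10]

for name, data in [("Hunting", hunting_sleep), ("Fighting", fighting_sleep)]:
    print("%s: variance=%.2f  std.dev.=%.2f" % (
        name,
        statistics.variance(data),   # sample variance, divides by n - 1
        statistics.stdev(data),      # square root of the sample variance
    ))
```

The module also has `pvariance` and `pstdev`, which divide by \(n\) instead and therefore give slightly smaller numbers.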

Caveman think this look fair. Caveman make note that this time, standard deviation and interquartile range were very similar. Caveman think maybe general pattern? But not sure.

Writing last definition, caveman almost scribble over heart rate data on mountain side. Caveman almost give self heart attack. Just for safe, caveman try to summarise resting heart rate data using all new learnings.

print('Mode    Median      Mean')
print('  %d        %d     %.2f' % (
    M(resting_hr),
    md(resting_hr),
    mean(resting_hr)
))
print('')
print('Var.width  IQR  Variance  Std.dev.')
print('       %d    %d     %.2f      %.2f' % (
    varwidth(resting_hr),
    iqr(resting_hr),
    V(resting_hr),
    s(resting_hr)
))
Mode    Median      Mean
  72        69     68.35

Var.width  IQR  Variance  Std.dev.
       26    7     23.30      4.83

Caveman very curious man. Caveman make experiment.

HR Occurrences Measure
55 I  
56    
57 I  
58 I  
59    
60 IIIII  
61 I  
62 II  
63 IIIIIII  
64 IIIIIIII 🡄 mean - std. dev.
65 IIIII  
66 IIIIII  
67 IIIII  
68 IIII 🡄 mean
69 IIIIIIIIIIII 🡄 median
70 IIIIIIIII  
71 IIIIIIII  
72 IIIIIIIIIIIII 🡄 mode
73 IIIIIII 🡄 mean + std. dev.
74 I  
75 IIIIII  
76 II  
77    
78    
79 I  
80    
81 I  

Caveman make quick mountain side maths. How many sample lie inside \(\bar{x} \pm s\)?

print('%.0f %%' % (100 * sum(count for (xi, count)
          in Counter(resting_hr).items()
          if 63 < xi < 74
      ) / len(resting_hr)))
73 %
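
The bounds 63 and 74 in the check above were worked out by hand from \(\bar{x} \pm s\). A sketch that derives them from the data instead is shown below, applied to the two small sleep data sets to keep it short. The name `share_within_one_s` is made up for this example.

```python
import math

def mean(x):
    return sum(x) / len(x)

def s(x):
    "Sample standard deviation (n - 1 divisor)."
    x_bar = mean(x)
    return math.sqrt(sum((xi - x_bar)**2 for xi in x) / (len(x) - 1))

def share_within_one_s(x):
    "Fraction of samples strictly inside mean - s .. mean + s."
    x_bar, spread = mean(x), s(x)
    inside = sum(1 for xi in x if x_bar - spread < xi < x_bar + spread)
    return inside / len(x)

hunting_sleep = [8, 7, 9, 8, 7, 8, 10, 7]
fighting_sleep = [3, 13, 8, 1, 19, 8, 2, 10]

print('%.0f %%' % (100 * share_within_one_s(hunting_sleep)))
print('%.0f %%' % (100 * share_within_one_s(fighting_sleep)))
```

On both small data sets, too, well over half of the samples fall inside one standard deviation of the mean.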

Caveman make mental note: many sample closer to mean than average distance to mean.