# Mathematical Statistics: Reading Notes

I have a book on statistics^{1}^{1} *Matematisk Statistik*; Kerstin Vännman;
Studentlitteratur; 1990. which I have inherited from my father^{2}^{2} Or my
mother. Based on the handwriting, they have both made notes in the margin.. It
is one of those old books that were very obviously written on a typewriter, with
blank spaces left where all the weird symbols and graphs were drawn in by hand.

This book is a *really, really good intro to statistics*. It is incredibly
clear, systematic and well-illustrated. I have still not found a statistics text
of comparable usability. Here are some things I have read in it that I think
are important.

# Typographical and Notational Conventions

There’s a clear distinction throughout the book between what’s “real” and what’s “theoretical”. The theoretical is a lot about what we expect out of random processes, and the “real” is about the data points actually measured. There are some correspondences between the two that help with understanding formulas later in the book.

For example, a *stochastic variable* is the name we give a theoretical amount
that gets its value from a random process (or at least is modeled that way.)
These are written with Greek letters: \(\xi\), \(\eta\), \(\zeta\). When we have
performed an experiment and measured an actual value for this variable, we call
that an *observation* and denote it with corresponding Latin letters: \(x\), \(y\),
\(z\).

I have tried to express these correspondences more succinctly in the table below.

Description | Real | Theoretical |
---|---|---|
Value is called | Observation | Stochastic variable |
Variable uses symbols | \(x\), \(y\), \(z\), … | \(\xi\), \(\eta\), \(\zeta\), … |
Average result | Arithmetic mean \(\bar{x}\) | Expected value \(\mu = \mathrm{E}(\xi)\) |
Dispersion (average difference between observed and mean) | Observed std. dev. \(s\), observed variance \(s^2\) | Standard deviation \(\sigma = \sqrt{\mathrm{V}(\xi)}\), variance \(\mathrm{V}(\xi) = \mathrm{E}((\xi - \mu)^2)\) |
Point estimates | \(\theta_{\mathrm{obs}}^*\) | \(\theta^*\) |

In this case, \(\theta\) stands for whatever parameter we are estimating – in practice, this is often the expected value or the standard deviation.

# Common Distributions

## Discrete

- Binomial distribution
  - If you have an experiment with a yes-or-no outcome, where the result we are counting happens with the same probability \(p\) every time, and you repeat that experiment \(n\) times independently, the number of times you get that result is \(\mathrm{Bin}(n, p)\).
- Hypergeometric distribution
  - \(\mathrm{Hyp}(N, n, p)\) essentially represents a situation similar to the binomial case, except \(p\) represents the *initial* probability, and the probability is adjusted after each experiment depending on \(N\) and the outcome. (Think: drawing from an urn without returning the drawn objects.)
- Poisson distribution
  - In a situation where independent events happen at random intervals, the actual number of events in a time period is \(\mathrm{Po}(\lambda)\), where \(\lambda\) represents the average number of events in that time period.
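To get a feel for the binomial case, here is a minimal simulation sketch (my addition, not from the book) that builds \(\mathrm{Bin}(n, p)\) out of repeated yes-or-no trials; the parameters \(n = 20\), \(p = 0.3\) and the seed are arbitrary choices of mine:

```python
import random

def binomial_sample(n, p, rng):
    """Count successes in n independent yes/no trials, each succeeding
    with probability p -- i.e. one draw from Bin(n, p)."""
    return sum(1 for _ in range(n) if rng.random() < p)

rng = random.Random(42)  # fixed seed so the sketch is reproducible
draws = [binomial_sample(20, 0.3, rng) for _ in range(10_000)]

# The sample mean should land close to the theoretical mean n*p = 6.
print(sum(draws) / len(draws))
```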

## Continuous

- Exponential distribution
  - Similar to the Poisson distribution, \(\mathrm{Exp}(\lambda)\) is the time you have to wait between independent events that happen at random intervals.
- Normal distribution
  - Probably the most well-known continuous distribution, which gets its own chapter in the book. There are so many situations that can be modeled with the normal distribution that I’m not even going to try to give an example. Notation: \(\mathrm{N}(\mu, \sigma)\), where \(\mu\) is the expected value and \(\sigma\) is the standard deviation.

# Normal Distribution

The cumulative distribution function for the “standard” normal distribution \(\mathrm{N}(0,1)\) is notated \(\Phi\) and its values can be found in tables. To figure out the probability of getting at most a measurement \(x\) for some normal distribution, i.e. \[\mathrm{P}(\xi \le x \mid \xi \in \mathrm{N}(\mu, \sigma))\] you’ll have to look up \[\Phi\left(\frac{x-\mu}{\sigma}\right)\] in a cdf table for \(\mathrm{N}(0,1)\).
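These days we can skip the table; as a sketch (my addition, not the book's), Python's standard-library `statistics.NormalDist` does exactly this standardization for us. The numbers \(\mu = 30\), \(\sigma = 5\), \(x = 27\) are made up for illustration:

```python
from statistics import NormalDist

mu, sigma, x = 30, 5, 27

# Standardize and look up Phi((x - mu) / sigma), as with a printed table ...
z = (x - mu) / sigma
via_table = NormalDist(0, 1).cdf(z)

# ... which agrees with asking for the cdf of N(mu, sigma) directly.
direct = NormalDist(mu, sigma).cdf(x)

print(via_table, direct)
```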

Even if we draw samples from a distribution that is not normal, we can in many cases approximate the distribution as normal. For example, the following distributions can be approximated as normal given that the condition is met.

Distribution | Condition | Approximation | Substitution |
---|---|---|---|
\(\mathrm{Bin}(n, p)\) | \(np(1-p) > 10\) | \(\mathrm{N}(\mu, \sigma)\) | \(\mu=np\), \(\sigma=\sqrt{np(1-p)}\) |
\(\mathrm{Po}(\lambda)\) | \(\lambda > 15\) | \(\mathrm{N}(\mu, \sigma)\) | \(\mu=\lambda\), \(\sigma=\sqrt{\lambda}\) |
\(\mathrm{Bin}(n, p)\) | \(n > 10\), \(p < 0.1\) | \(\mathrm{Po}(\lambda)\) | \(\lambda=np\) |
\(\mathrm{Hyp}(N, n, p)\) | \(n/N < 0.1\) | \(\mathrm{Bin}(n, p)\) | same \(n\) and \(p\) |
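As a quick sanity check of the binomial-to-normal row (my own sketch, not from the book): for \(\mathrm{Bin}(50, 0.4)\) we have \(np(1-p) = 12 > 10\), so the normal approximation should land close to the exact binomial cdf. The values \(n = 50\), \(p = 0.4\), \(k = 22\) are arbitrary:

```python
from math import comb
from statistics import NormalDist

n, p, k = 50, 0.4, 22

# Exact binomial cdf: P(xi <= k) for xi in Bin(n, p).
exact = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Normal approximation with mu = n*p and sigma = sqrt(n*p*(1-p)).
approx = NormalDist(n * p, (n * p * (1 - p)) ** 0.5).cdf(k)

print(exact, approx)
```

A continuity correction (evaluating the normal cdf at \(k + 0.5\)) tightens the agreement further, but the rule of thumb in the table doesn't require it.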

When we have \(n\) independent stochastic variables from any distribution with expected value \(\mu\) and standard deviation \(\sigma\), the central limit theorem says that we can approximate their sum as \(\mathrm{N}(n\mu, \sigma\sqrt{n})\). Neat!
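A small simulation sketch of this (my addition): summing uniform variables on \([0, 1]\), which have \(\mu = 1/2\) and \(\sigma = 1/\sqrt{12}\). The choices \(n = 48\) terms per sum and the seed are arbitrary:

```python
import random
from statistics import mean, stdev

rng = random.Random(1)
n = 48  # terms per sum

# Draw many sums of n independent Uniform(0, 1) variables.
sums = [sum(rng.random() for _ in range(n)) for _ in range(20_000)]

# The CLT predicts these sums are approximately N(n*mu, sigma*sqrt(n)).
mu, sigma = 0.5, (1 / 12) ** 0.5
print(mean(sums), n * mu)             # should be close to 24
print(stdev(sums), sigma * n ** 0.5)  # should be close to 2
```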

# Interval Estimation

If we run an experiment a bunch of times and get a bunch of numbers as a result,
and we want to know what the mean outcome of the experiment is, we can perform
an *interval estimation*. This is where we say, “If we repeated this whole
procedure many times, the interval we compute would cover the true value in 99%
of the cases.”

We’ll run with a silly example. Let’s pretend that I have timed myself writing
articles for this blog^{3}^{3} Hey, I should start doing that. Needless to say, I
have not, and these numbers come from the depths of my bottom., and these are
the times it took to write 40 different articles.

19 | 32 | 31 | 28 | 24 | 21 | 21 | 33 | 55 | 14 | 29 | 44 | 16 | 16 | 30 | 26 | 31 | 24 | 28 | 39 |

23 | 47 | 25 | 17 | 43 | 15 | 27 | 21 | 37 | 45 | 21 | 26 | 31 | 16 | 17 | 36 | 24 | 27 | 17 | 20 |

Using only this data, we want to figure out a reasonable expectation for the
mean value of this process. Let’s say “reasonable” in this case means 95 %
confidence interval. To compute a confidence interval estimation, we need a
*point estimation* of the expected value and standard deviation.

We know that we can estimate the expected value through

\[\mu^*_{\mathrm{obs}} = \bar{x} \approx 27.4\]

We calculate a point estimation of the standard deviation as well, by first crunching the squared distances from 27.4 (dividing by \(n - 1 = 39\) rather than \(n = 40\), which makes the variance estimation unbiased), and then taking the square root of that.

\[\left(\sigma^*_{\mathrm{obs}}\right)^2 = s^2 = \frac{\sum_i (27.4 - x_i)^2}{39}\]

which yields \(s \approx 9.8\).
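Both point estimations can be double-checked in a few lines of Python (my sketch, not the book's); note that `statistics.stdev` is the sample standard deviation with the \(n - 1\) denominator:

```python
from statistics import mean, stdev

# The 40 writing times from the table above.
times = [19, 32, 31, 28, 24, 21, 21, 33, 55, 14, 29, 44, 16, 16, 30, 26, 31, 24, 28, 39,
         23, 47, 25, 17, 43, 15, 27, 21, 37, 45, 21, 26, 31, 16, 17, 36, 24, 27, 17, 20]

x_bar = mean(times)  # arithmetic mean: the point estimation of mu
s = stdev(times)     # sample standard deviation: the point estimation of sigma

print(x_bar, s)
```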

We could say “The expected value is \(27.4 \pm 9.8\)!” but that is a very
imprecise estimation^{4}^{4} That’s basically running the experiment a bunch of
times and saying “I know! The true value can be any of these values *we actually
received*”. Well, of course it can. and we can do better.

In order to convert this into an estimation with a 95 % confidence, we use a \(t\) distribution table. We look up

- the row for \(\nu=39\), because the number of degrees of freedom is one less than the number of samples in the observation; and
- the column of our desired confidence, which is going to be 0.975. We are calculating with \(1-\alpha/2 = 1-0.025\) because the confidence interval is two-sided.

Unless I’m reading things wrong, the table tells us that

\[t^* = t_{0.975,39} = 2.021.\]

Finally, the 95% confidence interval estimation is then computed as

\[\mu^*_{\mathrm{obs}} \pm t^* \frac{\sigma^*_{\mathrm{obs}}}{\sqrt{n}}\]

or, in our case,

\[27.4 \pm 2.021 \frac{9.8}{6.32} \approx 27.4 \pm 3.13\]

So there we go! With 95 % confidence, the expected value lies between 24.3 and 30.5.
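The whole final computation fits in a few lines. A sketch (my addition); the \(t\) value has to come from a table lookup like the one above, since the Python standard library has no \(t\) distribution:

```python
from math import sqrt

n = 40          # number of observations
x_bar = 27.4    # point estimation of the expected value
s = 9.8         # point estimation of the standard deviation
t_star = 2.021  # t value for 39 degrees of freedom, from a t table

# Two-sided 95 % confidence interval for the expected value.
half_width = t_star * s / sqrt(n)
lower, upper = x_bar - half_width, x_bar + half_width

print(f"{lower:.1f} to {upper:.1f}")
```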