# Reading Notes: The Elements of Statistical Learning

I have an endless fascination for statistics and machine learning and all that stuff. I’m also not very good at it. In order to truly, actually learn something, I need to wrestle with the fundamentals of it for a while.

So here we go. My gracious boss lended me his personal copy of esl^{1}^{1} *The
Elements of Statistical Learning; Data Mining, Inference, and Prediction*;
Hastie, Tibshirani & Friedman; Springer; 2001. and I’ll slowly try to work my
way through at least the most interesting parts of it.

# Data

# Table of Contents

I’m not sure yet what data I’ll use to support my explorations, but there are a few things I’ve been interested in finding out. One of them relates to the distribution of “likes” on the photos posted by the family’s dogs on Instagram.

I have manually combed through the first few photos and sort of assigned yes/no values to some categories I thought may be relevant, relating to the contents of each photo.

library(tidyverse) monsters <- read_delim( "https://two-wrongs.com/src/aux/belovedmonsters.csv", delim="," )

They’re surprisingly popular, actually!

summary(monsters$Likes)

Min. 1st Qu. Median Mean 3rd Qu. Max. 62.0 86.5 103.0 100.2 112.5 139.0

# Notation

They use the following notation in the book:

- Input variables are written as uppercase \(X\). If the input variable is a vector, its components (the predictor values) can be accessed by subscript \(X_j\).
- Actual
*observations*of input variables are written in lowercase, so \(x\) is an observation of \(X\). Often we have more than one observation, and we refer to a specific one as \(x_i\). - The number of inputs is \(N\), so \(0 < i \gte N\), and the number of predictors is \(p\), so \(0 < j \gte p\).
- The \(N \times p\) matrix \(\mathbf{X}\) is a set of \(N\) input \(p\) vectors \(x_i\).
- Bold lettering is also used for the $N$-vector of all observations on the variable (predictor) \(X_j\), which is written \(\mathbf{x}_j\).
- Quantitative (continuous) output variables are written \(Y\) (and observations of these, lowercase \(y\).)
- Qualitative output variables (categories) are written \(G\) (and observations of these, lowercase \(g\).)

# Statistical Decision Theory

What is a good prediction? We want to minimise some sort of error. The “*by far
most common and convenient*” loss function to minimise is the *squared error
loss*, \((Y - f(X))^2\), where \(f\) is our prediction function. This function
leads to an expected prediction error that depends on the probability
distributions of \(X\) and \(Y\), expressed in the book as^{2}^{2} I’m writing this
down because it’s actually a rather intuitive expression for me

\[ \mathrm{EPE}(f) = \int (y - f(x))^2 \mathrm{Pr}(dx, dy) \]

The authors solve for \(f\) and arrive at

\[ f(x) = E(Y \middle| X = x) \]

which is a neat result because it’s so obvious. It’s saying that the optimal prediction function will predict the expected value of \(Y\) given the observation \(x\). Of course it will!

We can use the results we just arrived at to invent an algorithm.

## Simple Model 1: Expectancy Approximation

We don’t know the expected value of \(Y\), unfortunately.^{3}^{3} If we did, we
wouldn’t need to bother with all this statistical learning stuff. However,
the expected value of \(Y\) when \(X=x\) can be approximated by looking in our
training data at the average of \(y_i\), where \(i\) is such that \(x_i = x\).

We’ll try this model on the `monsters`

data, but at first only concerning
ourselves with the first two predictors. How well can they predict the number
of likes?

limited <- monsters %>% select(Likes, Lucy, Caspian)

To approximate the expected value of `Likes`

given information on the presence
of either dog in the data, we need to compute the average value of the output
for each location in predictor space (i.e. for each existing combination of
predictor values).

limited %>% group_by(Lucy, Caspian) %>% summarise(avg=mean(Likes))

0 1 102.037037037037 1 0 98.5 1 1 99.7142857142857

Is this a good model? To figure that out,

## Simple Model 2: K-Nearest Neighbours

## Simple Model 3: Linear Least Squares

# Model Formalism

Before we can start working with this in a sensible way in R, we need to move past our informal, handwavy descriptions of models, and start writing them as proper R code.

We will say that a model is a function that takes some data, and returns an object that can later be used to predict new values from input data. Here’s an example using the simple model from above.

universal <- function(formula, data) { inputs <- labels(terms(formula)) structure(list( inputs = inputs, trained = data %>% select(all.vars(formula)) %>% mutate_(output=all.vars(formula)[1]) %>% group_by_(.dots=inputs) %>% summarise(mean=mean(output)) ), class = "universalClass") } predict.universalClass <- function(model, data) { data %>% select(model$inputs) %>% mutate(output=filter(model$trained, )) mutate_(output=) }

## Model Evaluation: K-fold Cross Validation

We spoke before about the loss function, and we put it in mathematical terms. I’d also like to write it down in R, because it’s going to be useful to us. The squared error loss is trivially implemented as

L <- function(real, predicted) (real - predicted)^2

This is very little data, and I’m going to be reusing it a bunch of times for
different experiments. This is a big no-no to the scientist in me ^{4}^{4} Every
time I reuse the data, the probability of me encountering a
significant-looking result purely by random chance increases., so I’m going
to use a technique called *k-fold cross validation*. I’m not yet sure why it
works ^{5}^{5} It looks like I might learn that later in the book! but it is an
established technique.

We have

nrow(monsters)

75

rows in our data set. We want to group this data into smaller sets of \(k\)
rows, where each set should still probably, maybe, look like the larger set.
We’ll pick a convenient \(k=3\), giving us three groups, called *folds*, of 25 observations in
each.

This function was a bit tricky to write, primarily because we want to do it in a way that was as natural to R as possible with my knowledge of the language.

We randomise the order of the \(N\) indices of `data`

and split them up into a
matrix of \(k\) columns. In other words, each column in the `folds`

matrix is a
fold, i.e. a $k$-sized sub-group of the data available. To evaluate each
fold, we compute a boolean mask for whether or not an element is part of that
fold, and then use it as a training or testing sample depending on its
status.

From the training samples, we try to predict values of the testing samples,
and then we compute the *loss* associated with the prediction in relation to
the real values.

kfold <- function(data, k, model, predict, real, loss=L) { N <- nrow(data) folds <- matrix(sample(1:N), ncol=k) evaluate <- function (fold) { mask <- 1:N %in% fold training <- data[!mask,] testing <- data[mask,] predicted <- predict(model(training), testing) loss(predicted, real(testing)) } errors <- apply(folds, 2, evaluate) mean(errors) }

You’ll see that this function takes four parameters other than the data and value for \(k\) we want to use. In order, they are these:

`model`

is a function that takes some data and returns a model that is trained on the given data, e.g.`~~function(data) mean(data$Likes)~~`

.`predict`

is a function that takes a trained model and some testing inputs, and – unsurprisingly – returns predicted values for that training data. It might look like`function(model, data) model + 5`

.`real`

is a function that takes some data and extracts the real outputs from it, e.g.`function(data) data$Likes`

.`loss`

is a function that takes a prediction and the corresponding real values, and computes some sort of penalty value for errors in the prediction. The squared error loss is so common we’ll use it as the default here.

Given this, we can *finally* evaluate our