Two Wrongs

Reading Notes: The Elements of Statistical Learning

Reading Notes: The Elements of Statistical Learning

I have an endless fascination for statistics and machine learning and all that stuff. I’m also not very good at it. In order to truly, actually learn something, I need to wrestle with the fundamentals of it for a while.

So here we go. My gracious boss lended me his personal copy of esl11 The Elements of Statistical Learning; Data Mining, Inference, and Prediction; Hastie, Tibshirani & Friedman; Springer; 2001. and I’ll slowly try to work my way through at least the most interesting parts of it.


I’m not sure yet what data I’ll use to support my explorations, but there are a few things I’ve been interested in finding out. One of them relates to the distribution of “likes” on the photos posted by the family’s dogs on Instagram.

I have manually combed through the first few photos and sort of assigned yes/no values to some categories I thought may be relevant, relating to the contents of each photo.


monsters <- read_delim(

They’re surprisingly popular, actually!

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
62.0    86.5   103.0   100.2   112.5   139.0


They use the following notation in the book:

  • Input variables are written as uppercase \(X\). If the input variable is a vector, its components (the predictor values) can be accessed by subscript \(X_j\).
  • Actual observations of input variables are written in lowercase, so \(x\) is an observation of \(X\). Often we have more than one observation, and we refer to a specific one as \(x_i\).
  • The number of inputs is \(N\), so \(0 < i \gte N\), and the number of predictors is \(p\), so \(0 < j \gte p\).
  • The \(N \times p\) matrix \(\mathbf{X}\) is a set of \(N\) input \(p\) vectors \(x_i\).
  • Bold lettering is also used for the $N$-vector of all observations on the variable (predictor) \(X_j\), which is written \(\mathbf{x}_j\).
  • Quantitative (continuous) output variables are written \(Y\) (and observations of these, lowercase \(y\).)
  • Qualitative output variables (categories) are written \(G\) (and observations of these, lowercase \(g\).)

Statistical Decision Theory

What is a good prediction? We want to minimise some sort of error. The “by far most common and convenient” loss function to minimise is the squared error loss, \((Y - f(X))^2\), where \(f\) is our prediction function. This function leads to an expected prediction error that depends on the probability distributions of \(X\) and \(Y\), expressed in the book as22 I’m writing this down because it’s actually a rather intuitive expression for me

\[ \mathrm{EPE}(f) = \int (y - f(x))^2 \mathrm{Pr}(dx, dy) \]

The authors solve for \(f\) and arrive at

\[ f(x) = E(Y \middle| X = x) \]

which is a neat result because it’s so obvious. It’s saying that the optimal prediction function will predict the expected value of \(Y\) given the observation \(x\). Of course it will!

We can use the results we just arrived at to invent an algorithm.

Simple Model 1: Expectancy Approximation

We don’t know the expected value of \(Y\), unfortunately.33 If we did, we wouldn’t need to bother with all this statistical learning stuff. However, the expected value of \(Y\) when \(X=x\) can be approximated by looking in our training data at the average of \(y_i\), where \(i\) is such that \(x_i = x\).

We’ll try this model on the monsters data, but at first only concerning ourselves with the first two predictors. How well can they predict the number of likes?

limited <- monsters %>% select(Likes, Lucy, Caspian)

To approximate the expected value of Likes given information on the presence of either dog in the data, we need to compute the average value of the output for each location in predictor space (i.e. for each existing combination of predictor values).

limited %>%
    group_by(Lucy, Caspian) %>%
0	1	102.037037037037
1	0	98.5
1	1	99.7142857142857

Is this a good model? To figure that out,

Simple Model 2: K-Nearest Neighbours

Simple Model 3: Linear Least Squares

Model Formalism

Before we can start working with this in a sensible way in R, we need to move past our informal, handwavy descriptions of models, and start writing them as proper R code.

We will say that a model is a function that takes some data, and returns an object that can later be used to predict new values from input data. Here’s an example using the simple model from above.

universal <- function(formula, data) {
    inputs <- labels(terms(formula))
        inputs = inputs,
        trained = data %>%
            select(all.vars(formula)) %>%
            mutate_(output=all.vars(formula)[1]) %>%
            group_by_(.dots=inputs) %>%
    ), class = "universalClass")

predict.universalClass <- function(model, data) {
    data %>%
        select(model$inputs) %>%
        mutate(output=filter(model$trained, ))


Model Evaluation: K-fold Cross Validation

We spoke before about the loss function, and we put it in mathematical terms. I’d also like to write it down in R, because it’s going to be useful to us. The squared error loss is trivially implemented as

L <- function(real, predicted) (real - predicted)^2

This is very little data, and I’m going to be reusing it a bunch of times for different experiments. This is a big no-no to the scientist in me 44 Every time I reuse the data, the probability of me encountering a significant-looking result purely by random chance increases., so I’m going to use a technique called k-fold cross validation. I’m not yet sure why it works 55 It looks like I might learn that later in the book! but it is an established technique.

We have


rows in our data set. We want to group this data into smaller sets of \(k\) rows, where each set should still probably, maybe, look like the larger set. We’ll pick a convenient \(k=3\), giving us three groups, called folds, of 25 observations in each.

This function was a bit tricky to write, primarily because we want to do it in a way that was as natural to R as possible with my knowledge of the language.

We randomise the order of the \(N\) indices of data and split them up into a matrix of \(k\) columns. In other words, each column in the folds matrix is a fold, i.e. a $k$-sized sub-group of the data available. To evaluate each fold, we compute a boolean mask for whether or not an element is part of that fold, and then use it as a training or testing sample depending on its status.

From the training samples, we try to predict values of the testing samples, and then we compute the loss associated with the prediction in relation to the real values.

kfold <- function(data, k, model, predict, real, loss=L) {
    N <- nrow(data)
    folds <- matrix(sample(1:N), ncol=k)
    evaluate <- function (fold) {
        mask <- 1:N %in% fold
        training <- data[!mask,]
        testing <- data[mask,]
        predicted <- predict(model(training), testing)
        loss(predicted, real(testing))
    errors <- apply(folds, 2, evaluate)

You’ll see that this function takes four parameters other than the data and value for \(k\) we want to use. In order, they are these:

  • model is a function that takes some data and returns a model that is trained on the given data, e.g. ~~function(data) mean(data$Likes)~~.
  • predict is a function that takes a trained model and some testing inputs, and – unsurprisingly – returns predicted values for that training data. It might look like function(model, data) model + 5.
  • real is a function that takes some data and extracts the real outputs from it, e.g. function(data) data$Likes.
  • loss is a function that takes a prediction and the corresponding real values, and computes some sort of penalty value for errors in the prediction. The squared error loss is so common we’ll use it as the default here.

Given this, we can finally evaluate our