# Reading Notes: The Elements of Statistical Learning

I have an endless fascination for statistics and machine learning and all that stuff. I’m also not very good at it. In order to truly, actually learn something, I need to wrestle with the fundamentals of it for a while.

So here we go. My gracious boss lent me his personal copy of *The Elements of Statistical Learning: Data Mining, Inference, and Prediction* (Hastie, Tibshirani & Friedman; Springer; 2001), and I’ll slowly try to work my way through at least the most interesting parts of it.

# Data

I’m not sure yet what data I’ll use to support my explorations, but there are a few things I’ve been interested in finding out. One of them relates to the distribution of “likes” on the photos posted by the family’s dogs on Instagram.

I have manually combed through the first few photos and sort of assigned yes/no values to some categories I thought might be relevant, relating to the contents of each photo.

    library(tidyverse)

    # Load the hand-labelled photo data into `monsters`
    monsters <- read_delim(
      "https://two-wrongs.com/src/aux/belovedmonsters.csv",
      delim=","
    )


They’re surprisingly popular, actually!

    mutate(output=filter(model$trained, )) mutate_(output=) }

## Model Evaluation: K-fold Cross Validation

We spoke before about the loss function, and we put it in mathematical terms. I’d also like to write it down in R, because it’s going to be useful to us. The squared error loss is trivially implemented as

    L <- function(real, predicted) (real - predicted)^2

This is very little data, and I’m going to be reusing it a bunch of times for different experiments. This is a big no-no to the scientist in me (every time I reuse the data, the probability of me encountering a significant-looking result purely by random chance increases), so I’m going to use a technique called k-fold cross validation. I’m not yet sure why it works (it looks like I might learn that later in the book!) but it is an established technique.

Calling `nrow(monsters)` tells us we have 75 rows in our data set. We want to split this data into $$k$$ smaller sets, where each set should still probably, maybe, look like the larger set. We’ll pick a convenient $$k=3$$, giving us three groups, called folds, of 25 observations each.

The function below was a bit tricky to write, primarily because I wanted to do it in a way that is as natural to R as possible given my knowledge of the language. We randomise the order of the $$N$$ indices of the data and split them up into a matrix of $$k$$ columns. In other words, each column in the `folds` matrix is a fold, i.e. a sub-group of $$N/k$$ of the available observations.

To evaluate each fold, we compute a boolean mask for whether or not an element is part of that fold, and then use the element as a testing sample if it is and as a training sample if it is not. From the training samples, we try to predict the values of the testing samples, and then we compute the loss associated with the prediction in relation to the real values.
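The fold construction described above can be sketched on a toy example. This is only an illustration: the values `N = 6` and `k = 3` and the call to `set.seed` are assumptions picked to keep the output small and repeatable, and have nothing to do with the dog data.

```r
# Toy illustration of the fold matrix and the boolean mask.
# N = 6 and k = 3 are made-up values, chosen so that N is divisible by k.
set.seed(1)  # assumed seed, only to make the shuffle repeatable
N <- 6
k <- 3

# Shuffle the indices 1..N and lay them out in k columns; each column is a fold.
folds <- matrix(sample(1:N), ncol = k)
print(folds)

# Mask for the first fold: TRUE where a row belongs to the fold (testing),
# FALSE where it does not (training).
mask <- 1:N %in% folds[, 1]
print(mask)

# Each fold holds N / k indices, and every index lands in exactly one fold.
print(dim(folds))      # 2 rows, 3 columns
print(sort(c(folds)))  # 1 2 3 4 5 6
```

Note that `matrix` only splits the indices evenly when $$N$$ is divisible by $$k$$, which holds for 75 rows and three folds.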

    kfold <- function(data, k, model, predict, real, loss=L) {
      N <- nrow(data)
      # Shuffle the row indices and arrange them into k columns, one per fold
      # (this assumes N is divisible by k)
      folds <- matrix(sample(1:N), ncol=k)
      evaluate <- function (fold) {
        # TRUE for rows in this fold (testing), FALSE for the rest (training)
        mask <- 1:N %in% fold
        training <- data[!mask,]
        testing <- data[mask,]
        predicted <- predict(model(training), testing)
        loss(predicted, real(testing))
      }
      # Evaluate every fold (column) and average the losses
      errors <- apply(folds, 2, evaluate)
      mean(errors)
    }

You’ll see that this function takes four parameters other than the data and the value for $$k$$ we want to use. In order, they are these:

• `model` is a function that takes some data and returns a model that is trained on the given data, e.g. `function(data) mean(data$Likes)`.
• `predict` is a function that takes a trained model and some testing inputs, and – unsurprisingly – returns predicted values for that testing data. It might look like `function(model, data) model + 5`.
• `real` is a function that takes some data and extracts the real outputs from it, e.g. `function(data) data$Likes`.
• `loss` is a function that takes a prediction and the corresponding real values, and computes some sort of penalty value for errors in the prediction. The squared error loss is so common we’ll use it as the default here.
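
To see how the four function parameters fit together, here is a minimal usage sketch. It is hedged in several ways: the data frame `toy` is a synthetic stand-in (random Poisson counts with an assumed mean of 40), not the real dog data; the seed is arbitrary; and `kfold` is copied from the text with `drop = FALSE` added so that a single-column data frame is not collapsed into a vector when subsetting.

```r
# Squared error loss, as defined in the text.
L <- function(real, predicted) (real - predicted)^2

# kfold as in the text, with drop = FALSE added (an assumption of this sketch)
# so that one-column data frames stay data frames after row subsetting.
kfold <- function(data, k, model, predict, real, loss = L) {
  N <- nrow(data)
  folds <- matrix(sample(1:N), ncol = k)
  evaluate <- function(fold) {
    mask <- 1:N %in% fold
    training <- data[!mask, , drop = FALSE]
    testing <- data[mask, , drop = FALSE]
    predicted <- predict(model(training), testing)
    loss(predicted, real(testing))
  }
  errors <- apply(folds, 2, evaluate)
  mean(errors)
}

# Synthetic stand-in data: 75 made-up like counts, NOT the real data set.
set.seed(42)  # arbitrary seed for repeatability
toy <- data.frame(Likes = rpois(75, lambda = 40))

# The "model" is just the mean of the training likes, and the prediction
# repeats that mean once per testing row.
cv_error <- kfold(
  toy, k = 3,
  model = function(data) mean(data$Likes),
  predict = function(model, data) rep(model, nrow(data)),
  real = function(data) data$Likes
)
print(cv_error)
```

Because the model ignores the inputs entirely, the resulting number is roughly the variance of the like counts, which makes it a useful baseline for anything fancier.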

Given this, we can finally evaluate our