Statistician’s Time Series Hack

The tl;dr of this article is: to model latent common causes without having to list them explicitly, condition on the previous observation.

Let’s dig into what that means.

I’m learning more about models that handle autocorrelation

In a previous article on forecasting sea ice, I was dismissive of statistical models that account for autocorrelation, on the grounds that I don’t know them by heart. That was unfair of me, and I’m making up for it by learning them.

The book I’m reading (Time Series Analysis; Hamilton; Princeton University Press; 1994) opens up with first-order difference equations, which are extremely simple models based on the idea that an observation \(y_t\) depends only on the previous observation \(y_{t-1}\) and some other … stuff unique to this observation. In the notation of the book,

\[y_t = \phi y_{t-1} + x_t\]

The \(x_t\) can simply be random noise, but it can also be built up from a multitude of other contributing causes. For example, this equation governs the 5-minute load indicator in uptime, and in that case \(x_t\) is one variable that rolls up everything that uses CPU on the system.
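To make that concrete, here’s a minimal R sketch that simulates such a first-order difference equation with plain Gaussian noise as the \(x_t\) term. All the numbers are made up for illustration.

  # Simulate y_t = phi * y_{t-1} + x_t, with x_t as plain Gaussian noise.
  # All parameter values here are made up for illustration.
  set.seed(1)
  n   <- 90
  phi <- 0.7
  x   <- rnorm(n, sd = 50)        # the "other stuff" unique to each observation
  y   <- numeric(n)
  y[1] <- x[1]
  for (t in 2:n) y[t] <- phi * y[t - 1] + x[t]
  plot(y, type = "l")             # a wandering, autocorrelated series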

Why would a book on time series open up with a specific kind of time series, namely first-order difference equations? (Aside from the fact that they serve as a useful pedagogical vehicle.)

First-order difference equations actually also show up everywhere. Or rather, if you’re lazy, you can make them show up everywhere and be better off for it. This is due to a funny hack we can apply when analysing data, which I learned somewhere else but haven’t written about before.

Example: solar panels that depend on wind

Let’s say we have covered our roof with solar panels, and we’re not getting as much out of the midday sun as the manufacturer promised. So we’ve paid a great deal of money to switch to panels from another manufacturer, but we have the same problem again.

We start to suspect that maybe when it’s very windy, the branches of a nearby tree sway in the wind such that they partially shade our solar panels. This is a testable hypothesis! We measure wind speed and solar panel output every four days over a year, and then we can plot output against wind speed:

[Figure: scatter plot of solar panel output against wind speed]

Just quickly eyeballing this, it doesn’t seem like wind speed has a significant effect on solar panel output. Indeed, the correlation is a weak -0.11, and the p-value of a linear regression coefficient is 0.29.
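For reference, that naive check could look something like this in R, where the data frame d with columns output and wind is a hypothetical stand-in for our measurements.

  # Naive check: correlation and a plain regression of output on wind speed.
  # `d` is a hypothetical data frame with columns `output` and `wind`.
  cor(d$output, d$wind)                  # roughly -0.11 on our data
  summary(lm(output ~ wind, data = d))   # look at the p-value of the wind coefficient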

In intuitive terms: the data has too much noise to say anything about the relationship between solar panel output and wind speed. We are faced with two tempting choices here:

  1. Declare failure and conclude that wind does not significantly affect solar panel output. This could be a mistake: maybe wind does not have a large effect compared to the noise, but we can’t control the noise, while we can control the effect of wind (by taking down that tree). We shouldn’t be looking for the largest contributors, only the largest contributors we can do something about.
  2. Try to reduce noise by smoothing the data. This is almost always a mistake: smoothing data will always improve correlations, but then the correlation we’re seeing is no longer valid for our data.

But wait, let’s back up a bit.

Noise is just alternative explanations

What is noise, really? It’s causes that contribute to the observation in ways we haven’t controlled for. There are many things that affect solar panel output, like season, precipitation, heat, cloudiness, technology level, and more. Some of these are more important (whether it’s overcast), others less so (heat). But in aggregate, they contribute more to solar panel output than wind does, so their effect hides any effect of wind we might want to learn.

We could try to control for all of these alternative explanations, and then maybe wind would re-emerge as meaningful at that point, but it sounds like an annoying job to hunt down and control for each of them.

What if there’s a better way? Assume that most of those factors have similar effects on adjacent observations. More concretely, if day 8 is overcast, then maybe day 9 is also somewhat overcast, and day 10 a little as well. Taking it further, if day 8 has bad solar output, then we probably shouldn’t expect day 9 to have great solar output, and our expectations for day 10 should also be a little lower.

This is starting to sound like a first-order difference model, because it is!

In other words, we are using the mother of all proxies: control for common factors by including \(y_{t-1}\) as a variable that “affects” \(y_t\). This is the hack they don’t want you to know. I find it very funny and at the same time incredibly useful.

Fitting a first-order difference model to the data

We can formally model this as the output \(y\) at time \(t\) following the relationship

\[y_t = \phi y_{t-1} + r w_t + x_t\]

Here, \(\phi\) is how much the previous observation can explain the current one, \(r\) is how the wind speed \(w_t\) affects the output, and \(x_t\) is a noise term that accounts for all variation not explained by the previous measurement and wind speed.

By making somewhat unfounded assumptions about distributions and biases, we can fit this with maximum likelihood estimation, where we just try a bunch of parameters and see which ones make our observations most likely. If we do this on our solar panel data, we get the parameters (actually, there’s a third parameter: the \(\sigma^2\) used for the \(x_t\) term, which is otherwise assumed unbiased; but that’s not particularly interesting at this point other than as a sanity check):

\[\phi = 0.7\] \[r = -30\]

The part we care about here is that it tells us each unit of additional wind speed seems to, on average, reduce solar panel output by 30 units.
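For the curious, here’s a minimal sketch of what such a conditional maximum likelihood fit could look like in R. The vectors output and wind are hypothetical stand-ins for the measured series, and the starting values are arbitrary.

  # Conditional maximum likelihood for y_t = phi * y_{t-1} + r * w_t + x_t,
  # assuming x_t ~ N(0, sigma^2). `output` and `wind` are hypothetical vectors.
  negloglik <- function(par, output, wind) {
    phi <- par[1]; r <- par[2]; sigma <- exp(par[3])   # log-sigma keeps sigma positive
    n <- length(output)
    resid <- output[-1] - phi * output[-n] - r * wind[-1]
    -sum(dnorm(resid, mean = 0, sd = sigma, log = TRUE))
  }

  fit <- optim(c(phi = 0.5, r = 0, logsigma = 0), negloglik,
               output = output, wind = wind)
  fit$par   # estimates of phi, r and log(sigma)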

I like linear regression

But on the other hand, we’re only doing this to find out the value of \(r\). We don’t need the fully fitted first-order difference model. We can get by with something much simpler: linear regression. We ask R for a line that fits against two explanatory variables:

  • output lagged by one, and
  • wind speed.
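In R, that could look something like the sketch below, again with a hypothetical data frame d holding the output and wind columns.

  # Regress output on its own lagged value and on wind speed.
  # `d` is a hypothetical data frame with columns `output` and `wind`.
  n <- nrow(d)
  lagged <- data.frame(output     = d$output[-1],
                       lag_output = d$output[-n],
                       wind       = d$wind[-1])
  summary(lm(output ~ lag_output + wind, data = lagged))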

It helpfully tells us what we already knew:

                  Coefficient   p-value
  Wind speed          -27       0.0098
  Lagged output         0.69    0

In other words, wind speed has an effect of roughly -30, and the coefficient is Wald significant at the 1 % level. But it’s nice to see it confirmed through another method.

Here’s how well the model fits the data it was trained on:

[Figure: the model’s fitted values plotted over the observed solar panel output across the year]

This looks great! But that shouldn’t come as a surprise. Due to the nature of the model, the predicted value at time \(t+1\) will be a fantastic match for what happened at \(t\). When we look at a plot like this, our eyes fool us into thinking both lines are very close, when in fact they are constantly separated by one unit of time. (That said, the model does follow the general curve over the year, which is not something to sneer at. It points to the power of the “same as yesterday” forecast.)

As much as I’m in favour of looking at plots of the real data, in this case, it would be more meaningful to look at a plot of the deviations between model and data, to get a better sense for the accuracy. We’ll get to that.

Next step is verification

Eyeball statistics aside, all we have speaking for the validity of our model is a significant p-value of the linear regression parameter. That is not proof of anything. The fit does improve when we add wind as a factor, but this could just as well be due to overfitting. The real test of any model is cross-validation evaluated against the null model, which excludes the effect of wind, to see which model has better predictive power.
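One way to run that comparison is sketched below, using the same hypothetical lagged data frame as before. For brevity it uses a single time-ordered train/test split rather than full cross-validation, and the split point is arbitrary.

  # Fit both models on an initial stretch of the series, then compare
  # squared one-step-ahead prediction error on the held-out observations.
  train <- lagged[1:70, ]
  test  <- lagged[-(1:70), ]

  null_fit <- lm(output ~ lag_output,        data = train)
  wind_fit <- lm(output ~ lag_output + wind, data = train)

  mean((test$output - predict(null_fit, newdata = test))^2)   # null model
  mean((test$output - predict(wind_fit, newdata = test))^2)   # wind-inclusive model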

It turns out the model that accounts for wind achieves slightly but consistently worse predictive power, with a squared error about 6 % higher. To get a sense of why, let’s plot the deviations between the real data and the prediction. The blue points are predictions from the null model (the one that does not account for wind) when predicting against a subset of the data it was not trained on.

[Figure: deviations between the held-out data and the null model’s predictions]

As we can imagine, the error is smaller during the times when there were fewer extreme movements in the data. Anyway, this was a plot of the null model, i.e. the one that does not account for wind. Try to guess what happens when we add red dots for the model that does.

[Figure: the same deviations plot, with red points added for the wind-inclusive model]

It may be easy to miss if you don’t know what to look for, but the red points are generally more widely scattered than the blue ones, without being noticeably closer to the midpoint. This confirms our suspicion: the model that includes wind has added more noise without a meaningful improvement of accuracy. In other words, it is overfit.

Predictive power is not size of contribution

At this point, we have one measure (the Wald test on the linear model coefficients) that says wind has a significant effect of about 30, and another that says wind lacks predictive power (cross-validation of the same). What’s the answer? Remember the discussion from before: what we really want to know is what improvement we would see in output if we could remove the effect of wind (by cutting down a tree). So what matters here is not the predictive power of the wind-inclusive model, but rather what the effect of wind is when we include it in the model.

How close is 30 to the true value of the effect of wind?

Fortunately for us, the data we have used in this article is artificially constructed, so we can recover the exact effect of wind in this case: on average across the data set it’s -36 output per unit of wind speed. (Is -36 large enough to justify chopping down a tree? That depends on the economics of the situation, and that question is outside the scope of this analysis.)

Fairly good for a lazy person’s hack!

Addendum: Yes, this is autoregression

A few readers have expressed surprise at this article, saying, “Aren’t we just talking about an autoregressive model?”

Yes. But the way I’ve found autoregressive models explained most often is that they should be employed when there exists some causal mechanism that prevents the value at time \(t\) from diverging very much from the value at time \(t-1\). This is not what this article proposes – rather, it uses the first-order autoregression predictively (I don’t remember who wrote about predictive vs analytic/structural use of statistics, but this is a reference to that discussion), just for the information carried forward by the previous measurement. It exploits the correlation without making any assumptions about causation. That’s the main point I wanted to make.