# Retrospective: Metaculus Quarterly Cup

I have participated in the Metaculus Quarterly Cup these past three months. This
is a tournament that runs over just three months^{1} (as you might guess from
the name) and contains only short-fuse questions, i.e. those that resolve within
those three months – a few of them as quickly as days after opening.

This has given me a chance to fairly evaluate my forecasting abilities. Interesting and humbling! And also a lot of fun. Big thanks to the organisers.

# Just the numbers

My average blind Brier score in the tournament was 0.18.^{2} Blind in the sense
that this score is based on forecasts I made while the community prediction was
hidden from me, meaning I couldn’t be influenced by it.
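For readers unfamiliar with it, the Brier score is just the mean squared error between forecast probabilities and the 0/1 outcomes, lower being better. A minimal sketch with made-up forecasts:

```python
def brier(p, outcome):
    """Brier score for one binary forecast: squared difference
    between the forecast probability and the 0/1 outcome.
    Lower is better; always forecasting 50 % scores 0.25."""
    return (p - outcome) ** 2

# Hypothetical forecasts: (probability given, what actually happened).
forecasts = [(0.9, 1), (0.2, 0), (0.6, 0)]
avg = sum(brier(p, o) for p, o in forecasts) / len(forecasts)
print(round(avg, 3))  # → 0.137
```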

This is just about what I would expect for someone in my situation.

# I didn’t do very well compared to the community

There were over 800 people participating in the tournament, and I ended up in the top 50 on the leaderboard. This sounds more impressive than it is: most participants didn’t attempt most of the questions, so anyone who answers consistently places above 700-something people just by showing up.

To get a sense of what a score really means compared to people who actually
tried, we can filter out everyone with a participation rate below 25 %^{3} and
then order by log score. (Why 25 %? Eyeballing the score distribution, this
seems like the threshold that includes mainly people who actually tried. In
particular, once we include people with participation below 25 %, the
distribution around a log score of 0 leans overwhelmingly toward people who
answered only a few questions and never had a chance to accumulate much score at
all.) Among this lot of 78 people who actually tried, I placed 35th. Not great;
56th percentile.

That said, my average log score per question was -0.036, which I am absolutely
happy with. It puts me around the 65th percentile among those who
tried.^{4} What does it mean that someone compares more favourably in mean log
score per question than total log score? It means that some participants got a
better score by simply not forecasting on some of the more difficult questions,
whereas I made a point of forecasting every question – even the ones I found
really difficult.

More importantly, that average score isn’t *that* far from zero – it’s just a
really tough competition: a regular person with limited training^{5} (i.e. me)
should not expect to get a log score higher than negative peanuts in a Metaculus
tournament.

Interestingly, my Brier score improved during the course of the tournament: it was 0.23 during the first third of the tournament, and 0.15 for the rest. It remains to be seen whether this was luck, easier questions, or actual learning.

## Metaculus tournament log score is relative to the community median forecast

In case you’re involved with forecasting but not Metaculus, I’ll point out something non-obvious: the tournament log score on Metaculus is relative to the community median forecast, the “wisdom of the crowd”. Critically, the wisdom of the crowd performs at the level of a superforecaster, so you practically have to be an above-average superforecaster to get a positive log score in a Metaculus tournament. This is why the vast majority of participants get a negative log score.

I will keep writing “log score”, but keep in mind that it means Metaculus tournament log score which is relative to the community median forecast.
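As I understand it, this relative score on a resolved binary question boils down to the difference of log probabilities placed on the actual outcome. A minimal sketch of that reading (Metaculus’s real scoring also averages over the question’s lifetime, which I ignore here):

```python
import math

def relative_log_score(p_mine, p_community, outcome):
    """Log score of my forecast relative to the community's:
    positive only when my probability on the actual outcome
    beats the community's."""
    if outcome == 0:  # question resolved "no": score the complements
        p_mine, p_community = 1 - p_mine, 1 - p_community
    return math.log(p_mine) - math.log(p_community)

# Community at 70 % "yes", me at 80 %, question resolves yes:
print(round(relative_log_score(0.8, 0.7, 1), 3))  # → 0.134
```

Note that matching the community exactly scores 0, which is why beating the crowd is so hard.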

# More Statistics

One of the main places where I tanked my score was non-binary questions, i.e. questions where you need to forecast an entire probability distribution. I was way too aggressive on those, and often postulated that fully plausible outcomes were impossible. This is apparent in my average log score for each type of question:

| Question type | Avg. log score |
|---------------|----------------|
| Binary        | -0.026         |
| Continuous    | -0.073         |

By looking at a histogram of the scores split up into the two types of question, we can also see that the continuous questions have much wider variance, i.e. they were wilder guesses on my part.

Note that (as is clear from the histogram) there were only a few continuous
questions – yet their contribution to my log score (-0.872 in total) was almost
as great as that of all binary questions together (-0.997). This means that if I
had refused to answer any continuous questions, my score would have been 73rd
percentile among those who tried, rather than 56th percentile.^{6} Though the
lesson I take away from this is to get better at continuous questions, not
refuse to answer them!

Because of the wide spread of the continuous questions, I have removed them from the data in the following analysis, to make the picture clearer.

Another way to dissect the results is by which forecasts I modeled statistically and which I did not. My average score is much higher for the questions I modeled statistically than for the others.

| Approach              | Avg. log score |
|-----------------------|----------------|
| Modeled statistically | +0.09          |
| Went by gut feel      | -0.06          |

Clearly, statistical modeling is where I have an edge over the community. On not a single question I modeled statistically did I perform worse than the community. Unfortunately, I only had time to lay out models for a few questions.

However, I think I could have applied some light modeling to more questions than I did, and I think I could have benefited from that. See the next section for a concrete example.

# Lessons Learned

## Tour de France stage winner (+0.246)

Many of the community members explaining their reasoning on this question made references to the individual cyclists, their strengths, the nature of the course, and so on. I took a completely different probabilistic approach, and, well, as I wrote in a comment to the question:

> Interesting to see people take somewhat of an inside view. I wanted to get a sense for the more general problem so I modeled each cyclist as a random walk with exponentially distributed step sizes. The question then becomes “How likely is it that the walk with the largest step size on step 9 is also the walk that has deviated furthest from zero?”
>
> (This depends on how many random walks there are, and maybe also on the variance of the step size, but I ignored variance and just MLE fitted the number of cyclists/random walks.)

This resulted in a prediction of 13 %, which I think was closer to correct than the community’s 30 %.
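That model is easy to simulate. Below is a minimal Monte Carlo sketch of it; the rider count of 20 is purely illustrative, not the MLE-fitted number from the original comment:

```python
import random

def p_stage_winner_leads(n_riders, n_stages=9, n_sims=10_000, seed=1):
    """Monte Carlo estimate: probability that the walk with the
    largest step on the final stage is also the walk that has
    moved furthest overall. Steps are exponentially distributed."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        steps = [[rng.expovariate(1.0) for _ in range(n_stages)]
                 for _ in range(n_riders)]
        stage_winner = max(range(n_riders), key=lambda i: steps[i][-1])
        overall_leader = max(range(n_riders), key=lambda i: sum(steps[i]))
        hits += stage_winner == overall_leader
    return hits / n_sims

# With e.g. 20 effective contenders (illustrative, not the fitted value):
print(p_stage_winner_leads(20))
```

The estimate shrinks as the number of effective contenders grows, which is what makes fitting that number the crux of the forecast.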

## Dow Jones barrier (+0.013)

The community was too pessimistic, and so was I. I’ve written about this one already – had I only modeled it statistically from the start, I would have performed much better.

## LK-99 replication (-0.676)

This was my first serious failure: a replication was attempted very quickly, a prospect I had at an embarrassing 9 %. I think there are two big lessons and one small one that I take away from it:

- Don’t underestimate the “it takes just one” effect.
- Do run quick Fermi estimations, especially when you already have a strong gut feeling.
- Not all materials scientists are burdened by extremely slow-moving bureaucracies.
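The first two lessons combine into a one-line Fermi check on the “it takes just one” effect. The numbers below are purely illustrative, not actual base rates for LK-99:

```python
# "It takes just one": even if each individual lab is unlikely to
# attempt a quick replication, many independent labs add up fast.
# Both numbers are made up for illustration.
p_single_lab = 0.05   # chance any one lab attempts a quick replication
n_labs = 40           # labs worldwide with the right equipment

p_at_least_one = 1 - (1 - p_single_lab) ** n_labs
print(round(p_at_least_one, 2))  # → 0.87
```

Even with pessimistic per-lab odds, the aggregate probability lands nowhere near 9 %.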

## FSO Safer oil transfer (-0.092)

This is a case where I might have to update my entire world view to get a better prior next time. In my head, any big project takes longer than anticipated due to unforeseen problems. The oil transfer off of the FSO Safer went off smoothly.

Either

- Not all big projects take longer than anticipated – then how would I recognise each kind? Or
- The oil transfer *did* take longer than anticipated, but most of the uncertainties had already been ironed out of the project by the time the transfer proper started.

## Luisa González plurality (+0.133)

Sure she did, and I think that was fairly clear from the polling. This is one of multiple questions where I could have done better if I (a) trusted polls more, and (b) knew how to properly aggregate across different polls. The latter is still something I need to figure out – if you have a reference, hit me up!
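For what it’s worth, one simple textbook approach to (b) is inverse-variance weighting of poll shares. A sketch under the (unrealistic) assumption that polls are independent simple random samples; real aggregators also correct for house effects and recency:

```python
def pool_polls(polls):
    """Inverse-variance weighted average of poll shares.
    Each poll is (share, sample_size). Assumes independent
    simple random samples, which real polls are not."""
    weighted, total_w = 0.0, 0.0
    for share, n in polls:
        var = share * (1 - share) / n  # binomial sampling variance
        w = 1 / var                    # precision weight
        weighted += w * share
        total_w += w
    return weighted / total_w

# Three hypothetical polls for a candidate:
print(round(pool_polls([(0.41, 800), (0.38, 1200), (0.44, 600)]), 3))
```

Larger samples get more weight, which is the core idea; the open question of correlated polling errors is what this sketch leaves out.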

## Surovikin stripped of command (-0.096)

There was a class of questions where media speculation biased me toward “the thing will happen” when in fact a clear head would have realised “the thing will not happen” is more likely. Another example of this is India changing its name to Bharat.

## UAW strike against Big Three (-0.250)

It seems I have a general anti-strike bias. In my mind, large strikes belong to that category of things that seem like they will happen right up until the moment they surely won’t anymore. Thus I go in very low on these questions, which mostly works out, but when it doesn’t, it hurts. I think my prior is a bit too low in this category.

## Accumulated cyclone energy (-0.349)

At this point it’s just silly. I put way too much emphasis on climatology and almost entirely disregarded freely available expert forecasts which turned out to be very accurate.