Quarterly Cup 2023 Q4 Retrospective

During the past three months I have competed in the latest Metaculus Quarterly Cup forecasting tournament, trying to out-predict other Metaculus users on short-fuse questions.

After the 2023 Q3 tournament, I summarised my thoughts, and I’m doing it again because I think it helped me learn from my mistakes.

This article is divided into four main sections, although they slightly float into each other:

  1. A brief summary of my performance.
  2. Notes on the value of information.
  3. A statistical look at what worked and what did not.
  4. Descriptions of how I arrived at my predictions for specific questions.

Just the numbers

My average blind1 As in without seeing the community prediction. Brier score2 A lower Brier score is better, and the range people practically fall in is something like 0.07 (best) to 0.25 (random guessing). hovered around 0.13 throughout the tournament. Last tournament it started at 0.23 and improved to 0.15 over the course of the tournament, so this improvement ought to show up as better tournament performance.

That said, Brier scores are difficult to compare across situations, because easy questions lead to lower Brier scores without implying any improvement in skill, so I’m still not sure whether this is a meaningful signal.
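For reference, here is a minimal sketch of how a Brier score is computed over a set of binary forecasts. The numbers are made up for illustration:

    def brier_score(forecasts, outcomes):
        """Mean squared difference between forecast probabilities and what happened.
        Lower is better; always guessing 50 % gives exactly 0.25."""
        return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

    # Made-up example: three forecasts and their resolutions (1 = Yes, 0 = No).
    print(brier_score([0.8, 0.3, 0.9], [1, 0, 1]))  # ~0.047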

This time I did very well

Just like in the last tournament, there were a lot of participants – specifically, 825 of them. I ended up in second place on the leaderboard, which is better than I expected.

However, most participants only submit forecasts on a handful of questions, and these are penalised in the tournament leaderboards, so we can beat the vast majority of participants by just showing up and submitting a timid prediction on every question. I maintain that the real measure of tournament performance is the average score per question. Unfortunately, that suffers from the opposite problem: it’s heavily skewed toward people who got lucky on a few questions and then stopped forecasting.

To get something meaningful out of it, we’ll select only participants who submitted forecasts on 35 % or more of the questions3 The fraction is chosen following a similar methodology as before, except now looking at the most extreme scores instead of the most average scores. I think that’s a more meaningful way to find out at what point we’re including people who simply had unusually good or bad luck on a small number of questions., and then look at the average score per question for each participant. This leaves 58 people in the race. Of these, I placed … third best! With an average relative log score of +0.029 per question.
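The filtering step itself is simple. This is a sketch with made-up column names, to show the idea rather than the exact analysis code:

    import pandas as pd

    def qualifying_average_scores(scores: pd.DataFrame, min_fraction: float = 0.35) -> pd.Series:
        """scores: one row per (participant, question) with a 'relative_log_score' column.
        Keep participants who forecast on at least min_fraction of all questions,
        then return each one's average score per question, best first."""
        n_questions = scores["question_id"].nunique()
        counts = scores.groupby("participant")["question_id"].nunique()
        qualified = counts[counts >= min_fraction * n_questions].index
        return (
            scores[scores["participant"].isin(qualified)]
            .groupby("participant")["relative_log_score"]
            .mean()
            .sort_values(ascending=False)
        )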

[Figure: quarterly-cup-retrospective-2-01.svg]

This is a result I’m insanely happy with. As regular readers know, the Metaculus tournament log score is relative to the community median forecast; in other words, thanks to what we learned from the last tournament – and arguably a good portion of luck – my forecasts beat even the wisdom of the crowd this time around.
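For readers unfamiliar with that scoring rule: as I understand it, the per-question relative log score is roughly the logarithm of the ratio between the probability I assigned to the realised outcome and the probability the community median assigned to it, time-averaged over the question’s lifetime. A minimal sketch of the idea, ignoring the time-averaging:

    import math

    def relative_log_score(p_mine: float, p_community: float, outcome: bool) -> float:
        """Rough per-question relative log score for a binary question,
        ignoring the time-averaging Metaculus applies. Positive = beat the crowd."""
        if not outcome:  # the score is based on the probability assigned to what happened
            p_mine, p_community = 1 - p_mine, 1 - p_community
        return math.log(p_mine / p_community)

    # Illustration with made-up numbers: 70 % against a community at 50 %.
    print(relative_log_score(0.70, 0.50, True))   # ~ +0.34 if it resolves Yes
    print(relative_log_score(0.70, 0.50, False))  # ~ -0.51 if it resolves No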

Information, bias, and noise

Jake Gloudemans, who has won both Metaculus Quarterly Cups he participated in, wrote his own retrospective article after the previous tournament.

What’s really interesting is when he says,

I’m also the kind of person that enjoys tracking 15 different events as they unfold by meticulously doing filtered Google searches each day and keeping a library of bookmarks and Twitter accounts to check in on repeatedly. […] I was already sort of doing this just in a more passive way before I started forecasting, so it felt like a very natural transition from my prior online habits.

Later in his retrospective, in a really good discussion on the perverse incentives around comments in tournaments, he goes on to say,

There were a few times in the tournament where I realized that due to some hard-to-track-down piece of information that the community probably wasn’t considering, the real probability should have been <1% for a question, while the community was sitting at 20 or 30%. With a simple comment, I could have easily shifted the whole community to my number, thereby improving the performance of the site. But this would have eliminated any points that I would gain in the tournament from being more right than the community, so I didn’t comment.

I have written before about the three things that make up the quality of a forecast: information, bias, and noise. Here, Jake Gloudemans expresses the importance of information in short-term forecasting.

This is counter to my approach, which so far has been about reducing the bias and noise of my forecasts. I don’t pay much attention to information at all – in fact, my main source of information regarding my forecasts is the Metaculus comment section itself! (For an example, see the US military deaths question under Lessons Learned.) The whole “filtered Google searches and a library of Twitter accounts” thing sounds utterly exhausting to me and I don’t think I can ever bring myself to do it.

But maybe my complete disregard for information is the next weakness I need to overcome to improve my forecasting. After all, the advice comes from the keyboard of someone who has out-predicted me across 93 questions by a very generous margin.

More Statistics

This is the section where we look statistically at what worked and what didn’t in this tournament. Comments on specific questions come in the next section.

Continuous questions continue to be hard

As before, non-binary questions, where we need to forecast an entire probability distribution, are more difficult. The contribution from each question type to my average log score was:

Type of question        This tournament    Last tournament
Binary                  +0.034             -0.026
Continuous              +0.002             -0.073

By looking at a histogram of the scores split up into the two types of question, we can also see that the continuous questions have wider variance, i.e. they were wilder guesses on my part.

[Figure: quarterly-cup-retrospective-2-02.svg]

That said, they are not as broadly distributed as in the previous tournament4 Here they ranged from -1.2 to +0.3., so at least I’m a little more careful now. Some of the problems I had with continuous questions this time around were actually not a failure to forecast, but a failure to input the distribution I had in mind. (See the New York rain question under Lessons Learned.)

Quantitative modeling can be extremely simple

In the previous tournament I discovered that quantitative modeling is where I have an edge over the community, and I wanted to do that for more questions this time around. I did, and while the edge has persisted, it is smaller.

Type of forecast          This tournament    Last tournament
Modeled quantitatively    +0.04              +0.09
Went by gut feel          +0.03              -0.06

I think the decrease in the average edge is to be expected, because I broadened the scope of questions I considered for quantitative modeling. Even questions that don’t seem amenable to quantitative reasoning can be forecast using e.g. a tallying heuristic over a small ensemble of forecasts, where we make two or three different types of forecasts and then take the average of those to get the final number (see the Scottish human chain question under Lessons Learned for an example). I did that type of light quantitative modeling for more questions than last tournament, so naturally the average effectiveness of what counts as quantitative modeling goes down.
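To make the tallying heuristic concrete, here is a minimal sketch. The two estimates happen to be the ones from the Scottish human chain question discussed below; the log-odds variant is purely my own extra illustration, not something I used in the tournament:

    from math import log, exp

    def average_probability(forecasts):
        """Plain arithmetic mean of a small ensemble of probability estimates."""
        return sum(forecasts) / len(forecasts)

    def pool_log_odds(forecasts):
        """Alternative: average in log-odds space, which leans a bit more toward
        the more confident estimates. (Illustration only.)"""
        logits = [log(p / (1 - p)) for p in forecasts]
        mean_logit = sum(logits) / len(logits)
        return 1 / (1 + exp(-mean_logit))

    ensemble = [0.42, 0.14]                # two independent rough estimates
    print(average_probability(ensemble))   # 0.28
    print(pool_log_odds(ensemble))         # ~0.26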

Here’s the density of scores for forecasts, broken into groups based on whether they were modeled quantitatively or I picked a forecast from thin air.

[Figure: quarterly-cup-retrospective-2-03.svg]

What concerns me in particular about this plot is that while quantitatively modeled forecasts seem to consistently earn a small number of points, the gut feels are all over the place. That suggests any apparent improvement in my gut feel is more about luck than anything else, and that I should continue to find ways to build forecasts from numbers.

Actor–motivation modeling was surprisingly effective

However, I also tried a new qualitative way to model forecasts: using some sort of actor–motivation framework, where I try to think of which people have a bearing on the outcome, and what their respective priorities are. I read about someone using this type of reasoning a while ago, but I have – annoyingly – forgotten where.5 If you happen to know, please email me! I want to learn more.

We’ll compare this actor–motivation framework to quantitative modeling (which we know has a meaningful effect) to get a sense of whether its effect is meaningful.

Type of forecast          This tournament
Actor–motivation model    +0.031
Quantitative modeling     +0.035

Although the data is shaky because I didn’t use it for very many questions, the effect looks comparable to quantitative modeling, which is neat.

[Figure: quarterly-cup-retrospective-2-04.svg]

There might be something to it, although it is worrying that this framework is responsible for my greatest loss on any question this tournament. I’ll continue to explore it, but prudently.

Lessons Learned

This tournament I did not write many comments explaining my reasoning, so here are some explanations for the curious.

Scottish human chain (not) continuous (+0.13)

My forecast: 25 %, when the community forecast was 50 %.

 

The Chain of Freedom Scotland organised an event with the purpose of people holding hands all the way across the narrowest part of Scotland. This question would resolve positively if the human chain did indeed stretch across Scotland – although people were allowed to cheat and hold hands through banners.

My first attempt at modeling this was based on the following assumptions:

  • Size of Scottish population: 5,000,000
  • Fraction of Scottish population supportive of the cause: 45–60 %
  • Fraction of humans who are interested in this sort of event: 20–50 %
  • Fraction of interested humans actually intending to attend: 2–20 %
  • Fraction of intended attendants who show up: 70–95 %
  • Number of people who need to participate for success: 50,000

I plugged these numbers into Precel to compute the final percentage and it responded with 42 %.
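For those who want to replicate the idea without Precel, here is roughly the same Fermi estimate as a stand-alone Monte Carlo sketch. Treating each range as the 90 % interval of a lognormal distribution is an assumption on my part, so the output will only land in the same neighbourhood as the Precel answer rather than match it exactly:

    import numpy as np

    rng = np.random.default_rng(2023)
    N = 100_000

    def lognormal_90ci(lo, hi, size):
        """Sample a lognormal whose 5th/95th percentiles are lo and hi.
        (An assumed encoding of the ranges; not necessarily what Precel does.)"""
        mu = (np.log(lo) + np.log(hi)) / 2
        sigma = (np.log(hi) - np.log(lo)) / (2 * 1.645)
        return rng.lognormal(mu, sigma, size)

    population = 5_000_000
    supportive = lognormal_90ci(0.45, 0.60, N)  # supportive of the cause
    interested = lognormal_90ci(0.20, 0.50, N)  # interested in this sort of event
    intending  = lognormal_90ci(0.02, 0.20, N)  # actually intending to attend
    showing_up = lognormal_90ci(0.70, 0.95, N)  # intended attendants who show up

    attendance = population * supportive * interested * intending * showing_up
    print((attendance >= 50_000).mean())  # roughly 0.4 under these assumptions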

However, I also noticed in the description of the event that they had divided the width of Scotland into 11 subsections. That made me realise that the success rate of the entire endeavour is capped at the success rate of the least-likely-to-succeed subsection. In other words, even if 10 subsections are guaranteed to be successes – as long as just one subsection has a success chance of 30 %, the entire question will have a resolution chance of 30 %.
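The arithmetic behind that weakest-link effect, with purely illustrative subsection probabilities and assuming the subsections succeed or fail independently:

    from math import prod

    # Illustrative numbers only: ten near-certain subsections and one shaky one.
    subsections = [0.99] * 10 + [0.30]
    print(prod(subsections))  # ~0.27 -- barely better than the weakest subsection alone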

Since much of the planned chain went through well-populated areas, I speculated the average success rate for a subsection could be as high as 90 %, but I imagined that there was likely one trouble spot with a success rate of just 30 %. That means the total chance of success of the endeavour is 14 %.

The average of 42 % and 14 % is 28 %, but the shape of the problem (all subsections must be successful for the question to resolve positively) made me lean slightly lower, so I forecast 25 %.

Conservatives lost Mid Bedfordshire (+0.14)

My forecast: 40 %, when the community forecast was 50 %.

 

The Mid Bedfordshire constituency has been held by the UK Conservatives for many decades, but polling indicated it would go to Labour in this by-election.

I don’t know why the community forecast was as high as 50 %. Polling was very clear in this case, and the only reason I didn’t go lower than 40 % was that I was concerned I had misunderstood something about UK politics. Maybe that was true for the rest of the community also.

Speaker of the House needed four ballots (-0.28)

This was one of the first questions for which I tried the actor–motivation model. I reasoned that both sides would most likely want to avoid the drawn-out process of the last election, and that a candidate would not be proposed unless they had already found broad agreement. In other words, I expected the disagreements to turn into delays rather than more ballots.

Compared to the community, I had my money on “it will either take just one or two ballots (because they are careful), or more than 10 ballots (because they throw candidates at the wall to see which one sticks).” Turns out they did something in between: they carefully threw candidates at the wall, which made one stick after four ballots.

That was an outcome that seems obvious in hindsight, but I under-weighted it at the time.

It stops raining in New York (-0.12)

I lost points on this question due to technical difficulties rather than a bad forecast. New York City had seen many consecutive weekends of rain during the autumn, and the question was when the rainy weekends would stop.

Because of a missing feature of Metaculus that has since been added6 At the time, Metaculus did not have a multiple-choice question type, so questions like these needed to be asked with a continuous distribution of dates., answering this question required carefully aligning pixels using slider inputs. I accidentally misaligned my pixels in a way that made the system think I had forecast the rainy weekends to end on a Monday morning rather than in the last hour of a Sunday, which is when the weekend actually ends. I lost some points on this technicality.

I’m glad Metaculus now has multiple-choice questions so we don’t have to align pixels as often.

New Delhi has shitty air quality (+0.11)

My forecast: 95 %, when the community forecast was 75 %.

 

The appropriate forecast for this would actually have been something closer to 97 % than 95 %. The reason I submitted a lower forecast was the spotty and difficult-to-use historic data that was available. There were two main problems:

  • Missing data was sometimes encoded with an “n/a” type value, sometimes as a 0, and sometimes that row was just not present. This made it hard to align the data from multiple years without writing a script for it.
  • The data from the current year was split into two different files, and it took some time before I realised how they were related.

Had I taken the time to sort out those data problems earlier, I would have submitted a more confident forecast. I suspect the same problem affected the community forecast, though, so I didn’t suffer much from it.

Liberia does (not) re-elect its incumbent president (+0.21)

My forecast: 50 %, when the community forecast was 60 %.

 

The community did seem fairly comfortable “betting on the incumbent in an African election”, as a commenter put it. Maybe 60 % was the correct forecast, I don’t know.

The reason I bring this question up is to note how many points one can get from a non-committal 50 % forecast when the community is just 10 percentage points on the wrong side of maybe. +0.21 corresponds to more than a sixth of my total score. It really pays not to be too confident.
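Using the rough log-ratio view of the scoring sketched earlier (and still ignoring the time-averaging), the arithmetic roughly checks out:

    import math

    # 50 % on "re-elected" is also 50 % on "not re-elected"; the community's 60 % leaves 40 %.
    print(math.log(0.50 / 0.40))  # ~ +0.22, close to the +0.21 actually awarded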

Milei gets many Argentine votes (+0.23)

I have this idea that national populist politicians get more votes in the real election than polling indicates. I don’t know why that is, nor from where I got the idea, but it contributed to my score on this question.

The bigger contribution, though, was a very wide probability distribution. I don’t know why, but it is extremely easy to submit distributions that are too narrow on continuous questions.7 Although I have no evidence of it, I secretly speculate that maybe something about the algorithm that computes the community prediction makes it seem like the community is more certain than it really is.

Guayana Esequiba referendum supportive (-0.07)

My forecast: 80 %, when the community forecast was 95 %.

 

I think the community had the right forecast here, but I understood very little about the matters involved – and I didn’t have time to research them either – so I opted for a more conservative forecast of my own. Despite losing some points on this, I think it was the right thing to do.

The MONUSCO mission is extended (+0.18)

My forecast: 79 %, when the community forecast was 38 %.

 

This was a question where I had a strong sense it was likely to resolve positively, but it was based mainly on gut feel. There was also some debate around what would really count as a positive resolution, further increasing my uncertainty. Thus, when the community forecast was revealed, I quickly reverted down to something very close to it.

I don’t know if that reversion was a mistake. It is annoying to look back on the times when my gut was right but I didn’t listen to it. On the other hand, the statistical evidence above suggests that it’s maybe a losing policy to pay too much attention to my gut as a rule, and the correct thing is to revert closer to the community forecast unless I have stronger reasons not to.

I had the same problem when evaluating whether any shipping company would return to routing ships through the Red Sea after the Houthi attacks intensified. (They did, and I strongly suspected they would, but I wasn’t confident enough to stray very far from the community prediction.)

US military (does not) suffer deaths in Red Sea (+0.00)

My forecast: 5 %, when the community forecast was 3 %.

 

I know the death rate of the US military is very low, even among troops in contact with hostile forces. So I just guessed fairly wildly at 5 %. Then halfway through, someone prompted me to look at the data and at that point I realised how low the death rate really is, and I updated to 1 % immediately.

Clearly, I should have just taken three minutes to look up the data from the start and opened at 1 %. But I’m lazy! Here’s that information deficiency again.

Claudine Gay resigns … two days too late (+0.04)

My forecast: 6 %, when the community forecast was 20 %.

 

The question was whether Claudine Gay would resign before January 1, 2024. I opened way below the community, and even as time went on and the community updated downwards, I did so more aggressively, and ended up at 1 % when the community was at 4 %.

I was lucky. Claudine Gay did not resign before January 1, 2024 – she resigned on January 2.

What confuses me about this question is that a part of me still thinks my forecast was more correct than the community. Yet – had she resigned just a couple of days earlier, I would have lost a lot of points on this one. Did we actually nearly witness a 1 % probability event in a forecasting tournament, or was the community forecast more accurate?

I honestly don’t know, but it seems important for my future performance that I figure it out.

Summary

Random observation: normally when we cross into the new year, I keep writing the old year whenever I’m asked to write down a date, because I’m so used to it. I would have expected to habitually write “2023” well into February 2024. This year, however, I noticed the opposite effect: I started habitually writing “2024” in late November. I think it’s because I’ve been forecasting on questions that end in 2024.

I don’t think I’m more forward-looking than before I started forecasting, but I feel like I’m less stuck in the past. I have some changes I want to make in the next Quarterly Cup, which opens on January 8:

  • Holding on to my gut feeling slightly longer, even against my better judgement8 Not because it’s a winning move, but because I want to try to see what happens..
  • Taking a moment to first break down a question into the most important assumptions that have to be made, rather than rushing to a probability.
  • Spending a few minutes looking up data to support those assumptions.
  • Coming up with a broader range of possible scenarios to weigh into the forecast.
  • Trying to find a way to blend the actor–motivation framework with quantitative reasoning.

I’m also participating in another slightly longer Metaculus tournament and the acx 2024 prediction competition, where I’ll try to apply these lessons as well.

If you are thinking of participating in a forecasting tournament but you are unsure for whatever reason, reach out to me and ask! It’s fun and not at all scary. These short-fuse tournaments are great because you get fast feedback on what goes right and wrong.
