Getting Better at Counter-Strike

In the previous article we learned

  1. what Counter-Strike is;
  2. that skill in Counter-Strike is mostly (but not only) about getting kills and staying alive, possibly in that order;
  3. that rationally taking a do-or-die duel in the game requires a small edge – specifically, a 54 % success chance.1 Under the assumption that the Leetify rating is a good proxy for overall success. This was the main research question of the article.

And to be clear, in that article – as well as in this one – we’re not really talking about Counter-Strike. What we’re trying to learn are principles and techniques by which we can improve at any activity.2 If we wanted to get good at Counter-Strike, we would not do statistics, we would look up specific Counter-Strike drills used by professional players. They might not be optimal, but they will sure as hell be sufficient for an average player like me. The great thing about not being good at something is that improvement does not have to be particularly optimised to be effective. In other words, we’re actually trying to learn how to measure and decompose skill, and we happen to use Counter-Strike as an interesting example.3 I should also say that I’m mainly speaking of Counter-Strike the way it is played at my – very average – level. I’m sure at lower levels it’s all run-and-gun and at higher levels it’s 4D chess, but I can’t speak to that because I don’t play at those levels.

Reader feedback on the previous article highlighted two problems with it:

  1. it used concrete game events to predict the Leetify rating, rather than the match winner; and
  2. it stopped at the kills-per-round metric, which is not very actionable.

I have no excuse regarding the first criticism – that was my bad. Regarding the second, my defense is that I didn’t really set out to answer it. The main question was the choice of whether to engage or not engage. How the engagement happens was deliberately left as an exercise for the reader.

Let’s see what comes out when we try to fix both anyway.

Context on rating and skill measurement

There are two levels of success measurement: one forward-looking, slow-moving rating, which tells us how good a player will be in the near future, and one backwards-looking, fast-moving attribution, which tells us how successful a player was in specific matches in the past.

Why Elo-type ratings are amazing

Rating, in the Elo sense which seems to be popular today, is the gold standard for measuring skill because it’s both meaningful and objective.

You can assign any old number you want to a performance, but a proper rating has predictive power. If we know the ratings of the players in a match, we can convert that information into an accurate probability of winning or losing the match. That’s what a rating is. You can assign a rating number however you want, as long as there’s a rule for how to convert it into probability-of-winning.

I’ll dwell on this point because it’s important. Someone can assign a rating by sticking a wet finger into the air, or by an advanced, statistical, computationally heavy iterative formula. It doesn’t matter. What matters is being able to translate it into a probability of winning a specific match. The rating that yields the best win predictions is the best rating, no matter how it was constructed.
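As a concrete illustration of such a rule, here is the classic Elo conversion – a minimal sketch in Python; the constants (a 400-point scale, base 10) are Elo’s convention, and other rating systems pick other rules:

    def elo_win_probability(rating_a: float, rating_b: float) -> float:
        """The classic Elo rule: a 400-point rating gap corresponds
        to 10:1 odds of winning."""
        return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

    # A player rated 100 points above their opponent is expected
    # to win about 64 % of the time.
    print(elo_win_probability(1600, 1500))  # ~0.64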

This is what is meant by meaningful: a rating directly measures skill through chance of winning. And it’s objective because the best rating system is determined by real-world outcomes rather than someone’s opinion. Thanks to its objectivity, a rating system can be calibrated over time to increase our faith in its meaningfulness.

The problem with ratings is that to yield good predictions, they necessarily have to update slowly. If they update too quickly, they will overfit to noise, and noisy predictions are bad predictions.4 I think the Counter-Strike 2 Premier rating actually suffers from this. My rating wiggles up and down a lot – by almost 15 % of the available range! A rating does an excellent job of measuring someone’s skill over the long term.
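In Elo-style systems, the update speed is governed by the K-factor, and the trade-off is easy to see in a sketch of a plain Elo update step:

    def elo_update(rating: float, opponent: float, won: bool, k: float) -> float:
        """One Elo update step: nudge the rating toward the actual
        result, in proportion to how surprising the result was."""
        expected = 1 / (1 + 10 ** ((opponent - rating) / 400))
        return rating + k * ((1.0 if won else 0.0) - expected)

    # The same coin-flip win moves a cautious rating a little and an
    # eager rating a lot -- the eager one chases noise.
    print(elo_update(1500, 1500, won=True, k=10))   # 1505.0
    print(elo_update(1500, 1500, won=True, k=100))  # 1550.0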

When we need short-term metrics

For improving our game5 And crafting handy decision rules such as taking fights that are 54 % in our favour. we want faster feedback. That’s where per-match metrics like the Leetify rating we saw in the previous article come into play. It asks not “how good is this player” but “how much of their team’s win in that specific match was caused by this player?”

We can’t know how good a player is based on one match alone, because there’s a lot of noise in there. But we do have a remote chance of figuring out how much of the outcome can be attributed to the player, because we’re now asking about something concrete that has just happened, rather than an abstract quality inherent to a player.

There’s a danger of conflating the two questions. Someone contributing much to a win does not necessarily mean they are good. They could just have been lucky. But in constructing and computing this per-match metric, we have to figure out what it actually means to contribute to a win, and that’s a valuable question to ask!

When asking that question, it is tempting to get subjective. Here’s what one service, tracker.gg, says about their system:

[…] we’ve picked specific factors that we think are strong performance indicators. […] We understand deciding what stats are chosen for performance rating is controversial. Our choices for factors were made based on a combination of intuition about the games we love, and a more boring detailed analysis. With that said, we believe the community has great intuition, and if we find that a change is needed to how or what we’re scoring, we’re ready to make adjustments.

It sounds as if they are measuring performance based on their own vibes, and/or community vibes. Of course that is going to be controversial! When evaluating people, choosing subjective criteria is inviting controversy.

But it doesn’t have to be this way. Statistically inferring causality based on past events can be difficult and hairy, but it is possible.6 Causality: Models, Reasoning, and Inference; Pearl; Cambridge University Press; 2009. We have the data. Instead of picking factors someone thinks are important performance indicators, we can figure out which factors actually are important performance indicators. We can figure out how important they are relative to each other, and use that to weight them, rather than pick weights based on what someone thinks sounds good. This can be really simple to start out, and then expanded and deepened as more factors are discovered with time. That’s what I like about Leetify – it appears they are doing exactly that.7 I have also been recommended another service, scope.gg. From what I can tell, their rating is similar to the hltv 2.0 rating, i.e. it decomposes into kills, deaths, or proxies thereof and does not try to capture the nuance of the game the way the Leetify rating does. When they attribute +12 rating to a player, it’s because that player caused an event in game that improved their team’s win probability by 12 %. Over large timescales, this is verifiable!
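To illustrate the principle – Leetify’s actual model is proprietary, so the toy win_probability model and the event fields below are made-up stand-ins – an attribution of this kind could look roughly like:

    def win_probability(state: dict) -> float:
        """Stand-in for a model fitted on large amounts of match data.
        Toy version: the win chance grows with player-count advantage."""
        return 0.5 + 0.1 * (state["allies_alive"] - state["enemies_alive"])

    def attribute(events: list[dict]) -> dict[str, float]:
        """Credit each player with the change in their team's win
        probability across the events they caused."""
        credit: dict[str, float] = {}
        for event in events:  # kills, defuses, flashes, ...
            delta = (win_probability(event["after"])
                     - win_probability(event["before"]))
            credit[event["actor"]] = credit.get(event["actor"], 0.0) + delta
        return credit

    # One kill, taking the round from 4v5 to 4v4:
    events = [{"actor": "player1",
               "before": {"allies_alive": 4, "enemies_alive": 5},
               "after": {"allies_alive": 4, "enemies_alive": 4}}]
    print(attribute(events))  # {'player1': 0.1}, i.e. +10 % win probability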

The difficulty with data-driven short-term metrics

I think the reason many people prefer not to do this is that it often gives a boring and uninteresting result. Counter-Strike matches are really noisy. Many of the factors people want to matter are probably inconsequential enough in the big scheme of things that adding them to a model mainly serves to increase noise, rather than improve its predictive power.

A concrete example of this from the previous article is the number of assists a player has gotten per round. An assist means a player did non-trivial damage to an opponent who was later killed by a teammate. An assist ought to be a good thing. It ought to matter. It speaks to the part of our brain that does logic. Obviously an assist is good. But when we chuck it into the model, we learn that it’s not important at all. That kind of discovery annoys people to no end. Statistical reasoning often ends up standing in opposition to logical reasoning, and we have been taught to revere logic above all else.8 This rant bleeds into another article on statistical literacy I’ve been intending to write, so I’ll try to keep it brief.

Recently, Leetify seems to have made a similar discovery around smoke grenades. Sure, they are good. They are important. But when Leetify had finally nailed down an exact definition of what should count as a good smoke grenade, it didn’t appear to be correlated with winning at all, so they had to take that definition back to the drawing board. I think most of the detailed definitions my readers could think of for what constitutes “a good smoke grenade” would end up not correlating at all with victory.

More actionable statistics

There were two problems with the specifics of the analysis in the previous article.

  1. It used concrete game events to predict the Leetify rating, rather than the winner. This is only a problem if you don’t believe the Leetify rating is a good proxy for winnage.
  2. It stopped at the kills-per-round metric. This is not really a good stopping point, because it doesn’t help us get more kills. If anything, the kill rate should be the starting point for more actionable analysis. What do people do to get more kills?

Leetify rating is a fair match outcome predictor

We can start by answering the first question. I have downloaded data on my performance in something like 60 of my last matches.

Among these, there are 26 wins, yielding a win rate of 45 %. My Leetify rating in these matches is distributed thusly:

[Figure gebe-cs-01.svg: distribution of my per-match Leetify rating]

We learned in the previous article, and can confirm here, that my Leetify rating is negative on average. This is consistent with my win rate being below half.

Now, we wanted to know if the Leetify rating is a good predictor of match outcome. We can tell immediately from a scatterplot (and a +0.6 correlation) that it is some sort of indicator.

[Figure gebe-cs-02a.svg: scatterplot of my Leetify rating against match outcome]

This is actually a little surprising. Remember, this is just the Leetify rating of one player on the team. That player could have the game of their life but be paired up with bad teammates. Or they could have an off day but happen to be paired up with good teammates. Despite all those unknown variables, the performance of one player looks like it’s a meaningful indicator of how the game went.9 If you guessed victory when that one player had a positive rating and loss otherwise, you’d be right more than 75 % of the time! This is an argument against the so-called “Elo hell”: if you consistently play well, you’ll have a 75 % win rate on average, even if you sometimes get matched up with a bad team.

If we do a logistic regression of Leetify rating against match outcome, we get a fitted curve like this, which translates a Leetify rating into probability of winning. Note that it does this based on my recent match history – this is not the definition of the Leetify rating, it’s just a consequence of the outcomes of my recent matches.

[Figure gebe-cs-03a.svg: logistic regression curve translating Leetify rating into probability of winning]

How do we know if this is a good fit? We can compute its forecasting error with cross-validation. The Leetify rating gets a Brier score of 0.17 when trying to predict the outcome of my matches. I would consider that to be fairly good – if I had that score in a forecasting tournament I would give myself a pat on the back.

It’s not perfect. We knew that ahead of time. I don’t think there’s any way to use the statistics from a single player to determine match outcome perfectly, because it is going to depend on everyone else on the team as well! But it’s fairly good.
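For concreteness, here is roughly how such a cross-validated Brier score can be computed – a minimal sketch in Python, with synthetic placeholder data standing in for my real per-match table:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    # Synthetic placeholder data standing in for ~60 real matches:
    # one Leetify rating per match, and whether the match was won.
    rng = np.random.default_rng(42)
    rating = rng.normal(-0.5, 2.0, size=60).reshape(-1, 1)
    won = (rating.ravel() + rng.normal(0.0, 2.0, size=60) > 0).astype(int)

    # Out-of-sample win probabilities: each match is predicted by a
    # logistic regression fitted on the other folds only.
    p_win = cross_val_predict(LogisticRegression(), rating, won,
                              cv=5, method="predict_proba")[:, 1]

    # Brier score: mean squared forecasting error. 0 is perfect,
    # 0.25 is what always guessing 50 % would give.
    print(np.mean((p_win - won) ** 2))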

Predicting match outcome from player kills and deaths

We could construct a similar model based on e.g. the excess kills per round, which is what we’ll call the difference between kill rate and death rate.

[Figure gebe-cs-02b.svg: scatterplot of my excess kills per round against match outcome]

This would have been similarly surprising, if we hadn’t discovered in the previous article that the Leetify rating is to a large extent made up of just excess kill rate.

With the same logistic regression, we get a very similar curve.

[Figure gebe-cs-03b.svg: logistic regression curve translating excess kills per round into probability of winning]

How well does this model, based purely on kills and deaths, predict match outcome compared to the Leetify rating? It turns out it’s very close: a Brier score of 0.18. Comparing the two models’ forecasting error distributions, we can also see that they are practically equivalent.

[Figure gebe-cs-04.svg: forecasting error distributions of the two models]
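Continuing the sketch from above – again with placeholder data, where the excess_kills column is a made-up stand-in – the comparison amounts to a few more lines:

    # A second predictor, excess kills per round (placeholder data
    # again), run through the same cross-validation procedure.
    excess_kills = rng.normal(0.0, 0.3, size=60).reshape(-1, 1)
    p_win_excess = cross_val_predict(LogisticRegression(), excess_kills, won,
                                     cv=5, method="predict_proba")[:, 1]

    # Per-match squared forecasting errors for both models. If the
    # paired differences hover around zero, the models are practically
    # equivalent as forecasters.
    err_rating = (p_win - won) ** 2
    err_excess = (p_win_excess - won) ** 2
    print(err_rating.mean(), err_excess.mean())
    print((err_rating - err_excess).mean())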

My conclusion here is that yes, the Leetify rating is a good match outcome predictor – but it’s not much better than the excess kill rate. Since the excess kill rate is much easier to compute10 The Leetify rating is somewhat of a trade secret, after all, so I wouldn’t be able to replicate it. Anyone can replicate the excess kill rate!, I actually prefer that model.

Figuring out what makes up the kill rate

The next thing we might want to do, then, is figure out what players can do to improve their chances of getting kills. Here are some statistics easily available from Leetify that might help get kills:

  • Average flick angular distance
  • Body shot accuracy
  • Head shot accuracy
  • Spray accuracy
  • Counter-strafing
  • Time to damage
  • Enemies flashed
  • Average blind time
  • Grenade damage
  • Fire damage
  • Unused utility
  • Shots fired
  • Trade kill opportunities
  • Trade kill attempts

In my – very small – data set, only one of these variables is correlated with excess kill rate.

It wasn’t obvious to me at first which one that would be: the number of shots fired. In other words, players who shoot their gun more also get more kills before they die. My best guess for why this relationship shows up is that it’s reverse causation: people who happen to stay alive for longer will both get more kills and get more opportunities to fire their gun.

This means it’s sort of another dead end from the perspective of getting actionable analytics out of the enterprise. It’s highly likely these other variables do help with getting kills, but with the limited data at my disposal, we just can’t tell signal from noise.11 If only Leetify had a public api that could be used to get more data!
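The screening itself is simple. Here is a sketch of what it could look like, with synthetic placeholder data standing in for the real per-match table and a few of the candidate statistics from the list above:

    import numpy as np
    from scipy import stats

    # Synthetic placeholder table: one row per match, one column per
    # candidate statistic, plus the excess kill rate we try to explain.
    rng = np.random.default_rng(0)
    n = 60
    candidates = {
        "shots_fired": rng.normal(80, 20, n),
        "counter_strafing": rng.uniform(0, 1, n),
        "time_to_damage": rng.normal(0.8, 0.2, n),
    }
    excess_kill_rate = 0.005 * candidates["shots_fired"] + rng.normal(0, 0.2, n)

    # Pearson correlation and p-value for each candidate. With only
    # ~60 matches, anything but a strong relationship drowns in noise.
    for name, values in candidates.items():
        r, p = stats.pearsonr(values, excess_kill_rate)
        print(f"{name:>18}: r = {r:+.2f}, p = {p:.3f}")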

So the answer to why we don’t look at more actionable analysis is simply that I do not have the data for it. I have many ideas of things that could be explored by downloading a large number of match demo files and parsing them programmatically, but it is not important enough for me to spend more time on. Sorry!