Hello, pokefans! At this point you've no doubt noticed that this month's usage stats look a bit different. That's because, instead of just counting the number of times each Pokemon appears on a team, our stats are now weighted by a system that takes into account a player's ranking. The point of this thread is to explain our system and to answer any questions any of you might have.
I am not interested in your opinions on whether or not this weighted stat system is a good thing. Any posts along those lines will be deleted.
tl;dr--if you're a "good" player (meaning you're not a newbie, and you're not using a troll team), our new weighting system will not noticeably affect how your teams are counted. Beyond a fairly low cutoff, having a higher rating will not lead to your teams being "counted more." The purpose of this weighting system is to reduce how much bad teams, and especially deliberately bad teams, affect our stats.
What are usage stats? Why are they important?
Let's start simple: the usage statistics we publish each month reflect the Pokemon that are used on our Pokemon Showdown simulator server. Each rated battle is logged, and these logs contain full team data (not just the names of the Pokemon but also their EV spreads, movesets, etc.) as well as a log of the battle itself.
Each month, we distill these logs into (what we hope is) useful information: moveset statistics, metagame analyses, and, most importantly, usage statistics, which report how often a given Pokemon is used throughout the month. Why is this information important? Well, I hope the moveset and metagame information is useful for players trying to get into a new metagame or trying to improve on their teams (for instance, if your team is having trouble with Ferrothorn, it might be useful for you to go into the moveset data and see that--statistically--the best counters for Ferrothorn are Volcarona, Heatran and Torkoal).
Usage statistics are even more important, as we use them to determine our tiers.
What does OU mean?
One of the grandest and most controversial questions we have in this community is what our tier list represents. Is an OU Pokemon more powerful than one in a lower tier? Is Gastrodon a "better" Pokemon than Victini or Zapdos?
My personal answer to this question is no: our tiers do not do a good job of ranking a Pokemon's "power." So if a UU Pokemon isn't inherently better than one in RU, what's the point of tiers?
Again, this is a controversial subject, but my answer is that OU means what it stands for, that these Pokemon are simply "overused," and that the primary function of tiers is as threat lists. To elaborate, I'm going to point you folks to the original definition of our current OU-UU cutoff: in short, a Pokemon is OU if, in playing 20 battles, there's at least a 50% chance of you encountering that Pokemon at least once. This is an acknowledgement of the fact that there are 649 Pokemon out there--if you're designing a team of six Pokemon, it's unlikely that you're going to be able to make sure that your team has a way of dealing with each and every one of them. But if you're making an OU team, you probably will never have to worry if your team gets completely wrecked by Leavanny, since it doesn't even appear on one team in a thousand. What the OU/UU cutoff literally says is: "if said Pokemon is UU or below, you still have a good shot of going 20-0 even if your team is super weak to that Pokemon."
So keep that philosophy in mind for the rest of this article.
The old system and why it needed to be fixed
So tiers are threatlists, and we determine tiers through usage statistics. The way we did this in the past was simple: we counted up all the Pokemon that appeared on an OU team in a given month; then, after three months, we combined the usage statistics, weighting the most recent month 5/6ths, the month before 1/8th and the month before that 1/24th. If the combined percentage was greater than 3.406367107%, we declared that Pokemon OU (repeat the process for UU to get RU, RU to get NU, and NU to get the unofficial PU tier).
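For the curious, here's a rough Python sketch of where that 3.406367107% comes from and how the three-month blend works. It assumes each battle is an independent draw and that a Pokemon's usage fraction equals your chance of meeting it in any one battle; the function names (`ou_cutoff`, `combined_usage`) are made up for illustration and are not taken from our actual scripts.

```python
# Monthly weights quoted above, most recent month first (they sum to 1).
MONTH_WEIGHTS = (5 / 6, 1 / 8, 1 / 24)


def ou_cutoff(battles=20, chance=0.5):
    """Usage fraction p that solves 1 - (1 - p)**battles == chance.

    Assumption: battles are independent, and p is the chance of
    meeting the Pokemon in any single battle.
    """
    return 1 - (1 - chance) ** (1 / battles)


def combined_usage(latest, previous, oldest):
    """Blend three monthly usage fractions (0.0 to 1.0) with MONTH_WEIGHTS."""
    months = (latest, previous, oldest)
    return sum(w * u for w, u in zip(MONTH_WEIGHTS, months))


if __name__ == "__main__":
    cutoff = ou_cutoff()
    print(f"OU cutoff: {cutoff:.9%}")
    # Hypothetical Pokemon with 4%, 3% and 2% usage over three months.
    usage = combined_usage(0.04, 0.03, 0.02)
    print(f"combined usage: {usage:.4%}, OU: {usage > cutoff}")
```

Note how heavily the most recent month dominates: a Pokemon's usage two months back barely moves the needle.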
I should note at this point that not all teams were counted under the old system. If a battle lasted fewer than six turns, it was thrown out. The idea here was to make it harder for people to "spam the stats" by forfeiting immediately and then looking for a new battle. It also meant that if you selected the wrong tier by mistake or made a mistake in your teambuilding, and you decided not to go through with the battle (for obvious reasons), then your team didn't get counted towards the usage stats.
All that was well and good in the PO days, when you needed to download and install a program, then select a server that wasn't the default, in order to play on our ladders. Pokemon Showdown, however, has a much lower barrier to entry than Pokemon Online, and Smogon is also the default server. This means that we see tons more activity than we used to (all in all a good thing), but it also means we see a ton more inexperienced and "casual" players. And under the old system, the battles of someone using Ash's team from the anime are weighted the same as the battles of our most skilled players.
Is this a bad thing? I remind you of the philosophy behind our tiers: they're threatlists. It doesn't matter if Pikachu is everywhere in the metagame, you still don't need to know how to counter it if the only people using it think you can take out a Rhydon by "aiming for its horn."
I wish I were exaggerating, but you can actually look this up in the moveset statistics:
The average "weight" of a player using Pikachu in OU is 0.239 (I'll get into what that means in a few sections) and its top teammates are, in order, Charizard, Blastoise, Venusaur, Snorlax and Lapras.
Of course, Pikachu isn't about to go OU, but Charizard lives fairly close to the cutoff, and there are plenty of other Pokemon whose usages are boosted disproportionately by inexperienced users who aren't really a threat to any decent player.
There's a separate, more important issue that led us to abandon the old system: spammers who deliberately try to manipulate the tiers. While the old system already tried to make "spamming the tiers" difficult by discarding early forfeits, the bottom line is that we could do nothing about a player who had a team of six Pokemon, all with useless items and awful movesets, who lost his battles without forfeiting.
What we want in a stat system
So if the old system is broken, we're left with the issue of what to replace it with. After thinking on it for a while, I settled on the following criteria for any new stat system:
1. The system should result in tier lists that correspond to the principle that they function, first and foremost, as threat lists.
2. It should be difficult to game.
3. The contribution from bad players and trolls should be minimized. This is the first component of (2).
4. Decent players and good players should contribute strongly to the stats.
5. The difference in weighting between a decent player and a good player should be relatively small--we're not looking for "1337" stats. This is also tied to (2), as there should be no advantage in trying to game the ladder, at least where tiering is concerned.
6. The system should make some sort of logical sense beyond the empirical "because it works."
I dwelled on these criteria for a good long while, and after a lot of careful thought, I came up with the following principle:
A Pokemon should be OU if it appears a sufficient number of times on teams used by players who are better than average.
You might think that this concept of "better than average" would be difficult to define, but it's really not--Pokemon Showdown rates players based on a system called Glicko2. In contrast to PO's Elo system, your Glicko2 rating is actually two numbers: a rating (R) and a deviation (RD). The idea is that it's impossible to know a player's "true" rating, but we can guess that that rating falls within a probability distribution, in this case, a normal distribution of width RD and center R.
Below are two sample distributions: one for a player of rating 1600±100 and one whose rating is 1575±50.
For those unfamiliar with probability distributions, the idea is that the probability of a player's "true rating" being in the infinitesimal interval r<rating<r+dr is f(r)dr.
So looking at these two graphs, the question is who's the better player (assuming a player is "better" than another if his or her "true rating" is greater)? Using the Glicko2 system, it's impossible to know for certain, but what we can determine is the likelihood that one player is better than another player.
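As a rough illustration of that last point, here's a small Python sketch for the two sample players. It assumes the two "true ratings" are independent normal distributions, so their difference is also normal; the function name `prob_better` is just something I made up for this example.

```python
import math


def prob_better(r_a, rd_a, r_b, rd_b):
    """Probability that player A's true rating exceeds player B's.

    Assumption: the two true ratings are independent normals, so the
    difference is normal with mean r_a - r_b and width
    sqrt(rd_a**2 + rd_b**2).
    """
    z = (r_a - r_b) / math.hypot(rd_a, rd_b)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


if __name__ == "__main__":
    # The two sample players: 1600±100 versus 1575±50.
    p = prob_better(1600, 100, 1575, 50)
    print(f"P(first player is better) = {p:.3f}")
```

Under those assumptions the 1600±100 player comes out ahead a bit less than 60% of the time, which is to say we really can't tell the two apart with any confidence.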
Which brings us back to the idea of the average player. Not to get too technical, but with our rating system, each player starts off with a rating of 1500 and a deviation of 350.
"But wait, Antar," I hear you say. "My starting rating was 1000, not 1500." Alas, you're confusing your rating with your ACRE, which is a conservative rating estimate. CREs basically say, "I don't know for certain how good a player is, but I'm pretty sure he or she is better than this." Specifically, there's about a 92% chance that your true rating is better than your ACRE. Okay. Glad I got that out of the way. When I'm talking ratings here, I'm talking Glicko2, not ACRE.
Okay, so each player starts off with a rating of 1500. As they battle, their rating goes up when they win, down when they lose. I'm not going to get into the specifics. The bottom line is, at the end of the day, the distribution of the ratings of all the players should pretty much be centered around 1500. And thus, we define the average player to have a true rating of 1500.
Our new weighting system
So with our stated premise that we're looking only at teams used by players who are better than average, and with our definition of average, the obvious solution is to throw out teams by players whose rating is less than or equal to 1500. While this is certainly simple enough, it has the problem that, in the Glicko2 system, we don't actually know a player's true rating. We only know the probability distribution corresponding to their rating.
But all is not lost, because it's relatively simple to figure out what the likelihood is that a player's true rating is greater than 1500: it's just the integral of their rating distribution from 1500 to infinity, which works out to be
P(r>1500)=(1+erf((rating-1500)/(sqrt(2)*deviation)))/2
It is this probability that we will be using to weight our stats.
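To make "weighting the stats" a bit more concrete, here's a minimal Python sketch of a weighted tally. It assumes each team arrives with its weight already computed, and that a Pokemon's usage is its summed weight divided by the total weight of all teams; that normalization and the name `weighted_usage` are my own illustration, not a description of our production scripts.

```python
from collections import defaultdict


def weighted_usage(teams):
    """Return {pokemon: usage fraction} from (weight, [pokemon, ...]) pairs.

    Assumption: usage is the summed weight of the teams carrying a
    Pokemon divided by the summed weight of all teams.
    """
    totals = defaultdict(float)
    total_weight = 0.0
    for weight, pokemon in teams:
        total_weight += weight
        for name in set(pokemon):  # count each species once per team
            totals[name] += weight
    if total_weight == 0:
        return {}
    return {name: w / total_weight for name, w in totals.items()}


if __name__ == "__main__":
    # Hypothetical teams, trimmed to one or two Pokemon for brevity.
    sample = [
        (1.0, ["Ferrothorn", "Heatran"]),
        (0.8, ["Ferrothorn"]),
        (0.2, ["Pikachu", "Charizard"]),
    ]
    for name, share in sorted(weighted_usage(sample).items()):
        print(f"{name}: {share:.1%}")
```

With unweighted counting, Pikachu would show up on a third of those teams; with the weights applied, it accounts for a tenth.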
To get a feel for how this weighting system works, here are some sample calculations:
- A brand new player, just starting out, whose rating is 1500±350 will be weighted 0.5.
- I have a mediocre OU team I sometimes play on PS under an alt. It currently has a provisional rating of 1576±105. Its weighting is 0.77.
- The person who I just demolished with that team (you have to be pretty bad...) has a provisional rating of 1394±139. Weighting is 0.223.
- My good (not great) OU team has a rating of 1946±177. Weighting is 0.994.
- The person at the top of the ladder right now has a rating of 2120±55. Their weighting is 1.0, for all intents and purposes.
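If you'd like to check those numbers yourself, the formula translates directly into a few lines of Python. This is only a sketch: `weight` is a name I picked for the example, and `math.erf` is the standard library's error function.

```python
import math


def weight(rating, deviation):
    """P(true rating > 1500) for a Glicko2 rating and deviation."""
    z = (rating - 1500) / (math.sqrt(2) * deviation)
    return (1 + math.erf(z)) / 2


if __name__ == "__main__":
    samples = [
        ("brand new player", 1500, 350),
        ("mediocre OU team", 1576, 105),
        ("demolished opponent", 1394, 139),
        ("good OU team", 1946, 177),
        ("top of the ladder", 2120, 55),
    ]
    for label, rating, deviation in samples:
        print(f"{label}: {weight(rating, deviation):.3f}")
```

Notice that the deviation matters as much as the rating: a big deviation pulls the weight towards 0.5 no matter which side of 1500 you're on.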
So this system has the properties I was looking for: bad players are only counted at about 20% or less, while mediocre teams count almost 80% and good teams--no matter how good--are pretty much fully counted.
Put another way, here's a graph showing what percentages of players make up what percentage of the stats (this is for OU, for the first half of January)
To summarize:
- 1% of teams make up 1.76% of stats (ratio: 1.76:1)
- 5% of teams make up 8.79% of stats (ratio: 1.76:1)
- 10% of teams make up 17.58% of stats (ratio: 1.76:1)
- 20% of teams make up 35.16% of stats (ratio: 1.76:1)
- 50% of teams make up 80.80% of stats (ratio: 1.61:1)
- 77.53% of teams make up 99% of stats (ratio: 1.27:1)
So it doesn't matter if you're in the top 1% or the top 20%--your contribution to the stats is basically the same. But then, if you're in the bottom quarter of the tier (meaning you're either really bad or a troll), your teams contribute basically nothing to the usage stats.
As it should be.
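For anyone who wants to run this kind of breakdown on their own data, here's a rough Python sketch. It assumes you already have one weight per team and ranks teams from highest weight to lowest; the name `share_of_stats` and the toy numbers are invented for illustration and are not the January data.

```python
def share_of_stats(weights, top_fraction):
    """Fraction of the total weight contributed by the top-weighted teams.

    Assumption: teams are ranked by weight, highest first, and
    top_fraction (0.0 to 1.0) picks how many of them to include.
    """
    ranked = sorted(weights, reverse=True)
    count = round(len(ranked) * top_fraction)
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[:count]) / total


if __name__ == "__main__":
    # Toy ladder: four fully counted teams, four half-counted, two trolls.
    toy = [1.0] * 4 + [0.5] * 4 + [0.0] * 2
    for fraction in (0.2, 0.5, 0.8):
        share = share_of_stats(toy, fraction)
        print(f"top {fraction:.0%} of teams -> {share:.1%} of stats")
```

As long as every team in the top slice has a weight of about 1, the share grows in direct proportion to the slice, which would explain why the ratio in the list above stays flat at 1.76:1 through the top 20%.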
Frequently Asked Questions
- Why not just count winning teams? Wouldn't that accomplish the same thing? Well, yes and no. First off, I'm not asking what the likelihood is of a player defeating another one, but rather the likelihood that the player is better than average. The distinction is dense and mathematical, and if you're curious, the probability that you will win a given match against a random opponent is given by your GXE. But beyond that, there's a bigger problem: PS will try to match you with someone with a rating close to yours. Meaning if your rating is 2000, it's much less likely that you'll be paired with someone of rating 1000 than if your rating is 1200. So, at the end of the day, race-to-the-bottom players will get paired with trolls, so half of their teams will be counted, while half of the teams at the top are counted as well, and we're little better off than when we started.
- Which rating are you using? Before the battle? After the battle? At the end of the month? Provisional or actual? With an eye towards making the system harder to game, we're using provisional ratings, logged at the end of the battle.
- What happens if there's a ladder error and the rating doesn't get recorded? Originally, we treated you and your opponent as completely unknown quantities, so your teams were each weighted 0.5, the same as a new player just joining the ladder or a hypothetical player with RD of infinity. But that method ignored a key data point: the outcome of the battle itself. A player with exactly one win and no losses will have a rating of 1662±290, and a player with exactly one loss and no wins will have a rating of 1338±290. So it's these ratings we use if no other is available (they work out to a weighting of 0.71 for the winner and 0.29 for the loser). This is far from ideal, but the ladder error rate is pretty low, and Zarel and others are working on lowering it further.
- Could this same principle be applied to make some sort of "1337" stats? Yes, yes it could! I have experimented with looking at statistics resulting from considering the probability that a player's rating is greater than 1850 (one standard deviation above average) and above 2200 (two standard deviations above average). Interestingly, they don't look much different than the regular stats, which is why I didn't include them in January's stats thread. But due to popular demand, I did decide to include "1850" stats in February's stats and probably will continue to provide them in future months.
- Will metagame and moveset stats be weighted? Absolutely. The idea behind these additional statistics is to provide players with references for set- and team-building. It makes sense to minimize the impact of bad and malicious players on those statistics. Note that "checks and counters" rankings, which are based on the events of a given battle, are weighted by the lower weight of the two players.
- Will this stop Molk? For those with short memories, a few tier updates ago, Molk managed to get Metang into RU (up from PU) by making a decent team built around Metang. I understand the team did decently well. This was back in the PO days, so we don't know what his rating would've been, but it seems likely it would've been a bit above 1500. Thus, this weighting system would not have prevented Molk from getting Metang into RU. The difference between what Molk was doing and what our recent "tier troll" did is that Molk's team was actually viable. So if you were playing RU at that time, and you happened to have no way of dealing with Metang, you would've legitimately been in trouble.
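A few of the answers above (the ladder-error fallback, the "1850" stats and the checks-and-counters weighting) are easy to sketch in Python. Treat this as an illustration under assumptions rather than the real code: the names `weight`, `battle_weight`, `matchup_weight` and `FALLBACK`, and the idea of passing `None` for a missing rating, are all invented for this example.

```python
import math

# Fallback ratings quoted in the ladder-error answer: one win or one loss.
FALLBACK = {"win": (1662, 290), "loss": (1338, 290)}


def weight(rating, deviation, baseline=1500):
    """P(true rating > baseline); baseline=1850 gives the "1850" stats."""
    z = (rating - baseline) / (math.sqrt(2) * deviation)
    return (1 + math.erf(z)) / 2


def battle_weight(logged, outcome, baseline=1500):
    """Weight for one player in one battle.

    logged is a (rating, deviation) pair, or None if a ladder error
    meant nothing was recorded (a hypothetical convention); outcome
    is "win" or "loss".
    """
    rating, deviation = logged if logged is not None else FALLBACK[outcome]
    return weight(rating, deviation, baseline)


def matchup_weight(weight_a, weight_b):
    """Checks-and-counters data use the lower of the two players' weights."""
    return min(weight_a, weight_b)


if __name__ == "__main__":
    winner = battle_weight(None, "win")
    loser = battle_weight(None, "loss")
    print(f"fallback weights: {winner:.2f} / {loser:.2f}")
    print(f"matchup weight: {matchup_weight(winner, loser):.2f}")
    print(f"1946±177 in 1850 stats: {weight(1946, 177, baseline=1850):.3f}")
```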