Weighted Stats FAQ

Antar · Feb 1, 2013

Hello, pokefans! At this point you've no doubt noticed that this month's usage stats look a bit different. That's because, instead of just counting the number of times each Pokemon appears on a team, our stats are now weighted by a system that takes into account a player's ranking. The point of this thread is to explain our system and to answer any questions any of you might have.

I am not interested in your opinions on whether or not this weighted stat system is a good thing. Any posts along those lines will be deleted.

tl;dr--if you're a "good" player (meaning you're not a newbie, and you're not using a troll team), our new weighting system will not noticeably affect how your teams are counted. Beyond a fairly low cutoff, having a higher rating will not lead to your teams being "counted more." The purpose of this weighting system is to reduce how much bad teams, and especially deliberately bad teams, affect our stats.

What are usage stats? Why are they important?
Let's start simple: the usage statistics we publish each month reflect the Pokemon that are used on our Pokemon Showdown simulator server. Each rated battle is logged, and these logs contain full team data (not just the names of the Pokemon but also their EV spreads, movesets, etc.) as well as a log of the battle itself.

Each month, we distill these logs into (what we hope is) useful information: moveset statistics, metagame analyses, and, most importantly, usage statistics, which report how often a given Pokemon is used throughout the month. Why is this information important? Well, I hope the moveset and metagame information is useful for players trying to get into a new metagame or trying to improve on their teams (for instance, if your team is having trouble with Ferrothorn, it might be useful for you to go into the moveset data and see that--statistically--the best counters for Ferrothorn are Volcarona, Heatran and Torkoal).

Usage statistics are even more important, as we use them to determine our tiers.

What does OU mean?
One of the grandest and most controversial questions we have in this community is what our tier list represents. Is an OU Pokemon more powerful than one in a lower tier? Is Gastrodon a "better" Pokemon than Victini or Zapdos?

My personal answer to this question is no: our tiers do not do a good job of ranking a Pokemon's "power." So if a UU Pokemon isn't inherently better than one in RU, what's the point of tiers?

Again, this is a controversial subject, but my answer is that OU means what it stands for, that these Pokemon are simply "overused," and that the primary function of tiers is as threat lists. To elaborate, I'm going to point you folks to the original defining of our current OU-UU cutoff: in short, a Pokemon is OU if, in playing 20 battles, there's at least a 50% chance of you encountering that Pokemon at least once. This is an acknowledgement of the fact that there are 649 Pokemon out there--if you're designing a team of six Pokemon, it's unlikely that you're going to be able to make sure that your team has a way of dealing with each and every Pokemon out there. But if you're making an OU team, you probably will never have to worry if your team gets completely wrecked by Leavanny, since it doesn't even appear on one team in a thousand. What the OU/UU cutoff literally says is: "if said Pokemon is UU or below, you still have a good shot of going 20-0 even if your team is super weak to that Pokemon."
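The cutoff percentage follows directly from that definition. Here is a minimal sketch, assuming each battle is an independent trial in which the Pokemon appears on the opposing team with probability p, so that 1 - (1 - p)^20 = 0.5. The function name is made up for illustration.

```python
def ou_cutoff(battles: int = 20, chance: float = 0.5) -> float:
    """Per-battle usage p at which P(seen at least once in `battles`) equals `chance`.

    Assumes battles are independent: 1 - (1 - p) ** battles = chance.
    """
    return 1.0 - (1.0 - chance) ** (1.0 / battles)


print(f"{ou_cutoff():.4%}")  # about 3.4064%
```

Changing `battles` or `chance` shows how sensitive the cutoff is to the choice of "20 battles" and "50%".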

So keep that philosophy in mind for the rest of this article.

The old system and why it needed to be fixed
So tiers are threatlists, and we determine tiers through usage statistics. The way we did this in the past was simple: we simply counted up all the Pokemon that appeared on an OU team in a given month, then, after three months, we combined usage statistics, weighting the most recent month 5/6ths, the month before 1/8th and the month before that 1/24th, and if the combined percentage was greater than 3.406367107%, we declared that Pokemon OU (repeat the process for UU to get RU, RU to get NU, and NU to get the unofficial PU tier).
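As a rough sketch of that three-month combination: the weights and cutoff below are the ones quoted above, while the function names and the monthly usage figures are invented for the example.

```python
# Month weights from the post: most recent, one month back, two months back.
MONTH_WEIGHTS = (5 / 6, 1 / 8, 1 / 24)  # these sum to 1
OU_CUTOFF = 0.03406367107  # 3.406367107%


def combined_usage(recent: float, previous: float, oldest: float) -> float:
    """Blend three monthly usage fractions into one combined figure."""
    months = (recent, previous, oldest)
    return sum(w * u for w, u in zip(MONTH_WEIGHTS, months))


def is_ou(recent: float, previous: float, oldest: float) -> bool:
    """A Pokemon is OU if its combined usage exceeds the cutoff."""
    return combined_usage(recent, previous, oldest) > OU_CUTOFF


# Hypothetical usage fractions, not real stats.
print(is_ou(0.036, 0.030, 0.020))
```

Because the most recent month carries 5/6 of the weight, a Pokemon that only recently rose above the cutoff can still qualify despite two weaker months.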

I should note at this point that not all teams were counted under the old system. If a battle lasted fewer than six turns, it was thrown out. The idea here was to make it harder for people to "spam the stats" by forfeiting immediately and then looking for a new battle. It also meant that if you selected the wrong tier by mistake or made a mistake in your teambuilding, and you decided not to go through with the battle (for obvious reasons), your team didn't get counted towards the usage stats.

All that was well and good in the PO days, when you needed to download and install a program, then select a server that wasn't the default, in order to play on our ladders. Pokemon Showdown, however, has a much lower barrier to entry than Pokemon Online, and Smogon is also the default server. This means that we see tons more activity than we used to (all in all a good thing), but it also means we see a ton more inexperienced and "casual" players. And under the old system, the battles of someone using Ash's team from the anime are weighted the same as the battles of our most skilled players.

Is this a bad thing? I remind you of the philosophy behind our tiers: they're threatlists. It doesn't matter if Pikachu is everywhere in the metagame, you still don't need to know how to counter it if the only people using it think you can take out a Rhydon by "aiming for its horn."

I wish I were exaggerating, but you can actually look this up in the moveset statistics:

The average "weight" of a player using Pikachu in OU is 0.239 (I'll get into what that means in a few sections), and its top teammates are, in order, Charizard, Blastoise, Venusaur, Snorlax and Lapras.

Of course, Pikachu isn't about to go OU, but Charizard lives fairly close to the cutoff, and there are plenty of other Pokemon whose usages are boosted disproportionately by inexperienced users who aren't really a threat to any decent player.

There's a separate, more important issue that led us to abandon the old system: spammers who deliberately try to manipulate the tiers. While the old system already tried to make "spamming the tiers" difficult by discarding early forfeits, the bottom line is that we could do nothing about a player who had a team of six Pokemon, all with useless items and awful movesets, who lost his battles without forfeiting.

What we want in a stat system
So if the old system is broken, we're left with the issue of what we do to replace it. Thinking on it for a while, I came up with the following criteria for any new stat system I could come up with:

1. The system should result in tier lists that correspond to the principle that they function, first and foremost, as threat lists.
2. It should be difficult to game.
3. The contribution from bad players and trolls should be minimized. This is the first component of (2).
4. Decent players and good players should contribute strongly to the stats.
5. The difference in weighting between a decent player and a good player should be relatively small--we're not looking for "1337" stats. This is also tied to (2), as there should be no advantage in trying to game the ladder, at least where tiering is concerned.
6. The system should make some sort of logical sense beyond the empirical "because it works."

I dwelled on these criteria for a good long while, and after a lot of careful thought, I came up with the following principle:

A Pokemon should be OU if it appears a sufficient number of times on teams used by players who are better than average.

You might think that this concept of "better than average" would be difficult to define, but it's really not--Pokemon Showdown rates players based on a system called Glicko2. In contrast to PO's Elo system, your Glicko2 rating is actually two numbers: a rating (R) and a deviation (RD). The idea is that it's impossible to know a player's "true" rating, but we can guess that that rating falls within a probability distribution, in this case, a normal distribution of width RD and center R.

Below are two sample distributions: one for a player of rating 1600±100 and one whose rating is 1575±50.

For those unfamiliar with probability distributions, the idea is that the probability of a player's "true rating" being in the infinitesimal interval r<rating<r+dr is f(r)dr.

So looking at these two graphs, the question is who's the better player (assuming a player is "better" than another if his or her "true rating" is greater)? Using the Glicko2 system, it's impossible to know for certain, but what we can determine is the likelihood that one player is better than another player.
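One way to put a number on that likelihood, offered as a sketch rather than anything the ladder itself computes: if we assume the two players' true ratings are independent normal distributions, their difference is also normal, with mean R_A - R_B and width sqrt(RD_A^2 + RD_B^2). The function name is hypothetical.

```python
from math import hypot
from statistics import NormalDist


def prob_better(rating_a: float, rd_a: float, rating_b: float, rd_b: float) -> float:
    """P(true rating of A > true rating of B), assuming independent normals."""
    difference = NormalDist(mu=rating_a - rating_b, sigma=hypot(rd_a, rd_b))
    return 1.0 - difference.cdf(0.0)


# The two sample players from above: 1600±100 and 1575±50.
print(round(prob_better(1600, 100, 1575, 50), 2))  # roughly 0.59
```

So the 1600±100 player is only slightly more likely to be the better one, which is why the two distributions are hard to rank by eye.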

Which brings us back to the idea of the average player. Not to get too technical, but with our rating system, each player starts off with a rating of 1500 and a deviation of 350.

"But wait, Antar," I hear you say. "My starting rating was 1000, not 1500." Alas, you're confusing your rating with your ACRE, which is a conservative rating estimate. CREs basically say, "I don't know for certain how good a player is, but I'm pretty sure he or she is better than this." Specifically, there's about a 92% chance that your true rating is better than your ACRE. Okay. Glad I got that out of the way. When I'm talking ratings here, I'm talking Glicko2, not ACRE.

Okay, so each player starts off with a rating of 1500. As they battle, their rating goes up when they win, down when they lose. I'm not going to get into the specifics. The bottom line is, at the end of the day, the distribution of the ratings of all the players should pretty much be centered around 1500. And thus, we define the average player to have a true rating of 1500.

Our new weighting system
So with our stated premise that we're looking only at teams used by players who are better than average, and with our definition of average, the obvious solution is to throw out teams by players whose rating is less than or equal to 1500. While this is certainly simple enough, it has the problem that, in the Glicko2 system, we don't actually know a player's true rating. We only know the probability distribution corresponding to their rating.

But all is not lost, because it's relatively simple to figure out what the likelihood is that a player's true rating is greater than 1500: it's just the integral of their rating distribution from 1500 to infinity, which works out to be

P(r>1500)=(1+erf((rating-1500)/(sqrt(2)*deviation)))/2

It is this probability that we will be using to weight our stats.
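The formula translates almost directly into code. This is a minimal sketch; the function name and the `cutoff` parameter (defaulting to the 1500 in the formula) are illustrative choices, not the names used in the actual stats scripts.

```python
from math import erf, sqrt


def weight(rating: float, deviation: float, cutoff: float = 1500.0) -> float:
    """P(true rating > cutoff) for a normal distribution centered on `rating`
    with width `deviation`, i.e. the integral from `cutoff` to infinity."""
    return (1.0 + erf((rating - cutoff) / (sqrt(2.0) * deviation))) / 2.0


# A brand new player sits exactly at the cutoff, so half the distribution is above it.
print(weight(1500, 350))  # 0.5
```

Note that the weight depends on both numbers: a small deviation pushes the weight toward 0 or 1, while a large deviation keeps it closer to 0.5.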

To get a feel for how this weighting system works, here are some sample calculations:

A brand new player, just starting out, whose rating is 1500±350 will be weighted 0.5.
I have a mediocre OU team I sometimes play on PS under an alt. It currently has a provisional rating of 1576±105. Its weighting is 0.77.
The person who I just demolished with that team (you have to be pretty bad...) has a provisional rating of 1394±139. Weighting is 0.223.
My good (not great) OU team has a rating of 1946±177. Weighting is 0.994
The person at the top of the ladder right now has a rating of 2120±55. Their weighting is 1.0, for all intents and purposes.

So this system has the properties I was looking for: bad players are only counted at about 20% or less, while mediocre teams count almost 80% and good teams--no matter how good--are pretty much fully counted.
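To make the effect on usage percentages concrete, here is a sketch of a weighted tally. It assumes a Pokemon's usage is the summed weight of the teams carrying it divided by the summed weight of all teams; that normalization is an assumption, as are the function names and the two example teams (the ratings are borrowed from the sample calculations above).

```python
from collections import defaultdict
from math import erf, sqrt


def weight(rating: float, deviation: float) -> float:
    """P(true rating > 1500), as in the formula above."""
    return (1.0 + erf((rating - 1500.0) / (sqrt(2.0) * deviation))) / 2.0


def weighted_usage(teams):
    """teams: iterable of (rating, deviation, list of Pokemon names)."""
    totals = defaultdict(float)
    total_weight = 0.0
    for rating, deviation, pokemon in teams:
        w = weight(rating, deviation)
        total_weight += w
        for name in set(pokemon):  # count each Pokemon once per team
            totals[name] += w
    if total_weight == 0.0:
        return {}
    return {name: t / total_weight for name, t in totals.items()}


# Hypothetical teams, shortened to two Pokemon each.
teams = [
    (1946, 177, ["Ferrothorn", "Heatran"]),
    (1394, 139, ["Pikachu", "Charizard"]),
]
usage = weighted_usage(teams)
print(round(usage["Pikachu"], 2))  # roughly 0.18, versus 0.50 unweighted
```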

Put another way, here's a graph showing what percentages of players make up what percentage of the stats (this is for OU, for the first half of January).

To summarize:

1% of teams make up 1.76% of stats (ratio: 1.76:1)
5% of teams make up 8.79% of stats (ratio: 1.76:1)
10% of teams make up 17.58% of stats (ratio: 1.76:1)
20% of teams make up 35.16% of stats (ratio: 1.76:1)
50% of teams make up 80.80% of stats (ratio: 1.61:1)
77.53% of teams make up 99% of stats (ratio: 1.27:1)

So it doesn't matter if you're in the top 1% or the top 20%--your contribution to the stats is basically the same. But then, if you're in the bottom quarter of the tier (meaning you're either really bad or a troll), your teams contribute basically not at all to the usage stats.

As it should be.
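A summary like the one above can be produced by ranking teams by weight and asking what share of the total weight the top slice holds. The sketch below uses made-up weights for ten teams, not real ladder data, and a hypothetical function name.

```python
def share_of_stats(weights, top_fraction: float) -> float:
    """Fraction of the total weight contributed by the top `top_fraction` of teams."""
    ranked = sorted(weights, reverse=True)
    count = round(top_fraction * len(ranked))
    return sum(ranked[:count]) / sum(ranked)


# Invented weights for ten teams, for illustration only.
weights = [1.0, 1.0, 0.99, 0.95, 0.8, 0.6, 0.5, 0.3, 0.1, 0.02]
print(round(share_of_stats(weights, 0.5), 3))  # about 0.757
```

When every team in the top slice has a weight close to 1, the ratio of stats share to team share is roughly one over the average weight, which would explain why the ratio stays flat at 1.76:1 through the top 20%.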

Frequently Asked Questions

Why not just count winning teams? Wouldn't that accomplish the same thing?
Well, yes and no. First off, I'm not asking what the likelihood is of a player defeating another one, but rather the likelihood that the player is better. The distinction is dense and mathematical, and if you're curious, the probability that you will win a given match against a random opponent is given by your GXE. But beyond that, there's a bigger problem: PS will try to match you with someone with a rating close to yours. Meaning if your rating is 2000, it's much less likely that you'll be paired with someone of rating 1000 than if your rating is 1200. So, at the end of the day, race-to-the-bottom players will get paired with trolls, so half of their teams will be counted, while half of the teams at the top are counted as well, and we're little better off than when we started.

Which rating are you using? Before the battle? After the battle? At the end of the month? Provisional or actual?
With an eye towards making the system harder to game, we're using provisional ratings, logged at the end of the battle.

What happens if there's a ladder error and the rating doesn't get recorded?
Originally, we treated you and your opponent as completely unknown quantities, so your teams were each weighted 0.5, the same as a new player just joining the ladder or a hypothetical player with RD of infinity. But that method ignored a key data point: the outcome of the battle itself. A player with exactly one win and no losses will have a rating of 1662±290, and a player with exactly one loss and no wins will have a rating of 1338±290. So it's these ratings we use if no other is available (they work out to a weighting of 0.71 for the winner and 0.29 for the loser). This is far from ideal, but the ladder error rate is pretty low, and Zarel and others are working on lowering it further.

Could this same principle be applied to make some sort of "1337" stats?
Yes, yes it could! I have experimented with looking at statistics resulting from considering the probability that a player's rating is greater than 1850 (one standard deviation above average) and greater than 2200 (two standard deviations above average). Interestingly, they don't look much different than the regular stats, which is why I didn't include them in January's stats thread. But due to popular demand, I did decide to include "1850" stats in February's stats and probably will continue to provide them in future months.

Will metagame and moveset stats be weighted?
Absolutely. The idea behind these additional statistics is to provide players with references for set- and team-building. It makes sense to minimize the impact of bad and malicious players on those statistics. Note that "checks and counters" rankings, which are based on the events of a given battle, are weighted by the lower weight of the two players.

Will this stop Molk?
For those with short memories, a few tier updates ago, Molk managed to get Metang into RU (up from PU) by making a decent team built around Metang. I understand the team did decently well. This was back in the PO days, so we don't know what his ranking would've been, but it seems likely it would've been a bit above 1500. Thus, this rating system would not have prevented Molk from getting Metang into RU. The difference here between what Molk was doing and what our recent "tier troll" did is that Molk's team was actually viable. So if you were playing RU at that time, and you happened to have no way of dealing with Metang, you would've legitimately been in trouble.
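Several of these answers come down to small variations on the same weight calculation. The sketch below is illustrative only: the fallback ratings and the "lower weight of the two players" rule come from the answers above, while the function names, the `cutoff` parameter, and the choice to handle each player's missing rating separately are assumptions.

```python
from math import erf, sqrt

# Ratings used when the ladder failed to record one (one win / one loss from scratch).
FALLBACK_WINNER = (1662, 290)
FALLBACK_LOSER = (1338, 290)


def weight(rating: float, deviation: float, cutoff: float = 1500.0) -> float:
    """P(true rating > cutoff); cutoff=1850 would give "1850"-style stats."""
    return (1.0 + erf((rating - cutoff) / (sqrt(2.0) * deviation))) / 2.0


def battle_weights(winner, loser, cutoff: float = 1500.0):
    """winner/loser: (rating, deviation), or None if the rating was not recorded."""
    w_win = weight(*(winner or FALLBACK_WINNER), cutoff=cutoff)
    w_lose = weight(*(loser or FALLBACK_LOSER), cutoff=cutoff)
    return w_win, w_lose


def matchup_weight(w_a: float, w_b: float) -> float:
    """Checks-and-counters data is weighted by the lower of the two players' weights."""
    return min(w_a, w_b)


w_win, w_lose = battle_weights(None, None)
print(round(w_win, 2), round(w_lose, 2))  # 0.71 0.29
```

Calling `weight(1946, 177, cutoff=1850)` instead shows how the same player counts for noticeably less once the bar is raised from 1500 to 1850.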

MicfiJasan · Feb 3, 2013

Antar said:
The old system and why it needed to be fixed
So tiers are threatlists, and we determine tiers through usage statistics. The way we did this in the past was simple: we simply counted up all the Pokemon that appeared on an OU team in a given month, then, after three months, we combined usage statistics, weighting the most recent month 5/6ths, the month before 1/8th and the month before that 1/24th, and if the combined percentage was greater than 3.406367107%, we declared that Pokemon OU (repeat the process for UU to get RU, RU to get NU, and NU to get the unofficial PU tier).

Quick clarification question: Does this mean that you're changing how the three-month stats are generated as well? Or are you keeping that process the same and just using the weighted stats?

Antar · Feb 3, 2013

MicfiJasan said:
Quick clarification question: Does this mean that you're changing how the three-month stats are generated as well? Or are you keeping that process the same and just using the weighted stats?

No, that part will stay the same, just using the weighted stats instead of unweighted.

PrimalTaylorSwift · Feb 8, 2013

Will this bring about more suspect testing(this month), specifically between OU and Ubers?

Antar · Feb 8, 2013

iHeartlessHero1337 said:
Will this bring about more suspect testing(this month), specifically between OU and Ubers?

As far as I know, no. These stats should have no effect on suspect tests.

Tassa · May 1, 2013

Ok, this thread has been posted two months ago, but after this time, wondering about how will the 1850 April stats looks like, I came here.
I felt that the numerous bad ranked player end to count more than those who have really 1850, even if less than in non weighted stats.

I noted that :
-this is called "1850 stats" when what is used is :

[the] likelihood that a player's true rating is greater than 1500

I feel that I wasn't the only one to think "1850 stats -> weighted with the likelihood to be over 1850", so though I know that this name come from "1500 + 1 deviation" I think it doesn't fit.
Imo, calling it "1500 stats" would be better.

-these stats are often seen as reflecting what good players use but :

A brand new player, just starting out, whose rating is 1500±350 will be weighted 0.5.

This left me thoughtful.
If we keep in mind that this is "1500 stats", we understand from where it come, but the fact is that it decreases a lot the quality with all new alts spamming craps before getting down.

After these two points, I would suggest, bar renaming "1850 stats" into "1500 stats", to give us " 'real' 1850 stats" (real is maybe not the good word, but you understood) with the likelihood of the player being over 1850, to see what is really played on high ladder.
I don't know how much work it would take to be fair, not too much I hope, but it would definitely be very interesting.

After having realized that these stats take in account most players, I returned on this :

I remind you of the philosophy behind our tiers: they're threatlists. It doesn't matter if Pikachu is everywhere in the metagame, you still don't need to know how to counter it if the only people using it think you can take out a Rhydon by "aiming for its horn."

If we want threatlists, wouldn't it be more accurate to see what serious players play ?
Bottom of the ladder is mainly made of people spamming crap for fun, and if they don't surprise you with extreme gimmick (in sets, new pokes should be played around) and haxx, you're nearly sure to beat them.
So what these people play really doesn't matter, you don't need to be aware of 1% of Charizard.

So taking these games in account made the usage bad threatlists.
If you want to play a little bit seriously, you'll never face these people spamming crap for fun, and you'll face, be you very good or average, not the same mons that these spammed on bottom.

Will you prepare in the same way while teambuilding using OU stats or weighted stats (which represent what you're more likely to face) ? Surely no.
I'll take a few random examples from the March stats: Amoonguss is at 1.3% in standard stats and 2.2% in weighted stats, Landorus at 9% in standard vs 12% in weighted, Keldeo at 10.4% vs 14.4%, and Latias at 7.7% vs 10%.

A last quick point is about the UU/OU cutoff, which is directly related.

What the OU/UU cutoff literally says is: "if said Pokemon is UU or below, you still have a good shot of going 20-0 even if your team is super weak to that Pokemon."

So wouldn't the stats weighted with the likelihood of being over 1500 be better for that?
And when you see that the more you win, the higher you'll be ranked, this statement seems to really misfit what the cutoff from standard stats represents.

[I know that this part start to tell about something else than rating, but I feel it's related, though a new thread would probably be better if we start to debate about that]

Thanks for reading.

tl;dr : "1850 stats" should be renamed "1500 stats", it would be interesting to use them in place of standard stats for the UU/OU cutoff, and stats weighted with the likelihood of being over 1850 would be very cool.

Antar · May 3, 2013

Tassa said:
I felt that the numerous bad ranked player end to count more than those who have really 1850, even if less than in non weighted stats.

I've done detailed numerical analysis of which-players-with-what-ratings-contribute-how-much-to-the-stats, and trust me--bad (and even average) players contribute not even a significant fraction of a percent to the 1850 stats. The weighting function is basically a Heaviside.

And I can't figure out what the hell you're trying to say in the second part.

Tassa · May 10, 2013

Sorry for the time to answer, and if my first post was unclear (I'm not especially good at English, so I may not have noticed that some things weren't clear because of that, too).

Basically, the first part is to say that '1850 stats' is a name that induces false thinking about them, and that the fact that a new alt on his first try will have half of his fights counted doesn't help at all in trusting them.
The second says that, if stats are threatlists (and thus used for defining OU), '1850 stats' would be better at this job since they don't count most of the "crap spamming for fun" at the bottom of the ladder, while still reflecting what average players do.

For the numerical analysis, I guess that if you say so then it is true, but without seeing what the stats would be with the likelihood of being over 1850 (or 1600/1700...), it is hard to trust that they would be very, very similar to those with the likelihood of being over 1500. Also, I don't know how much work it is to produce them, but for sure it would be very interesting.

ChaosAkita · May 11, 2013

How does this change how we should interpret the monthly statistics? Does "RAW" refer to the unweighted usage and "REAL" for the weighted usage? Or am I missing something here?

Antar · Jul 6, 2013

ChaosAkita said:
How does this change how we should interpret the monthly statistics? Does "RAW" refer to the unweighted usage and "REAL" for the weighted usage? Or am I missing something here?

Raw is unweighted. Real is unweighted, only counting Pokemon that actually appear in battle.

Disaster Area · Oct 29, 2013

Sorry, this might be the wrong place to ask, but with gen 6 here, will we be using the weighted stats or just pure usage stats for the sorting of gen 6 tiers? If we start with the weighted stats it might be more forward-thinking. I've enjoyed following the logic of these posts; also I hope I don't get in any trouble for reviving or bumping an old thread, I think what I'm asking is relevant. Also, how soon should we have October's gen 6 usage statistics? (I'm new to Smogon [not PO however] so I don't know how to find out these things yet.)

Antar · Oct 29, 2013

Yes, we will continue to use weighted stats for the 6th-gen tiers. Usage stats get published on the 1st or 2nd of every month (depending on how long it takes them to process).

Disaster Area · Oct 29, 2013

Will the weighted stats be used to sort UU from OU, or just the regular usage stats? I think that's what I meant to say before but I didn't say it with any clarity xD
But thanks very much for your time and help.

Antar · Oct 29, 2013

Yes. We will use the same system for determining tiers. Keep in mind that UU will not be established for quite some time, since first we need to sort out OU vs. Ubers. Once OU has "settled down" we will establish a UU list using the weighted usage stats from the most recent three months, the same way as we've done before. Note that it'll probably be about a year and a half (if not longer) before we get the full tier list (as in RU, NU, hypothetically PU). If you're itching for non-"standard" matches in the meantime, though, keep in mind that Doubles is an official tier now, and there's always Little Cup.
