Hello, everyone.
As explored in previous Policy Review threads, Pokémon Showdown's ladder has a problem: too many experienced players create fresh accounts. This misleads the ladder system into thinking they are actually bad players, so it matches them against actual bad players.
This obviously results in a terrible experience for the actual newbie. What about the good player? Well, in many games, good or averagely skilled players smurf for personal reasons, e.g. because they want to beat people they know they are better than. However, PS has a systemic smurfing problem, in that many Smogon activities straight up require veterans to create new accounts. And these veterans do not have the best time playing on the low ladder.
So, there are two factors involved:
1. The PS ladder system
2. Systems defined in this forum for tiering and tournament purposes, which partially interact with (1). These include suspect tests and OLT.
This thread intends to focus on the first factor. So, how does the ladder work? Quoting from the Pokémon Showdown website:
How the ladder works
Our ladder displays four ratings: Elo, GXE, Glicko-1, and COIL.
Elo is the main ladder rating. It's a pretty normal ladder rating: goes up when you win and down when you lose.
GXE (Glicko X-Act Estimate) is an estimate of your win chance against an average ladder player.
Glicko-1 is a different rating system. It has rating and deviation values.
COIL (Converging Order Invariant Ladder) is mainly used for suspect tests. It goes up as you play games, but not too many games.
Note that win/loss should not be used to estimate skill, since who you play against is much more important than how many times you win or lose. Our other stats like Elo and GXE are much better for estimating skill.
A very streamlined summary. Some takeaways and/or complementary information:
- Elo is a single number. Every player starts at 1000 Elo. If you win it's fine, you get to ascend the ladder. If you lose, you lose points, unless you are already at 1000 Elo. It cannot go below 1000. This is what is called a rating floor - not contemplated in the original design of Elo for Chess.
- Glicko-1 is actually two numbers. Every player starts at 1500±130 Glicko. The first number is your score. The second number ("uncertainty" or "deviation") means how doubtful the system is that you have already reached your skill level.
- Glicko-1's uncertainty decreases the more you play, and naturally increases with the clock. When the system is certain that you have reached your intended skill level, it becomes harder to either increase or decrease your score. However, when enough time has passed, your Glicko-1 uncertainty will automatically increase with no action needed.
- GXE is a single number computed from both Glicko-1 numbers. It ranges from 0% to 100%, so it's useful to measure a player's overall performance without context.
- COIL is a single number computed from your GXE and your matches played. It increases with both, but your GXE mathematically limits how high your COIL can be. That is, no matter how many games someone plays, if their GXE is not good enough, they will never be able to reach a certain COIL goal set in a suspect test.
- Pokémon Showdown's ladder only cares about Elo.
- Glicko-1 is the mathematical base of GXE, which is itself the competitive part of COIL, which is used for getting suspect requirements.
- Glicko-1's mathematical basis is very different from Elo's, and independent of it.
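To make the GXE and COIL bullets concrete, here is a rough sketch, not PS's actual code. The GXE formula is the commonly cited one, and it assumes the reference opponent is the provisional 1500±130 player. The COIL shape (40 * GXE * 2^(-B/N), with B chosen per suspect test) is written from memory of public descriptions, so treat it as an assumption. The function names are mine.

```python
import math

# Constant from the Glicko-1 definition: q = ln(10) / 400.
Q = math.log(10) / 400
BASE_RATING = 1500.0  # provisional Glicko-1 rating quoted above
BASE_RD = 130.0       # provisional deviation quoted above


def gxe(rating: float, rd: float) -> float:
    """Estimated win chance (in %) against a provisional 1500±130 player."""
    # Combine both deviations, then damp the rating gap accordingly.
    g = 1 / math.sqrt(1 + 3 * Q**2 * (rd**2 + BASE_RD**2) / math.pi**2)
    return 100 / (1 + 10 ** (g * (BASE_RATING - rating) / 400))


def coil(gxe_percent: float, games: int, b: float) -> float:
    """Assumed COIL shape: 40 * GXE * 2^(-B/N). B is set per suspect test."""
    if games <= 0:
        return 0.0
    return 40 * gxe_percent * 2 ** (-b / games)


print(round(gxe(1066, 83), 1))  # a 1066±83 player
print(gxe(1500, 130))           # a brand new account
```

Under that assumed COIL shape, the value creeps towards 40 * GXE as games pile up but never reaches it, which is the "GXE limits your COIL" point from the list.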
So, historically, Smogon has been more and more about Glicko-1 and less about Elo. Back in 2009, X-Act even suggested using GXE for Smogon leaderboards.
On the other hand, PS originally used ACRE, the rating system criticized by X-Act when he introduced GXE. ACRE was also a single number computed from Glicko-1. However, it required a huge grind in order to rise on the ladder.
Since ACRE fell out of favor, it got replaced as PS's primary rating system. But not by GXE, as X-Act proposed. It turns out that GXE is an unfamiliar laddering system for the uninitiated (it's very simple, though! who doesn't know percentages?). Players want to feel good when they see their rating rise from 1200 to 1300. What's the meaning of rising from 55.3% to 55.4%? "Wait, does that mean that 44.6% of people are better than me? I suck at this." What about straight up using Glicko-1? Well, unless they come from an engineering or sciences background, most players will take fright at the sight of this ± sign. Purists will also not allow the Glicko-1 score to be divorced from its deviation. COIL? Not invented yet, and also not its purpose.
In general, there are two types of ladder systems: skill-based systems and progression-based systems. SBMM (skill-based matchmaking) purports to result in the maximal satisfaction for all players involved by letting them play fair matches. Original Elo is a purely skill-based system, and, in principle, Elo is the rating system everyone uses. However, everyone also dislikes its simplicity, modding it in all sorts of ways.
Accordingly, PS has modded it with two extra "features": a rating floor and a variable K-factor. However, both the rating floor and the variable K-factor are progression-oriented. Now, it turns out you can't have your cake and eat it too, so these Elo distortions severely harm the experience of players in the 1000 Elo range. For example, take a look at the following two anonymous but real players:
- Player A: 1000 Elo, 1500±130 (provisional) Glicko-1, and undefined GXE
- Player B: 1000 Elo, 1066±83 Glicko-1, 9.6% GXE.
If you understood anything from the rating system explanations above, you will know that these two players are in fact made of entirely different materials. However, the current PS matchmaking system will happily match Player A against Player B, no matter whether Player A is the world champion on a new alt or just a very average player. These matches are bad not only for the actually bad player, but also for veterans who create new accounts for suspect tests, OLT, testing new teams, or whatever.
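A minimal sketch of why the floor hides information, assuming the textbook Elo update. The fixed k=32 is a placeholder of mine (PS uses a variable K-factor whose values I am not reproducing here), and the function names are hypothetical.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expectation of scoring against `opponent`."""
    return 1 / (1 + 10 ** ((opponent - rating) / 400))


def elo_update(rating: float, opponent: float, score: float,
               k: float = 32.0, floor: float = 1000.0) -> float:
    """One Elo update with a rating floor. `k` is a placeholder value."""
    new_rating = rating + k * (score - expected_score(rating, opponent))
    # The floor discards every point that would have been lost below it.
    return max(floor, new_rating)


# Ten straight losses starting from the floor leave the rating unchanged,
# so this player looks identical to an account that has never played.
rating = 1000.0
for _ in range(10):
    rating = elo_update(rating, 1000.0, score=0.0)
print(rating)  # 1000.0
```

That is how Player B can sit at the same 1000 Elo as a fresh account, while Glicko-1, which has no floor here, has already dropped to 1066.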
If you got down here, you will notice that I have bolded some phrases. One of them was primary rating system. What does this mean? Well, you obviously can go to the ladder page or type /rank and you will see more than one rating system. So Elo is the most used, and the rest are just for show, right? Or the rest are for players to keep track of them in their own Excel spreadsheet, and hopefully report for suspect voting? Well, no. In fact, COIL is also being tracked, but only for tiers running suspect tests. Ok, so what exactly makes Elo the primary rating system in Pokémon Showdown?
1. Ladder scoreboards are sorted by Elo. You are first place in the ladder if your Elo is the highest, not if your GXE or COIL is highest.
2. As soon as you win a battle, your Elo updates and you get instant feedback of your progression. Instant feedback is good, right?
3. When clicking Look for a Battle, you get matched with any player that is already looking for a battle and within a certain Elo difference, the so-called match-making range or MMR.
Ladder scoreboards being Elo-sorted and PS's Elo implementing an inactivity decay factor are two sides of the same coin, and both are totally working as intended.
However, since Pokémon Showdown's implementation of Elo is not purely skill-based, the matchmaking process is distorted, resulting in cases like the above among 1000 Eloers.
This finally leads to the central proposal of this thread: can these 3 components of Pokémon Showdown's rating be split, so that we keep Elo exclusively for its good parts, and use another of our rating systems for skill-based matchmaking? Games such as Overwatch, League of Legends, Rocket League, Dota 2 and CS:GO already do this. I'll be clear. In contrast to the first thread I linked, which I highly suggest reading, this is not a technical question. This is a policy proposal.
Let's switch PS ladder matchmaking to Glicko-1, with added constraints on rating uncertainty, so that new accounts match against new accounts and old accounts match against old accounts, while keeping Elo for PS scoreboards and post-battle rating feedback.
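To illustrate the idea (not to prescribe an implementation), here is a rough sketch comparing an Elo-range check with a Glicko-1 check that also constrains deviation. Every threshold below (max_gap, max_rd_gap) is a made-up number for illustration, and the Player class and function names are hypothetical, not PS code.

```python
from dataclasses import dataclass


@dataclass
class Player:
    elo: float
    glicko: float
    rd: float  # Glicko-1 deviation


def elo_match(a: Player, b: Player, max_gap: float = 100.0) -> bool:
    """Matchmaking by Elo difference only."""
    return abs(a.elo - b.elo) <= max_gap


def glicko_match(a: Player, b: Player, max_gap: float = 150.0,
                 max_rd_gap: float = 30.0) -> bool:
    """Proposed idea: similar Glicko-1 score AND similar uncertainty."""
    return (abs(a.glicko - b.glicko) <= max_gap
            and abs(a.rd - b.rd) <= max_rd_gap)


player_a = Player(elo=1000, glicko=1500, rd=130)  # fresh account
player_b = Player(elo=1000, glicko=1066, rd=83)   # established low-ladder player
fresh_alt = Player(elo=1000, glicko=1500, rd=130)

print(elo_match(player_a, player_b))      # True: both sit at the floor
print(glicko_match(player_a, player_b))   # False: 434 points and 47 RD apart
print(glicko_match(player_a, fresh_alt))  # True: new account vs new account
```

Real matchmaking would presumably widen these ranges the longer someone waits in the queue; that part is left out here.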