
Proposal Rating uncertainty in PS ladder matches (aka Glicko-1 against the smurfing problem)

Hello, everyone.

As explored in previous Policy Review threads, Pokémon Showdown's ladder has a problem where too many experienced people create fresh accounts. This misleads the ladder system into thinking they are actually bad players, so it matches them against actual bad players.

This obviously results in a terrible experience for the actual newbie. What about the good player? Well, in many games, good or averagely-skilled players smurf for personal reasons, e.g. because they want to beat people they know they are better than. However, PS has a systemic smurfing problem, in that many Smogon activities straight up require veterans to create new accounts. And these veterans do not have the best time playing on the low ladder.

So, there are two factors involved:
1. The PS ladder system
2. Systems defined in this forum for tiering and tournament purposes, which partially interact with (1). These include suspect tests and OLT.

This thread intends to focus on the first factor. So, how does the ladder work? Quoting from the Pokémon Showdown website:

How the ladder works

Our ladder displays four ratings: Elo, GXE, Glicko-1, and COIL.

Elo is the main ladder rating. It's a pretty normal ladder rating: goes up when you win and down when you lose.

GXE (Glicko X-Act Estimate) is an estimate of your win chance against an average ladder player.

Glicko-1 is a different rating system. It has rating and deviation values.

COIL (Converging Order Invariant Ladder) is mainly used for suspect tests. It goes up as you play games, but not too many games.

Note that win/loss should not be used to estimate skill, since who you play against is much more important than how many times you win or lose. Our other stats like Elo and GXE are much better for estimating skill.

A very streamlined summary. Some takeaways and/or complementary information:
- Elo is a single number. Every player starts at 1000 Elo. If you win it's fine, you get to ascend the ladder. If you lose, you lose points, unless you are already at 1000 Elo. It cannot go below 1000. This is what is called a rating floor - not contemplated in the original design of Elo for Chess.
- Glicko-1 is actually two numbers. Every player starts at 1500±130 Glicko. The first number is your score. The second number ("uncertainty" or "deviation") means how doubtful the system is that you have already reached your skill level.
- Glicko-1's uncertainty decreases the more you play and increases with time away from the ladder. When the system is certain that you have reached your skill level, it becomes harder to either increase or decrease your score. However, once enough time has passed, your Glicko-1 uncertainty increases automatically, with no action needed.
- GXE is a single number computed from both Glicko-1 numbers. It ranges from 0% to 100%, so it's useful to measure a player's overall performance without context.
- COIL is a single number computed from your GXE and your matches played. It increases with both, but your GXE mathematically limits how high your COIL can be. That is, no matter how many games someone plays, if their GXE is not good enough, they will never be able to reach a certain COIL goal set in a suspect test.
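
To make the Glicko-1 bullets above a bit more concrete, here is a minimal sketch of a one-game Glicko-1 update, following the standard published Glicko-1 formulas. It is not taken from PS's code: treating every single game as its own rating period is an assumption, and the `c` and `max_rd` values in `inflate` are illustrative guesses (130 is simply the starting deviation quoted above).

```python
import math

Q = math.log(10) / 400  # Glicko-1 scaling constant


def g(rd: float) -> float:
    """Discount applied to a result because the opponent's rating is uncertain."""
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)


def update(r: float, rd: float, opp_r: float, opp_rd: float, score: float):
    """One-game Glicko-1 update. score is 1 for a win, 0.5 for a draw, 0 for a loss."""
    expected = 1 / (1 + 10 ** (-g(opp_rd) * (r - opp_r) / 400))
    d2_inverse = Q**2 * g(opp_rd) ** 2 * expected * (1 - expected)
    precision = 1 / rd**2 + d2_inverse
    new_r = r + (Q / precision) * g(opp_rd) * (score - expected)
    new_rd = math.sqrt(1 / precision)
    return new_r, new_rd


def inflate(rd: float, periods: float, c: float = 10.0, max_rd: float = 130.0) -> float:
    """Deviation growth while inactive. c and max_rd are assumed values, not PS's."""
    return min(math.sqrt(rd**2 + c**2 * periods), max_rd)


# Same win against the same opponent, but different starting uncertainty.
fresh = update(1500, 130, 1500, 130, 1)    # brand new account
settled = update(1500, 50, 1500, 130, 1)   # account the system is already sure about
print(fresh, settled)
print(inflate(50, periods=30))             # deviation creeps back up with inactivity
```

The fresh account gains far more from the same win than the settled one, and both deviations shrink after the game, which is the behavior the bullets describe.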

- Pokémon Showdown's ladder only cares about Elo.
- Glicko-1 is the mathematical base of GXE, which is itself the competitive part of COIL, which is used for getting suspect requirements.
- Glicko-1's mathematical base is very different and independent from Elo's.


So, historically, Smogon has been more and more about Glicko-1 and less about Elo. Back in 2009, X-Act even suggested using GXE for Smogon leaderboards.
On the other hand, PS originally used ACRE, the rating system criticized by X-Act when he introduced GXE. ACRE was also a single number computed from Glicko-1. However, it required a huge grind in order to rise on the ladder.

Since ACRE fell out of favor, it got replaced as PS's primary rating system. But not by GXE, as X-Act proposed. It turns out that GXE is an unfamiliar laddering system for the uninitiated (it's very simple, though! who doesn't know percentages?). Players want to feel good when they see their rating rise from 1200 to 1300. What's the meaning of rising from 55.3% to 55.4%? "Wait, does that mean that 44.6% of people are better than me? I suck at this." What about straight up using Glicko-1? Well, unless they come from an engineering or science background, most players will take fright at the sight of that ± sign. Purists will also not allow the Glicko-1 score to be divorced from its deviation. COIL? Not invented yet, and that's not its purpose anyway.

In general, there are two types of ladder systems: skill-based systems and progression-based systems. SBMM (skill-based matchmaking) purports to maximize satisfaction for all players involved by letting them play fair matches. Original Elo is a purely skill-based system and, in principle, Elo is the rating system everyone uses. However, everyone also dislikes its simplicity and mods it in all sorts of ways.

Therefore, PS has modded it with two extra "features": a rating floor and a variable K-factor. However, both of these are progression-oriented. Now, it turns out you can't have your cake and eat it too, so these Elo distortions severely harm the experience of players in the 1000 Elo range. For example, take a look at the following two anonymous but real players:

- Player A: 1000 Elo, 1500±130 (provisional) Glicko-1, and undefined GXE
- Player B: 1000 Elo, 1066±83 Glicko-1, 9.6% GXE.

If you understood anything from the above rating system explanations, you will know that these two players are in fact made of entirely different materials. However, the current PS matchmaking system will happily match Player A against Player B, no matter whether Player A is the world champion on a new alt or just a very average player. These matches are bad not only for the actually bad player, but also for veterans who create new accounts for suspect tests, OLT, testing new teams, or whatever.

If you got down here, you will notice that I have bolded some phrases. One of them was primary rating system. What does this mean? Well, you obviously can go to the ladder page or type /rank and you will see more than one rating system. So Elo is the most used, and the rest are just for show, right? Or the rest are for players to keep track of them in their own Excel spreadsheet, and hopefully report for suspect voting? Well, no. In fact, COIL is also being tracked, but only for tiers running suspect tests. Ok, so what exactly makes Elo the primary rating system in Pokémon Showdown?

1. Ladder scoreboards are sorted by Elo. You are first place in the ladder if your Elo is the highest, not if your GXE or COIL is highest.
2. As soon as you win a battle, your Elo updates and you get instant feedback of your progression. Instant feedback is good, right?
3. When clicking Look for a Battle, you get matched with any player who is already looking for a battle and is within a certain Elo difference, the so-called matchmaking range or MMR.

Ladder scoreboards being Elo-sorted and PS's Elo implementing an inactivity decay factor are two sides of the same coin, and both are working entirely as intended.

However, since Pokémon Showdown's implementation of Elo is not purely skill-based, the matchmaking process is distorted, resulting in cases like the above among 1000 Eloers.

This finally leads to the central proposal of this thread: can these 3 components of Pokémon Showdown's rating be split, so that we keep Elo exclusively for its good parts, and use another of our rating systems for skill-based matchmaking? Games such as Overwatch, League of Legends, Rocket League, Dota 2 and CS:GO already do this. I'll be clear. In contrast to the first thread I linked, which I highly suggest reading, this is not a technical question. This is a policy proposal.

Let's switch PS ladder matchmaking to Glicko-1, with added constraints for rating uncertainty, so that new accounts match against new accounts, and old accounts match against old accounts, while also keeping Elo for PS scoreboards and battle score updates feedback.
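
As a rough illustration of what such a pairing rule could look like, here is a minimal sketch. Nothing here reflects PS's actual matchmaking code: the `Player` class, `can_match`, and both gap constants are hypothetical names and numbers chosen only for the example. Player A and Player B use the ratings quoted earlier in this post.

```python
from dataclasses import dataclass


@dataclass
class Player:
    name: str
    rating: float     # Glicko-1 rating
    deviation: float  # Glicko-1 rating deviation (uncertainty)


# Hypothetical limits, chosen only for illustration.
MAX_RATING_GAP = 150.0
MAX_DEVIATION_GAP = 30.0


def can_match(a: Player, b: Player) -> bool:
    """Pair two players only if both rating and uncertainty are close."""
    return (
        abs(a.rating - b.rating) <= MAX_RATING_GAP
        and abs(a.deviation - b.deviation) <= MAX_DEVIATION_GAP
    )


player_a = Player("Player A", 1500, 130)               # fresh account
player_b = Player("Player B", 1066, 83)                # established low-ladder account
other_new = Player("another fresh account", 1500, 130)
established = Player("established average player", 1510, 40)

print(can_match(player_a, player_b))     # False: rating and deviation both too far apart
print(can_match(player_a, other_new))    # True: new accounts meet new accounts
print(can_match(player_a, established))  # False: similar rating, very different uncertainty
```

The last case is the deliberate design choice in this proposal: a fresh account is kept away from established accounts even at the same rating until its deviation has come down.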
 
I do not understand most of the words here, but if you're able to implement this and it helps address suspect and ladder tour alts beating up on Little Timmy for 15 games while still preserving GXE / COIL / Elo for their current purposes, I am on board.
 
Pairing taking uncertainty into account makes sense to me. This sounds like it would primarily affect people who aren't that good, but also have a persistent ladder presence on a single name. I don't know how many people that is, so it's hard to tell how much of an effect this would have.
 
This sounds like it would primarily affect people who aren't that good
It also helps suspect / ladder tour players get to players of their own skill level faster, and hopefully makes grinding reqs less painful.
 
It also helps suspect / ladder tour players get to players of their own skill level faster, and hopefully makes grinding reqs less painful.
if i understand the proposal correctly, wouldn't this mean they're actually facing more brand new alts, though, if players of a similar uncertainty are paired? which in a suspect / ladder tour environment tend to be much better than other 1000s (or 1500 glickos idk) with low uncertainty, which means it would actually be HARDER to break even at the start of a laddering session on a new account (since you'd be facing much better players but with low ratings)?
 
yeah the primary consequence of this seems to be that when a bunch of people go make new alts at the same time for a suspect or OLT, they become dramatically more likely to play other people who are there doing the same thing, rather than playing established, but bad accounts in the low ratings. This is a good thing in principle (but definitely worth noting that TLs may need to lower reqs since there will be a lot more early losses and it will be harder to get a new account off the ground), but I also think it has the potential to overcorrect considerably.

Suppose someone just so happens to be joining and starting to ladder at the same time as a suspect or OLT... they now become dramatically more likely to get matched against a bunch of experienced players who are on new accounts, rather than against bad players. I could see arguments that this is a "good thing" too, since exposing new players to terrible players/teams in the low ladder is not ideal, but I would lean more towards saying that a new player losing their first X games to people who know what they are doing and are using good teams rather than having a shot at winning against bad players is probably even worse. I would think that ideally you want new accounts to play somewhat more of a mixture of new accounts and old bad accounts and not preferentially match with new accounts.

Another alternative for the smurfing problem (though it doesn't change how matchups occur, and maybe we should be trying to change that) is to basically just not count the rating impact of games between "smurf" accounts with OLT/suspect test prefixes and accounts without the prefix, or to refund/reset those changes after the fact. That's basically what chess.com does when titled players speedrun on smurfed accounts - speedrun accounts are pre-approved and then rating points are refunded to the unsuspecting players who lost to the smurfing grandmaster along the way.

Or, even more simply, you could just have a separate ladder. If you want people with OLT accounts to be playing against new accounts (i.e. other people with OLT accounts) just make a new OLT ladder and require that only accounts with the OLT prefix can play on that ladder. No smurfing problem anymore since pre-existing bad players won't be there anymore in the low ladder. I doubt having more ladders is ideal either, but just a thought!
 
To me there are two main arguments here. Firstly, the proposal of a system where players stuck around 1000 Elo don't constantly have to play against smurfs, and secondly an argument against Elo as a measure of skill. To me, the first seems reasonable, the second much less so.

As for the first, it is achievable using a much less drastic measure than matchmaking based on Glicko: matchmaking partially based on Glicko at the low end of the ladder. The problem is almost entirely solved by having players below, say, 1200 Elo only match with people within, say, a difference of 100 Glicko. Yes, there are still smurfs above that level, but the density decreases drastically. Every new smurf starts at 1000, and there's also a not-insignificant number of players who don't play on registered accounts and simply load a battle occasionally on a keyboard-mash name. An entirely new player would still face smurfs, but as far as the system is concerned both are "new accounts", so there's nothing to separate them. Of course the numbers in this proposal are arbitrary, and as in all matchmaking the restrictions would have to loosen over time to find a match on inactive ladders.
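
A minimal sketch of that rule, assuming the arbitrary 1200 Elo and 100 Glicko numbers above. The base Elo range and the rate at which restrictions loosen while waiting are made-up placeholders, and `Player` and `can_match` are hypothetical names, not anything from PS's code.

```python
from dataclasses import dataclass


@dataclass
class Player:
    elo: float
    glicko: float


LOW_LADDER_ELO = 1200         # below this, the extra Glicko check applies
LOW_LADDER_GLICKO_GAP = 100   # allowed Glicko difference on the low ladder
ELO_RANGE = 100               # placeholder for the normal Elo matchmaking range
WIDEN_PER_SECOND = 2.0        # placeholder: how fast restrictions loosen while waiting


def can_match(a: Player, b: Player, seconds_waiting: float = 0.0) -> bool:
    slack = WIDEN_PER_SECOND * seconds_waiting
    # Normal Elo-based matchmaking still applies everywhere.
    if abs(a.elo - b.elo) > ELO_RANGE + slack:
        return False
    # Extra Glicko constraint only when at least one player is on the low ladder.
    if min(a.elo, b.elo) < LOW_LADDER_ELO:
        return abs(a.glicko - b.glicko) <= LOW_LADDER_GLICKO_GAP + slack
    return True


fresh_alt = Player(elo=1000, glicko=1500)
stuck_at_floor = Player(elo=1000, glicko=1066)

print(can_match(fresh_alt, stuck_at_floor))                       # False
print(can_match(fresh_alt, stuck_at_floor, seconds_waiting=200))  # True once loosened
print(can_match(Player(1500, 1700), Player(1550, 1900)))          # True: above 1200, Elo only
```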

As for the second, the impurities of Elo are vastly overstated in the OP. A ladder floor is a modification, yes, but it only affects Elo gain below 1100 Elo. The effects radiate out of course, but by 1200 the effect will be negligible, and top ladder absolutely does not notice this. Furthermore, as mentioned in the FAQ, this is needed to prevent people increasing their rating simply by making a new account.

A variable K factor is entirely unproblematic. It only affects how fast ratings change! It does not affect the mechanisms of Elo at all except for this. And indeed, the one major thing people want is for Elo to change quicker at low levels to allow smurfs and such to move out of there faster, which is exactly what a high K does.
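
For reference, the two modifications under discussion can be sketched in a few lines. The 1000 floor comes from this thread; the K schedule below is invented purely for illustration and is not PS's actual table.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo win chance of `rating` against `opponent`."""
    return 1 / (1 + 10 ** ((opponent - rating) / 400))


def k_factor(rating: float) -> float:
    """Hypothetical variable K: ratings move faster at the bottom of the ladder."""
    if rating < 1300:
        return 50
    if rating < 1600:
        return 32
    return 16


def update_elo(rating: float, opponent: float, score: float, floor: float = 1000) -> float:
    """score is 1 for a win and 0 for a loss; the result never drops below the floor."""
    change = k_factor(rating) * (score - expected_score(rating, opponent))
    return max(floor, rating + change)


print(update_elo(1000, 1000, 0))  # loss at the floor: stays at 1000
print(update_elo(1000, 1000, 1))  # win at the floor: large K, fast climb
print(update_elo(1700, 1700, 1))  # same win higher up: smaller K, smaller gain
```

The K schedule only rescales `change`, while the floor is the only place where a result is discarded outright.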

Also, to pretend that the PS implementation is unique or strange is misleading. Both rating floors and variable K values are commonplace, with for example the US Chess Federation using not just variable but individualized values for both K and floor.

The cost, then, of using Glicko for matchmaking would be serious. First of all, matchmaking is the thing that matters. If we do it by Glicko and sort the leaderboard by Elo, we're really just using Glicko and making the leaderboard useless.

All rating systems work fundamentally by calculating a winchance per player given their ratings, calculating their expected rating gain from the match using this winchance, and subtracting it from both while giving the winner some amount of points (in Elo: K). The reason they work this way is that it's self-balancing. If your winchance was lower than your rating implies you'll lose points in expectation, and vice versa. However, this does not work when the difference is great, such as when another system is used for matchmaking.
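
The self-balancing property can be checked directly with plain Elo. This is a minimal sketch using the standard Elo expected-score formula; `k=32` is an arbitrary choice and `expected_change` is just a helper name for this example.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Win chance that the ratings imply for `rating` against `opponent`."""
    return 1 / (1 + 10 ** ((opponent - rating) / 400))


def expected_change(rating: float, opponent: float, true_win_chance: float, k: float = 32) -> float:
    """Average rating change per game, given the player's real win chance."""
    predicted = expected_score(rating, opponent)
    gain_if_win = k * (1 - predicted)
    loss_if_lose = k * (0 - predicted)
    return true_win_chance * gain_if_win + (1 - true_win_chance) * loss_if_lose


implied = expected_score(1600, 1500)
print(expected_change(1600, 1500, implied))  # about 0: rating is accurate, no drift
print(expected_change(1600, 1500, 0.50))     # negative: overrated, drifts down
print(expected_change(1600, 1500, 0.80))     # positive: underrated, drifts up
```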

The self-balancing only works locally. That is, you have a certain winchance against someone 100 points lower than you and a certain winchance against someone 100 points higher, but that implies a specific value between those players who differ by 200 points which does not always work out. This is a fundamental triangle-inequality problem that is unsolvable, for any rating system. In practice it often means that these systems break down when players with vastly different ratings play.

Consider the current OU ladder. The two highest Glickos are at 2001 and 1919 respectively. These players should play each other if we matchmake by Glicko, and indeed either could snipe the other by noticing that the other is laddering (because they're quite far ahead of the competition). The player with 2001 would be the clear favorite, but both players would stand a chance. Unfortunately, the 2001 player is about 140 Elo lower than their opponent. Even if the 1919 player would get the "upset" win, they would gain almost no points. This makes it impossible for that player to get to rank 1, whereas a player with higher Elo and only 1785 Glicko (again, taken from the real ladder) would be propelled upward. Lower Glicko is rewarded.

The point of this illustration is that mixing rating systems can have serious negative consequences. It could even be used maliciously by sniping a player with high Elo/Glicko using accounts with high-ish Glicko and low Elo.

Simply using Glicko as a rating system is, mathematically, fine, but the reason Elo is used is because it's thought to be more fun (there was a good post from Zarel about this but I cannot seem to find it right now). And indeed, apart from the problem at low Elo, I think there's only very minor problems with the current system.

So, I propose tackling the low Elo problem as best as possible, without affecting the entirety of the ladder.

Also, without a floor this kind of trolling can get real impactful and annoying real fast:
[Attached image: 1733870324137.png]
 
This entire scheme is ill-advised. 1000 elo players being curbstomped by good players whenever they make fresh alts is not harmful and doesn’t really matter. If someone’s been stuck at 1000 long enough to have a low deviation then the odds are that they don’t actually care about winning and aren’t improving anyway. If they’re complaining about dealing with smurfs, that’s entirely on them for not taking the time to seek out sample teams. If they have a high deviation because they’re actually new then there’s no way to tell them apart from an alt of an experienced player rendering this entire proposed system moot. Besides, the K-values at the bottom of the ladder are so high that a 1000 elo player can easily make back whatever they lose to the rare alt.
I don’t know why this is being taken seriously because the only real effect it’s going to have if implemented is going to be to disincentivize laddering early in ladder tours or suspect tests to avoid having to deal with the disproportionately large number of good players that one would now have to deal with in low ladder.

tl;dr if a player is sitting at 1000 elo then they likely aren’t taking the game seriously at all and trying to improve their experience at the cost of messing with competitive events like ladder tours and suspects is a fool’s errand.
 
I've seen so many threads and proposals on this topic with none of them addressing the key issue which is the rating floor currently set at 1000 ELO. All fresh alts starting at 1000 is fine but what people keep bringing up is that people who are hard stuck at 1000 are getting stomped by people doing ladder tours and suspect reqs. The core problem I see is that these bad players are not allowed to fall below the 1000 ELO mark and are having their skill arbitrarily inflated by the ELO floor, for people who are already going very negative vs other 1000-1100 rated players they should not be forced to stay in that same bracket and should instead be allowed to drop below 1000 so that they can fight players at their own skill level.

Chess.com has already implemented this system and it works wonderfully, going even further by adding the option to select from beginner, intermediate or advanced when creating a fresh account that all place you at different starting ELOs. I personally see no problem with allowing ELO ratings to drop lower than what you are initially placed at given that ELO based systems are meant to implement this kind of functionality anyways, players who do not deserve a 1000 rating should not be forced to play there, removing the floor would allow these players to find fair competition at lower ratings while also entirely solving the problem of having to fight fresh alts of good players since someone with a true ELO of lets say 500-700 likely wouldn't be matching up against the fresh alts starting at 1000.

TL;DR ELO is not the problem with the current system but rather the current implementation of an ELO floor is
 
Can someone explain to me why the 1000 ELO floor was implemented? IIRC this wasn't a thing before 2014(?) and it sorta appeared randomly (or at least it felt random from my outside PoV at the time) and I didn't really understand why.

If low-skill players who persist on an alt are getting paired against smurfs/new alts and having their days ruined constantly (I think newbies tend to be more willing to just play persistently on the same low-rated alt anyway… anecdotally I can say both myself and my friends did this), wouldn't allowing people to drop below the starting value solve the problem by getting them to where they need to be for matchmaking after their first few games as opposed to artificially keeping them in vision of smurfs on new alts? Only issue with this is that a bad player could stay down for a while, get better, start climbing, and then hit a skill spike when they get back into vision of new alts again, so it could be viewed as pushing the problem somewhere else.

Alternatively, you could implement a placement match system like what you have in a lot of fighting games that can place you anywhere from 1000–1500 (or whatever range is deemed appropriate) depending on how you do in your first 5 games or whatever (maybe when Glicko deviation first drops below some threshold of significance?), but I don't think they have ever worked particularly well at least in the games that I have played and don't really know what they achieve that a proper uncapped ELO system doesn't already. I've always been kinda unimpressed with where I've been placed in other games, but it would sidestep this issue of having a glut of new alts on one rating and forcing everyone to deal with them no matter how strong/weak they are by keeping new alts in a partially-restricted matchmaking pool or whatever.
 
From what I can gather the floor was implemented to discourage ladder resets https://www.smogon.com/forums/threads/ladder-decay.3565362/post-6677523

You can also see glimpses of this in the W/L thread here https://www.smogon.com/forums/threads/ladder-and-rating-system-policy.3560096/

I have no horse in this race. Most of the people behind the current implementation are gone now, so an actual explanation is going to be difficult to come by.
 