I think my first exposure to rating systems was in the context of online matchmaking for the games Age of Empires 2 and Age of Mythology, which I used to play a lot before the days when sudoku really was a thing. (I am mildly tickled to have discovered recently that Age of Empires 2 is still thriving, and indeed flourishing!). Anyhow, the idea was that there was a ladder based on the Elo system, with individuals gaining or losing points after a match depending on (1) their previous rating and (2) the difference in rating between them and the opponent they won or lost against. The system more or less worked for setting up games between players of equal skill, but as the basis for a rating system it did seem to leave a lot to be desired.
[Incidentally, the same Elo system provides the basis for chess rankings.]
The limitations that I remember thinking about Elo at the time are:
- Individual ratings didn't seem to be very stable.
- Ratings weren't particularly comparable over time, and in particular seemed to be dependent on the total number of active players.
- Individuals could manipulate their own ratings by cherry-picking their opponents.
- In some cases Elo gave an individual an incentive not to play at all if they believed they had maximised their rating.
- It took a certain amount of time for Elo to converge on a reasonable rating for a player - this certainly led to some frustration when a good player would "smurf" under a new name and beat up on less good players as they rose through the ranks.
I'm sure there is no shortage of literature out there on the Elo system with regard to some of these limitations. Maybe I'll talk about some of that in future posts.
More recently, I think a lot about rating systems in my job, where they are used to estimate credit risk. I suppose many people will have heard of credit scores, where the higher your score, the lower your credit risk is supposed to be. The general idea is to produce good discrimination in your scoring system, so that when you line up your population in score order and observe what happens, most of the credit that went bad should be at the low-score end and almost none of it at the high-score end. The extent to which this is achieved tends to be measured by the Gini coefficient, or by the area under the receiver operating characteristic ("ROC") curve.
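To make the discrimination idea concrete, here is a minimal sketch (with hypothetical names and a toy portfolio) of computing the AUC directly from its probabilistic interpretation - the chance that a randomly chosen bad account scores below a randomly chosen good one - and the Gini coefficient as a simple rescaling of it:

```python
# Sketch only: `scores` are credit scores, `bads` flag accounts that went bad.
# AUC = probability a random bad account scores below a random good account.
def auc_and_gini(scores, bads):
    good_scores = [s for s, b in zip(scores, bads) if not b]
    bad_scores = [s for s, b in zip(scores, bads) if b]
    wins = 0.0
    for g in good_scores:
        for b in bad_scores:
            if b < g:        # bad scored below good: correctly ordered pair
                wins += 1
            elif b == g:     # ties count half
                wins += 0.5
    auc = wins / (len(good_scores) * len(bad_scores))
    gini = 2 * auc - 1       # Gini is a linear rescaling of AUC
    return auc, gini
```

With perfect separation - every bad account scoring below every good one - `auc_and_gini([600, 650, 700, 720], [1, 1, 0, 0])` gives an AUC of 1.0 and a Gini of 1.0, while a random ordering would sit around 0.5 and 0.0 respectively.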
[Incidentally, this is also something that you might have heard about in relation to "machine learning". I find it very amusing that people in credit risk have been doing "data science", "artificial intelligence" and "machine learning" for decades and decades; practitioners have preferred to call this "modelling", or maybe sometimes "statistical modelling".]
Once you have a good rank order of your population, there is a secondary task: calibrating the ranking to an overall risk level for the population. It's no good if your system implies that 5% of your population is going to go bad, only to then observe that 10% actually did, even if your rank ordering held up OK. (Even a difference between 2% implied and, say, 2.2% observed is a BIG deal in credit risk - you are out by 10% in relative terms, and that can end up being worth tens of £millions.) Looking at rating systems from this point of view suggests it might be a good idea if you can provide a link between your rankings and some kind of probability.
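A basic calibration check along these lines can be sketched as follows (the function name and inputs are hypothetical: per-account predicted probabilities of going bad, and the observed outcomes):

```python
# Sketch only: compare the portfolio-level bad rate implied by the model
# with the bad rate that was actually observed.
def calibration_gap(predicted_pds, bads):
    implied = sum(predicted_pds) / len(predicted_pds)  # model-implied bad rate
    observed = sum(bads) / len(bads)                   # what actually happened
    relative_error = (observed - implied) / implied    # how far out, in relative terms
    return implied, observed, relative_error
```

For the example in the text - 2% implied against 2.2% observed, say over 1,000 accounts - this returns a relative error of 10%, which is exactly the kind of gap that matters at portfolio scale.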
Comparing Elo and credit ratings isn't exactly like for like, but in theory there are some probabilistic statements that can be made about Elo. For example, if two players have the same rating, then the probability of winning should be 50/50. But if the gap between the players is large, then this might correspondingly be 25/75 or 5/95 and so on. This is where there is more to the comparison than meets the eye, as both systems make use of the logistic curve.
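The standard Elo formula makes this explicit: the rating gap is run through a logistic curve in base 10 with a divisor of 400, so that a 400-point gap corresponds to roughly 10:1 odds. A minimal sketch:

```python
# Standard Elo expected score: the rating gap pushed through a logistic
# curve (base 10, divisor 400, per the conventional Elo scale).
def expected_score(rating, opponent_rating):
    return 1 / (1 + 10 ** ((opponent_rating - rating) / 400))
```

Equal ratings give exactly 0.5, as the text says, and a player rated 400 points above their opponent has an expected score of about 0.91 - close to the 5/95 end of the spectrum once the gap grows further.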
I'm going to wrap things up here and save further detail for future posts. I think the point of all of this is that I'm not convinced you can ever be sure that Elo ratings are accurate in the first place, and this has to do with the score update mechanism. Elo assumed that players' skill levels could be represented by a normal random variable whose mean slowly varied over time. The issue is that the update mechanism at times isn't sensitive enough (in getting new players up to a suitable rating) and at others is too sensitive (leading to erratic increases and decreases in rating), and overall doesn't seem to do a great job of tracking how that mean might vary over time (and indeed it says nothing about the variance around that mean!). That said, Elo does roughly get you to the right place, eventually, but it seems to me there might be room for improvement.
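For reference, the update mechanism being criticised is a one-line rule: move the rating towards the observed result by a step proportional to the surprise, scaled by a K-factor (the value 32 below is a common choice, but it is an assumption - different implementations use different K). That single K is exactly the sensitivity trade-off in question: too small and new players converge slowly, too large and established ratings jump around erratically.

```python
# Standard Elo update step. K controls sensitivity: large K tracks skill
# changes quickly but is erratic; small K is stable but slow to converge.
def elo_update(rating, opponent_rating, score, k=32):
    """score is 1 for a win, 0.5 for a draw, 0 for a loss."""
    expected = 1 / (1 + 10 ** ((opponent_rating - rating) / 400))
    return rating + k * (score - expected)
```

So two evenly matched 1500-rated players shift by exactly k/2 = 16 points on a decisive result, regardless of how uncertain either rating actually is - which is the post's point about the variance being ignored.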