I think my first exposure to rating systems was in the context of online matchmaking for the games Age of Empires 2 and Age of Mythology, which I used to play a lot before sudoku really became a thing for me. (I am mildly tickled to have discovered recently that Age of Empires 2 is still thriving, and indeed flourishing!) Anyhow, the idea was that there was a ladder based on the Elo system, with individuals gaining or losing points after a match depending on (1) their previous rating and (2) the difference in rating between them and the opponent they won or lost against. The system more or less worked to set up games between players of equal skill, but as the basis for a rating system it did seem to leave a lot to be desired.
[Incidentally, the same Elo system is used to provide the basis of Chess rankings.]
The limitations of Elo that I remember noticing at the time were:
- Individual ratings didn't seem to be very stable.
- Ratings weren't particularly comparable over time, and in particular seemed to be dependent on the total number of active players.
- Individuals could manipulate their own ratings by cherry-picking their opponents.
- In some cases Elo gave an individual an incentive not to play at all if they believed they had maximised their rating.
- It took a certain amount of time for Elo to converge on a reasonable rating for a player - this certainly led to some frustration when a good player would "smurf" on a new name and beat up on weaker players as they rose through the ranks.
I'm sure there is no shortage of literature out there on the Elo system in regards to some of these limitations. Maybe I'll talk about some of that in further posts.
More recently, I think a lot about rating systems in my job, where they are used to estimate credit risk. I suppose many people will have heard of credit scores: the higher your score, the lower your credit risk is supposed to be. The general idea is to produce good discrimination in your scoring system, so that when you line up your population in score order and observe what happens, most of the credit that went bad should be at the low-score end and almost none of it at the high-score end. The extent to which this is achieved tends to be measured by the Gini coefficient, or by the area under the receiver operating characteristic ("ROC") curve.
[Incidentally, this is also something that you might have heard about in relation to "machine learning". I find it very amusing that people in credit risk have been doing "data science", "artificial intelligence" and "machine learning" for decades and decades; practitioners have preferred to call this "modelling", or maybe sometimes "statistical modelling".]
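To make the discrimination idea concrete, here is a minimal sketch (my own illustration, not from the post) of the usual calculation: AUC computed via the rank-comparison (Mann-Whitney) identity - the probability that a randomly chosen good account outscores a randomly chosen bad one - with Gini obtained as 2 × AUC − 1. The toy scores are invented for the example.

```python
def auc(scores_bad, scores_good):
    """Probability that a randomly chosen good account outscores
    a randomly chosen bad one (ties count as half a win)."""
    wins = 0.0
    for g in scores_good:
        for b in scores_bad:
            if g > b:
                wins += 1.0
            elif g == b:
                wins += 0.5
    return wins / (len(scores_good) * len(scores_bad))

# Toy data: scores for accounts that later went bad vs stayed good.
bad = [520, 480, 610, 450]
good = [700, 650, 690, 580, 720]

a = auc(bad, good)
gini = 2 * a - 1
print(f"AUC = {a:.3f}, Gini = {gini:.3f}")  # AUC = 0.950, Gini = 0.900
```

A perfect rank ordering gives AUC = 1 (Gini = 1); a scorecard no better than coin-flipping gives AUC = 0.5 (Gini = 0).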
Once you have a good rank order of your population, there is a secondary task: calibrating the ranking to an overall risk level for the population. It's no good if your system implies that 5% of your population is going to go bad, only to then observe that 10% actually did, even if your rank ordering held up OK. (Even a difference between 2% implied and, say, 2.2% observed is a BIG deal in credit risk - you are out by 10% in relative terms, and this can end up being worth tens of £millions.) Looking at rating systems from this point of view suggests that it might be a good idea if you can provide a link between your rankings and some kind of probability.
Comparing Elo and credit ratings isn't exactly like for like, but in theory there are some probabilistic statements about Elo that can be made. For example, if two players have the same rating, then the probability of winning should be 50/50. But if the gap between the players is large then this might correspondingly be 25/75 or 5/95, and so on. This is where there is more to the comparison than meets the eye, as both systems make use of the logistic curve.
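The logistic relationship behind those win probabilities can be sketched in a few lines. In the standard Elo formulation, the expected score for player A is 1 / (1 + 10^((R_B − R_A)/400)); the 400-point scale is the conventional choice, so equal ratings give exactly 50/50 and larger gaps push the expectation towards 0 or 1.

```python
def expected_score(rating_a, rating_b, scale=400):
    """Expected score for player A against player B (logistic curve)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

print(expected_score(1500, 1500))            # equal ratings -> 0.5
print(round(expected_score(1700, 1500), 3))  # 200-point gap -> ~0.76
```

Note the symmetry: A's expected score against B and B's against A always sum to 1, which is what lets the rating exchange after a game be zero-sum.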
I'm going to wrap things up here and save further detail for future posts. I think the point with all of this is that I'm not convinced you can ever be sure that Elo ratings are accurate in the first place, and this has to do with the score update mechanism. Elo assumed that players' skill levels could be represented by a normal random variable whose mean slowly varied over time. The trouble is that the update mechanism at times isn't sensitive enough (in getting new players up to a suitable rating quickly) and at others is too sensitive (leading to erratic increases and decreases in rating), and overall doesn't seem to do a great job of tracking how that mean might vary over time (and indeed says nothing about the variance around that mean!). That said, Elo does roughly get you to the right place, eventually, but it seems to me that there might be room for improvement.
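The sensitivity trade-off above comes down to the K-factor in the standard update rule, R' = R + K·(S − E). A small sketch (my own, with invented ratings) of the same player climbing the ladder under a conservative and an aggressive K shows the tension: the small K converges slowly, the large K moves fast but reacts just as violently to every later upset.

```python
def expected(r_a, r_b):
    """Standard Elo expected score on the 400-point logistic scale."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a, r_b, score_a, k=32):
    """A's new rating after one game (score_a: 1 win, 0.5 draw, 0 loss)."""
    return r_a + k * (score_a - expected(r_a, r_b))

# An underrated player starting at the default 1500 wins ten straight
# games against 1600-rated opponents:
r = 1500.0
for _ in range(10):
    r = update(r, 1600, 1.0, k=16)   # conservative K: slow climb
print(round(r, 1))

r = 1500.0
for _ in range(10):
    r = update(r, 1600, 1.0, k=40)   # aggressive K: faster, but noisier
print(round(r, 1))
```

Chess federations work around this by varying K with experience (large for new players, small for established ones), which is itself a crude way of modelling the uncertainty that Elo otherwise ignores.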
This is a very interesting topic. I am familiar with the Elo rating system from chess; the obvious difference is that puzzles are typically not played in matches, but solved against a clock. When it comes to puzzle ratings, I have seen a few attempts (e.g. on Croco-Puzzle). None of them were perfect, so I am really excited to hear more about your ideas in future posts!
Just one fleeting thought: you mention stability in your article. Something one should always keep in mind is that these skills - chess or puzzle solving - are generally not stable (at least in my perception). I am not just talking about momentary lapses due to a loss of focus, but about actual long-term evolution. People certainly seem to get stronger or weaker at specific puzzle types, and I believe this is also the case for overall skill.
I think something similar might apply to long-term comparability. The general skill set of the community - in its entirety - might improve over the years, at least in certain areas (for example when new solving techniques are discovered and spread, or when the popularity of certain puzzle styles suddenly increases). Chess players also frequently discuss whether Elo differences between now and 50 years ago are due to inflationary effects, or because overall chess understanding has truly grown.
There's a good quote on Wikipedia attributed to Elo: apparently he once stated that the process of rating players was in any case rather approximate; he compared it to "the measurement of the position of a cork bobbing up and down on the surface of agitated water with a yard stick tied to a rope and which is swaying in the wind".
I'll have to dig up the details about the Croco system. I thought it was very good, although the very long-term nature of the ladder ratings seemed frustrating given that they were also able to produce a graph that looked much more like recent form. I think the next post I do will focus on the puzzle duel system, which seems to have much in common with Croco - that will allow me to discuss the relative advantages you have in trying to measure puzzle performance as well.