Friday 5 February 2021

Puzzle Rating Systems 2: Croco and Puzzle Duel

I'm going to continue my discussion about rating systems by exploring a couple of examples from the puzzling world.  I will apologise at this point for not setting up something like MathJax on the blog and making the formulae look a bit nicer.  Maybe I'll do that later.

The first example is from a website that sadly isn't running any more: croco puzzle.  This was a German puzzle website featuring daily puzzle(s) which placed you on a daily leaderboard, and then also on an overall leaderboard, functioning as a rating system. I won't go into the full details of the mechanics here, as thankfully a record of the mechanics has been lovingly documented by Chris Dickson here:

The key features of the rating system have much in common with Elo.  You start with a current rating, you have a performance event, and from that you calculate an updated rating by adjusting the current rating by some fraction of the difference between it and the rating the performance implied.  There is something pleasingly Bayesian about the whole state of affairs.

On croco puzzle, you started off with a rating of 0.  Performance was evaluated by comparing your solving time on a particular puzzle against the median solving time of the entire solving population.  Specifically:
Performance = 3000 / 2 ^ [(Your Time - Best Time) / (Median Time - Best Time)]
The update mechanism was simpler, and occurred every day:
New Rating = Current Rating + (Performance Rating - Current Rating) / 60
[N.B. these formulas weren't completely consistent over time; there was some experimentation varying the parameters set here at 2 and 60 over different periods.  However I don't want to overcomplicate the exposition.]
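To make the mechanics concrete, here's a quick sketch of those two formulas in code.  The function names are mine, and I've used the 2 and 60 parameters quoted above:

```python
# Sketch of the croco puzzle formulas; names and structure are mine.

def croco_performance(your_time, best_time, median_time):
    """Performance for one puzzle: 3000 at the best time, halving for
    every (median - best) interval you finish behind it."""
    exponent = (your_time - best_time) / (median_time - best_time)
    return 3000 / 2 ** exponent

def croco_update(current_rating, performance):
    """Daily update: move 1/60 of the way towards the performance."""
    return current_rating + (performance - current_rating) / 60

print(croco_performance(100, 100, 200))  # best time scores 3000.0
print(croco_performance(200, 100, 200))  # median time scores 1500.0
print(croco_update(0, 1500))             # a new solver moves to 25.0
```

The 1/60 factor is what makes convergence so slow: even a bang-on-median performance moves a fresh rating of 0 by only 25 points.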

Before talking about the dynamics, it's worth noting that performance evaluation is entirely dependent on the ability of the crowd on the day.  This aggregate ability might change over time because a different number of individuals might be playing on any two given days.  It might also change even if the individuals making up the population remain the same, as their overall abilities improve (or decline).  This means that the benchmark against which all performances are measured is likely to be stable from day to day, but perhaps not from year to year.  Indeed, I recall some analysis provided on the website demonstrating that ratings didn't really seem to be comparable over the longer term (screenshot dug up from

The rating system shares some of the limitations I noted with Elo, particularly with regards to the dynamics.  It took a great deal of time for a rating to converge to a sensible level, and the higher your rating, the more unstable it became, particularly past the median mark of 1500.  The instability compared to Elo was somewhat asymmetric - good performances were very much a case of diminishing returns from the update mechanism, but bad performances ended up being very costly.  To what extent this was an incentive to play is probably debatable, but there were certainly examples of individuals cherry-picking the puzzles they played.

In fairness, the croco system was designed to be slow, building up a sense of long-term achievement.  What I found interesting is that alongside the rating, a daily "Tagesrating" was published, giving a more complete picture of recent form, as opposed to long-term ability.  This is potentially where the use of the best time as a defining parameter might skew the daily evaluation, depending on who exactly got the best time, and how extreme it was in comparison to everyone else.  I shouldn't be surprised that Thomas Snyder had some thoughts on this at the time.

Puzzle Duel is a puzzle website which runs on a similar idea, although it uses a slightly different mechanic.

The explanation is given here, although I'll also leave a screenshot to record things as well.

Again the starting point is a rating of 0.  The performance evaluation is there in the formula, but I'm not sure I've understood the update mechanism absolutely correctly.  I think the ratings on the leaderboard are calculated only weekly, despite there being a daily puzzle to solve.  That is to say, the update mechanism runs in aggregate:
End of Week Rating = Start of Week Rating + [Sum (Performance Rating - Start of Week Rating) / 7] ^ 0.66
Certainly you can see that individual points are calculated for each user every day, which I assume use just that day's performance rather than the sum over all 7:
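Under that reading, a sketch of the weekly update might look like the following.  How a losing week interacts with the 0.66 power isn't documented anywhere I've seen, so applying the power to the magnitude and keeping the sign is purely my guess:

```python
import math

def weekly_update(start_rating, performances):
    """End-of-week rating as I read the formula: sum the daily gaps to
    the start-of-week rating, divide by 7, then damp with the 0.66
    exponent.  Sign handling for a bad week is my assumption."""
    gap = sum(p - start_rating for p in performances) / 7
    return start_rating + math.copysign(abs(gap) ** 0.66, gap)

# Seven identical performances of 2100 against a 2000 rating give a
# mean gap of 100, and 100 ** 0.66 is roughly 20.9.
print(round(weekly_update(2000, [2100] * 7), 1))  # 2020.9
```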

When it comes to the dynamics of the Puzzle Duel rating, the same observations apply: as performance is benchmarked specifically to the median solving time, we are dealing with something that is likely to be stable from day to day, but perhaps not year to year.  That said, there is no dependence on the best time any more; instead every shift is measured relative to multiples of the median.  The performance scale implies that 1500 means that on average you are 2x as slow as the median solving time, and 2500 means that you are 2x as fast; similarly 1000 means you are 4x as slow and 3000 means you are 4x as fast.
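Those anchor points correspond to a logarithmic scale centred on 2000.  To be clear, this reconstruction is my inference from the quoted multiples, not the site's published formula:

```python
import math

def implied_performance(your_time, median_time):
    """My reconstruction: 2000 at the median, plus or minus 500 points
    per doubling of speed relative to it."""
    return 2000 + 500 * math.log2(median_time / your_time)

print(implied_performance(50, 100))   # 2x as fast -> 2500.0
print(implied_performance(400, 100))  # 4x as slow -> 1000.0
```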

In practice I think this will end up bunching the majority of ratings quite tightly around the 2000 mark long term, which is possibly not good news for an accurate ranking system.  For now, individual ratings are still generally converging to their level - in particular they are mostly still increasing and I suppose this means that everyone is getting a good feeling when they come to look at their rating.

The equivalent weekly performance ratings are also viewable as a time series alongside the overall rating:

Perhaps time will tell, but it may well be the case that a bad performance (particularly if you have a submission error) is likely to be as asymmetrically punishing as it was for croco.  Puzzle Duel says that there is an initial bonus given to newcomers - but I'm not sure how effective that has been in getting people more towards their level.  Perhaps that's not the intention.

I'll conclude with a germ of a thought that I might shape the next post around.  I think one area to possibly consider is the asymmetry of the update mechanism, depending on where you currently sit overall.  I'd like to see a little more subtlety in the update mechanism, where recent form can more dynamically amplify or dampen down the magnitude of the update.  The desired dynamic would be to quickly accelerate an individual to a stable rating, and then leave them there if recent form stabilises.  In that way one bad performance in an otherwise stable run of form might not require several good performances to make up the lost ground, and one good performance isn't going to change everything overnight either.  On the other hand, should recent form start consistently moving away from the rating as part of a genuine shift in ability, then the update could become more sensitive and quickly re-converge.  More on that next time.
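As a purely illustrative sketch of that germ of an idea (every name and parameter below is invented), the size of the update could depend on whether recent form consistently points away from the current rating:

```python
def adaptive_update(rating, performance, recent_gaps, base_k=1/60):
    """Toy version of a form-sensitive update.  'recent_gaps' holds
    (performance - rating) values from the last few events; if they all
    point the same way as today's gap, react faster, otherwise treat
    today as an outlier and dampen the step.  All parameters invented."""
    gap = performance - rating
    if recent_gaps and all((g > 0) == (gap > 0) for g in recent_gaps):
        k = base_k * 3   # sustained drift: converge quickly
    else:
        k = base_k / 2   # isolated result: barely move
    return rating + k * gap

# A run of good form accelerates the rise; a single bad day is damped.
print(round(adaptive_update(1500, 1600, [50, 80, 60]), 1))  # 1505.0
print(round(adaptive_update(1500, 1000, [50, 80, 60]), 1))  # 1495.8
```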

1 comment:

  1. Hi, thank you for the interesting analysis. The Puzzle Duel site's history is definitely too short and the data is still unstable.
    Two small notes:
    - the rating change numbers for individual puzzles which you can see in the floating window are "fake" - they reflect performance for each puzzle, but are calculated after the weekly change has been calculated, and are chosen to sum up to the given value.
    - IMHO, an interesting point for analysis is the quite stable weekly performance for most of the solvers (those who solve regularly). This is especially noticeable if you choose a longer time interval in the rating history (you can start from April 1st, 2020). For example, the current leader has almost all values between 2400 and 2800. For me they are between 2000 and 2400. I think this means that the rating will remain distinguishing. But I don't want the rating to be used to really compare solvers. I would prefer to consider it as an instrument which each solver can use to understand whether his ability to solve puzzles is growing or not.
