Sunday 14 February 2021

Puzzle 356: Arrow Sudoku

Well, I don't see why everyone else should be having fun with arrow sudoku and leave me feeling left out.  Who knows?  Maybe I'll make a few more.  

A.K.A. you need a spare arrow somewhere to break the symmetry.  Enjoy!
    #356 Arrow Sudoku – rated 6/10 [Hard]
All puzzles © Tom Collyer 2009-20.

Friday 5 February 2021

Puzzle 355: Arrow Sudoku

Look at me - I can make arrow sudoku too! Although perhaps not as difficult as the modern taste demands.  I thought this was fun anyway.  

In case you haven't seen arrow sudoku before, numbers placed in cells with circles must equal the sum of numbers placed along their corresponding arrows.  Numbers placed on arrows may repeat.  Enjoy!
    #355 Arrow Sudoku – rated 4/10 [Medium]
All puzzles © Tom Collyer 2009-21.

Puzzle Rating Systems 2: Croco and Puzzle Duel

I'm going to continue my discussion about rating systems by exploring a couple of examples from the puzzling world.  I will apologise at this point for not setting up something like MathJax on the blog and making the formulae look a bit nicer.  Maybe I'll do that later.

The first example is from a website that sadly isn't running any more: croco puzzle.  This was a German puzzle website featuring daily puzzle(s) which placed you on a daily leaderboard, and then also on an overall leaderboard, functioning as a rating system. I won't go into the full details of the mechanics here, as thankfully a record of the mechanics has been lovingly documented by Chris Dickson here:

The key features of the rating system have much in common with Elo.  You start off with a current rating, you have a performance event, and from that you can calculate an updated rating by starting with the current rating and somehow adjusting it by the difference between your current rating and what the rating of the performance implied.  There is something pleasingly Bayesian about the whole state of affairs.

On croco puzzle, you started off with a rating of 0.  The evaluation of performance is given by assessing your solving time of a particular puzzle vs. the median solving time of the entire population of the solving population.  Specifically:
Performance = 3000 * 1 / 2 ^ [(Your Time - The Best Time) / (Median Time - The Best Time)]
The update mechanism is simpler, and occurred every day:
New rating = Current Rating + (Performance Rating - Current Rating) / 60
[N.B. these formulas weren't completely consistent over time; there was some experimentation varying the parameters set here at 2 and 60 over different periods.  However I don't want to overcomplicate the exposition.]

Before talking about the dynamics, it's worth noting that performance evaluation is entirely dependent on the ability of the crowd on the day.  This aggregate ability might change over time because a different number of individuals might be playing on any two given days.  The aggregate ability might also change even if the individuals making up the population remains the same, as their overall abilities improve (or decline).  This means that the performance benchmark to which all performances are being measured are likely to be stable from day to day, but perhaps not from year to year.  Indeed, I recall some analysis provided on the website demonstrating that ratings didn't really seem to be comparable over the longer term (screenshot dug up from

The rating system shares some of the limitations I noted with Elo, particularly with regards to the dynamics.  It took a great deal of time for a rating to converge to a sensible level, and the higher your rating, the more unstable it became, particularly past the median mark of 1500.  The instability compared to Elo was somewhat asymmetric - good performances were very much a case of diminishing returns from the update mechanism, but bad performances ended up being very costly.  To what extent this was an incentive to play is probably debatable, but there were certainly examples of individuals cherry-picking the puzzles they played.

In fairness the croco system was designed to be slow to build up a sense of a long term achievement.  What I found interesting is that alongside rating, a daily "Tagesrating" was published, giving a more complete picture of recent form, as opposed to long term ability.  This is potentially where the use of the best time as a defining parameter might skew the daily evaluation, depending on who exactly got the best time, and how extreme it was in comparison to everyone else.  I shouldn't be surprised that Thomas Snyder had some thoughts on this at the time.

Puzzle Duel is a puzzle website which runs using a similar idea, although uses slightly different mechanic.

The explanation given is here:, although I'll also leave a screenshot to record things as well.

Again the starting point is a rating of 0.  The performance evaluation is there in the formula, but I'm not sure I've understood absolutely correctly the update mechanism.  I think it's the case that the ratings as per the leaderboard are calculated only weekly, despite there being a daily puzzle to solve.  That is to say the update mechanism runs in aggregate:
End of Week rating = Start of Week Rating + [Sum (Performance Rating - Start of Week Rating) / 7] ^ 0.66
Certainly you can see that individual points are calculated for each user every day, which I  assume are calculated using a single point sum, rather than all 7:

When it comes to the dynamics of the puzzle duel rating, the same observations apply: as performance is benchmarked specifically to the median solving time, we are dealing with something that is likely to be stable from day to day, but perhaps not year to year.  That said, there is no dependence on the best time any more; instead every shift is measured relative to multiples of the median.  The update mechanic implies that 1500 means that on average you are 2x as slow as the median solving time, and 2500 means that you are 2x as fast; similarly 1000 means you are 4x as slow and 3000 means you are 4x as fast.  

In practice I think this will end up bunching the majority of ratings quite tightly around the 2000 mark long term, which is possibly not good news for an accurate ranking system.  For now, individual ratings are still generally converging to their level - in particular they are mostly still increasing and I suppose this means that everyone is getting a good feeling when they come to look at their rating.

The equivalent weekly performance ratings are also viewable as a time series alongside the overall rating:

Perhaps time will tell, but it may well be the case that a bad performance (particularly if you have a submission error) is likely to be as asymmetrically punishing as it was for croco.  Puzzle Duel says that there is an initial bonus given to newcomers - but I'm not sure how effective that has been in getting people more towards their level.  Perhaps that's not the intention.

I'll conclude with a germ of a thought that I might shape the next post around.  I think one area to possibly consider is the asymmetry of the update mechanism, depending on where you currently sit overall.  I'd like to see a little more subtlety in the update mechanism, where recent form can more dynamically amplify or dampen down the magnitude of the update.  The desired dynamic would be to quickly accelerate an individual to a stable rating, and then leave them there if recent form stabilises.  In that way one bad performance in an otherwise stable run of form might not require several good performances to make up the lost ground, and one good performance isn't going to change everything overnight either.  On the other hand, should recent form start consistently moving away from the rating as part of a genuine shift in ability, then the update could become more sensitive and quickly re-converge.  More on that next time.

Tuesday 2 February 2021

Puzzle Rating Systems part 1: Elo and Credit Scores

The subject of rating systems is something that has been on my mind for many years, and I'd like to spend a bit of time blogging about my thoughts, to see where I end up as much as anything else.  Hopefully I'll get a couple of useful comments along the way from some of my dearest readers.  I've come across a few different rating systems in the puzzle solving context, and indeed beyond that, and whilst some of them have their pros and cons none of them seems to excite me particularly.  I think I could eventually do better!

I think my first exposure to rating systems was in the context of online matchmaking for the games Age of Empires 2 and Age of Mythology, which I used to play a lot before the days that sudoku really was a thing.  (I am mildly tickled to have discovered recently that Age of Empires 2 is still thriving, and indeed flourishing!).  Anyhow, the idea was there was a ladder based on the Elo system, with individuals gaining or losing points after a match depending on (1) their previous rating and (2) the difference in rating between them and their opponent that they either won or lost to.  The system more or less worked to get games between players of equal skills, but as the basis for a rating system it did seem to leave a lot to be desired.

[Incidentally, the same Elo system is used to provide the basis of Chess rankings.]

The limitations that I remember thinking about Elo at the time are:
  • Individual ratings didn't seem to be very stable.
  • Ratings weren't particularly comparable over time, and in particular seemed to be dependent on the total number of active players.
  • Individuals could manipulate their own ratings by attempting to cherrypicking their opponents.
  • In some cases Elo produced an incentive to an individual not to play at all if they believed they had maximised their rating.
  • It took a certain amount of time for Elo to converge on a reasonable rating for a player - this certain led to some frustration when a good player would "smurf" on a new name and beat up on less good players as they rose through the ranks.
I'm sure there is no shortage of literature out there on the Elo system in regards to some of these limitations.  Maybe I'll talk about some of that in further posts.

More recently, I think a lot about rating systems in my job where rating systems are used to estimate credit risk.  I suppose many people will have heard of credit scores, where the higher score you have, the lower credit risk you are supposed to have.  The general idea is to produce a good discrimination in your scoring system, so that when you line up your population in score-order and observe what happens, then most of the credit that went bad should be at the low score end and almost none of it should be at the high score end.  The extent to which this is achieved tends to get measured by the Gini coefficient, or by the area under the receiver operator characteristic ("ROC") curve.

[Incidentally, this is also something that you might have heard about in relation to "machine learning".  I find it very amusing that people in credit risk have been doing "data science", "artificial intelligence" and "machine learning" for decades and decades; practitioners have preferred to call this "modelling", or maybe sometimes "statistical modelling".]

Once you have a good rank order of your population, there is a secondary task to calibrate the ranking to an overall risk level for the population.  It's no good if your system implies that 5% of your population is going to go bad, to then observe that actually 10% subsequently did go bad, even if your rank ordering held up OK.  (Even a difference between 2% implied and, say, 2.2% observed is a BIG deal in credit risk - you are out by a factor of 20% and this can end up being worth 10s of £millions).  Looking at rating systems from this point of view suggests that it might be good idea if you can provide a link between your rankings and some kind of probability.

Comparing Elo and credit ratings isn't exactly like for like, but in theory there are some probabilistic statements about Elo that can be made.  For example, if two players have the same rating, then the probability of winning should be 50/50.  But if the gap between the players is large then this might correspondingly by 25/75 or 5/95 or so on.  This is where there is more to the comparison than meets the eye, as both examples make use of the logistic curve.

I'm going to wrap things up here and save further detail for future posts.  I think the point with all of this is that I'm not convinced you can be ever be sure that Elo ratings are accurate in the first place, and this has to do with the score update mechanism.  Elo assumed that player's skill levels could be represented by a normal random variable, whose mean slowly varied over time.  However, the issue here is with the score update mechanism, which at times isn't sensitive enough (in getting new players up to speed with a suitable rating) and at others is too sensitive (leading to erratic increases and decreases in rating), and overall doesn't seem to do a great job in tracking how that mean might vary over time (and indeed says nothing about the variance around that mean!).  That said, overall Elo roughly does gets you to the right place, eventually, but it seems to me that there might be room for improvement.

Contact Form


Email *

Message *