Monday, 23 February 2026

"Mistaken kudos you received — it's a puzzle (6)"

I mentioned in the previous post the LLM-generated puzzle that Cracking the Cryptic had featured - it turns out it got more mainstream attention. In the Times puzzle weekly wrap-up email, Mick Hodgkin wrote the following:
Having said in that Times piece that the cryptic crossword “could be the last holdout of human puzzling ingenuity”, I felt compelled to put it to the test once more, asking Gemini to write a cryptic clue for SUDOKU. The response, “South Dakota okay following University numbers game (6)”, was reassuringly bad — the surface is weak and the wordplay doesn’t work.

I thought I'd also try putting that to the test. I've not tried a variety of models as I had previously, but after a couple of tries I got Claude Sonnet 4.6 (not the thinking version, interestingly) to come up with the following:

"Scrambled kudos you received — it's a puzzle (6)"
  • Anagram ("scrambled") of KUDOS + U (you) = SUDOKU ✓
  • Definition: "a puzzle" ✓
  • Surface: reads naturally as "the praise you received got all mixed up" ✓
I looked at that and thought it was basically there - I decided to swap the word "scrambled" for "mistaken" - and then realised I had beautifully ironic material for a blog post.
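As a sanity check of my own (nothing the model produced), the letter arithmetic behind the clue is easy to verify - an anagram of KUDOS plus a U ("you") must use exactly the letters of SUDOKU, and swapping "scrambled" for "mistaken" doesn't touch that:

```python
# Verify the wordplay: the anagram fodder KUDOS + U ("you")
# must contain exactly the letters of the answer SUDOKU.
fodder = sorted("KUDOS" + "U")
answer = sorted("SUDOKU")
print(fodder == answer)  # True: the multisets of letters match
```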

Anyhow, here's the prompt I gave it. I'm pretty sure the clue was the model's own work rather than something looked up from somewhere, but next time I'd amend the prompt to explicitly call for originality:
I want to write a good cryptic crossword clue for “SUDOKU” (6)

A good clue satisfies the following:

- a surface that reads well and misleads on first reading
- valid wordplay
- correct definition
- no extra words that aren’t wordplay or definition.

Ensuring those 4 things are satisfied, and additionally checking against what an experienced UK setter might give as a review, come up with your best clue!

I will blog more on Thomas Snyder's observation (I'm struggling to reference that observation more clearly/directly than this) that a good parallel for AI capability can be found in the SAE J3016 driving automation levels:

0 No Automation 
1 Driver Assistance 
2 Partial Automation
3 Conditional Automation 
4 High Automation
5 Full Automation

I think this is a good way of thinking about the difference between automating tasks by specifying them well and truly autonomous intelligence. Even then, I have my doubts as to whether level 5 really means intelligence, but in the crossword clue setting context I think what I've done today only really counts as level 1. Still, it's enough for me to say that if you want to give me kudos for this clue, then I kind of feel you are mistaken. Maybe!

Thursday, 19 February 2026

Large Language Models and Sudoku

I am being somewhat of a stick-in-the-mud for this post by refusing to use the terms AI or Artificial Intelligence.

So the reason for making this post is that I saw a Cracking the Cryptic video earlier this week featuring a puzzle created as the result of a one-shot prompt to a large language model (in this case Claude Opus 4.6). This got me thinking: I'm generally aware that the underlying base models have had all sorts of capability for a while - after all, once you've trained on the entire internet there isn't really anywhere else to go - and that recent progress is more about the various thinking and fine-tuning that you layer on top of that.

In the video Simon made a rather bold claim regarding the "Cracking the Cryptic" test - i.e. that he'd pretty much be able to tell the difference between something produced entirely by electricity and silicon and something made with more thoughtful human creativity (even allowing for software assistance).

With that in mind I wanted to put various models to the test. I started small with 6x6 grids, came up with a prompt, and saw what happened. The results were the following, which I think are all reasonably interesting - especially puzzles 2 and 4:


(p.s. I've just about had enough of Blogger - uploading an image shouldn't require turning off privacy options. A blog change is now 100% on the cards whenever I get round to it.)

These were puzzles produced by the following LLMs:
1. GPT 5.2 Thinking
2. Claude Sonnet 4.6 Thinking
3. Grok 4.1 Thinking
4. Gemini 3 Pro

I'll add a note that non-thinking models were hopeless at this task. I'll also note that I have question marks in my mind over whether Grok and Gemini fluked it: although the puzzles are fine, neither could output the corresponding solution correctly, and both failed to add the 3x2 regions correctly.
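Checking a claimed solution is mechanical, which is what made those failures easy to spot. A sketch of the sort of check I mean (my own code, not anything the models produced), assuming the usual 6x6 convention of boxes two rows tall and three columns wide:

```python
# Sketch: validate a claimed 6x6 sudoku solution against rows,
# columns, and the six 2-row-by-3-column boxes.

def valid_6x6(grid):
    """Return True if every row, column, and box contains 1..6 exactly once."""
    target = set(range(1, 7))
    rows = [set(row) for row in grid]
    cols = [set(grid[r][c] for r in range(6)) for c in range(6)]
    boxes = [
        set(grid[br + r][bc + c] for r in range(2) for c in range(3))
        for br in range(0, 6, 2)   # box top rows: 0, 2, 4
        for bc in range(0, 6, 3)   # box left columns: 0, 3
    ]
    return all(unit == target for unit in rows + cols + boxes)

# A hand-checked valid 6x6 solution for illustration.
solution = [
    [1, 2, 3, 4, 5, 6],
    [4, 5, 6, 1, 2, 3],
    [2, 3, 1, 5, 6, 4],
    [5, 6, 4, 2, 3, 1],
    [3, 1, 2, 6, 4, 5],
    [6, 4, 5, 3, 1, 2],
]
```

`valid_6x6(solution)` returns True for the grid above; any duplicated digit in a row, column, or box flips it to False.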

I'll add a second note that generating the puzzles as text seemed easy in comparison to generating the puzzles as images. The diffusion models attached to LLMs are bloody useless with puzzle grids, and require thinking and programming tool use to get right.

I'll add a third note that I don't currently have access to Claude Opus 4.6 which created the CtC puzzle.
