A while ago I heard the claim that LLMs cannot play tic-tac-toe intelligently, at least not without chain-of-thought. This seems surprising! LLMs seem to be able to play chess without chain-of-thought -- shouldn't they be able to play tic-tac-toe?
To possess the kind of mind that can play chess -- a two-player game of perfect information, played on an 8 x 8 board with six kinds of pieces -- but not tic-tac-toe -- also a two-player game of perfect information, but played on a 3 x 3 board with one kind of piece -- seems strange.
So I decided to investigate.
For my test, I checked whether an LLM could steer a tic-tac-toe game that is winnable with perfect play to victory. That is, I started the LLM out with a game like this, playing as X.
   |   |
-----------
   | X |
-----------
   | O |
This is a win in three moves for X, assuming perfect play.
I tested variants starting from the 4 different rotations of this position, where the LLM had to play against a quick-and-dirty MCTS implementation that played approximately perfectly.
Again, the LLM should always win if it plays perfectly. But note that anyone playing from this position also wins roughly 11% of games, even if they play totally randomly.
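Both claims can be checked with a short simulation. The sketch below uses plain minimax as a stand-in for the MCTS opponent (so the exact win rate depends on how the opponent breaks ties between equally good moves, and need not land exactly on 11%), and verifies that the starting position is a forced win for X:

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

# Cells 0-8, row-major: X in the centre, O on the bottom edge, X to move.
START = list("    X  O ")

def winner(board):
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def empties(board):
    return [i for i, c in enumerate(board) if c == " "]

def minimax(board, player):
    """+1 if X wins with perfect play from here, -1 if O wins, 0 if drawn."""
    w = winner(board)
    if w:
        return 1 if w == "X" else -1
    moves = empties(board)
    if not moves:
        return 0
    best = None
    for m in moves:
        board[m] = player
        s = minimax(board, "O" if player == "X" else "X")
        board[m] = " "
        if best is None or (s > best if player == "X" else s < best):
            best = s
    return best

def score_o_move(board, m):
    board[m] = "O"
    s = minimax(board, "X")
    board[m] = " "
    return s

def play_random_x_vs_minimax_o(rng):
    """Random X against a perfect O (ties broken by lowest cell index)."""
    board, player = START.copy(), "X"
    while winner(board) is None and empties(board):
        if player == "X":
            m = rng.choice(empties(board))
        else:
            m = min(empties(board), key=lambda m: score_o_move(board, m))
        board[m] = player
        player = "O" if player == "X" else "X"
    return winner(board)

print(minimax(START.copy(), "X"))  # 1: a forced win for X

rng = random.Random(0)
games = 500
x_wins = sum(play_random_x_vs_minimax_o(rng) == "X" for _ in range(games))
print(x_wins / games)  # well above zero, well below 1
```

Because the position is a forced win for X, a perfect opponent can never beat a perfect X here; the only question is how often random play stumbles into the win anyway.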
Representing a tic-tac-toe board as text for an LLM is nontrivial, so I tested three different representations of the board.
// "grid"
   1   2   3
 ─────────────
1│   │   │   │
 ─────────────
2│   │ X │   │
 ─────────────
3│   │ O │   │
 ─────────────
// "numbered"
 1 | 2 | 3
---+---+---
 4 | X | 5
---+---+---
 6 | O | 7
// "minimal"
1 2 3
4 X 5
6 O 7
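All three formats describe the same underlying position; here's a small helper (my own sketch, not code from the experiment) that renders a board in the "numbered" and "minimal" styles, numbering empty cells in reading order:

```python
def labels(board):
    """Label empty cells 1..n in reading order; keep X and O marks as-is.
    `board` is a 9-element list, row-major, with ' ' for empty cells."""
    out, n = [], 0
    for c in board:
        if c == " ":
            n += 1
            out.append(str(n))
        else:
            out.append(c)
    return out

def render_minimal(board):
    cells = labels(board)
    return "\n".join(" ".join(cells[r * 3:r * 3 + 3]) for r in range(3))

def render_numbered(board):
    cells = labels(board)
    rows = [" | ".join(cells[r * 3:r * 3 + 3]) for r in range(3)]
    return "\n---+---+---\n".join(rows)

board = list("    X  O ")  # X in the centre, O on the bottom edge
print(render_minimal(board))
print(render_numbered(board))
```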
I tested this with 4-shot prompting, where each prior example showed the assistant making an ideal move, in the right format, in response to a game state in one of the 3 formats. To win, the LLM needed to make (one of) the "right" moves several times in a row, judging from the board state alone.
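Assembling such a prompt looks roughly like this. The system instruction, the example boards, and the reply format below are all my own placeholders, not the exact prompts used in the experiment:

```python
import random

# Four (board, ideal-move) few-shot pairs in the "minimal" format.
# The boards and replies here are illustrative stand-ins.
EXAMPLES = [
    ("1 2 3\n4 X 5\n6 O 7", "1"),  # take a corner to set up a fork
    ("X 1 2\n3 O 4\n5 6 7", "7"),  # take the opposite corner
    ("1 2 X\n3 O 4\n5 6 7", "5"),
    ("1 2 3\n4 O 5\nX 6 7", "3"),
]

def build_messages(examples, current_board, rng):
    """Assemble a shuffled 4-shot chat prompt ending with the live position."""
    examples = examples.copy()
    rng.shuffle(examples)  # the post shuffles the few-shot examples
    messages = [{"role": "system",
                 "content": "You are playing tic-tac-toe as X. "
                            "Reply with the number of the cell you move to."}]
    for board, move in examples:
        messages.append({"role": "user", "content": board})
        messages.append({"role": "assistant", "content": move})
    messages.append({"role": "user", "content": current_board})
    return messages

msgs = build_messages(EXAMPLES, "1 2 3\n4 X 5\n6 O 7", random.Random(0))
```

The resulting list of role-tagged messages is the shape most chat-completion APIs accept, with the model's answer parsed back into a cell number.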
In general, everyone's LLMs did pretty poorly. There's some randomness in the numbers below, because I shuffled the 4-shot examples, but the (extremely approximate) scores are as follows:
gpt-3.5-turbo: Wins ~25% of games on minimal, ~0% otherwise
gpt-4-0125: Wins ~37% of games on minimal and numbered, and only ~12.5% on grid
claude-3-opus-20240229: Wins ~12.5% of games on grid and minimal
Given that you win ~11% of games from this starting position even making entirely random moves, this is not a very good showing. Some models are probably better than random, some are not, and none come anywhere near perfect play.
So, without chain-of-thought prompting -- or, for instance, hundreds of in-context examples acting like a lookup table -- LLMs currently do pretty badly at tic-tac-toe.