A while ago I heard the claim that LLMs cannot play tic-tac-toe intelligently, at least not without chain-of-thought. This seems surprising! LLMs seem to be able to play chess without chain-of-thought -- surely they should be able to play tic-tac-toe?
To possess the kind of mind that can play chess -- a two-player game of perfect information, played on an 8 x 8 board with many kinds of pieces -- but not tic-tac-toe -- also a two-player game of perfect information, but played on a 3 x 3 board with one kind of piece -- would be strange.
So I decided to investigate.
For my test, I checked whether an LLM could steer a winnable-with-perfect-moves tic-tac-toe game to victory. That is, I started the LLM out with a game like this, playing as X.
```
   |   |
-----------
   | X |
-----------
   | O |
```
This is a victory in three moves for X, assuming perfect play.
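This claim is easy to check mechanically. Here is a minimal negamax solver (my own sketch, not part of the original test harness), assuming boards are 9-character strings with `' '` for an empty square, which confirms that the position is a forced win for X:

```python
# A sketch, not the author's code: exhaustively solve tic-tac-toe positions.
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 'X' or 'O' if someone has completed a line, else None."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, player):
    """+1 if `player` (to move) can force a win, 0 for a draw, -1 for a forced loss."""
    if winner(board) is not None:
        return -1  # the previous player just completed a line
    if ' ' not in board:
        return 0
    other = 'O' if player == 'X' else 'X'
    return max(-value(board[:i] + player + board[i + 1:], other)
               for i, cell in enumerate(board) if cell == ' ')

# X in the centre (index 4), O on the bottom edge (index 7), X to move.
print(value('    X  O ', 'X'))  # → 1, a forced win for X
```

For comparison, `value('         ', 'X')` is 0: the empty board is a draw under perfect play, which is why the test has to start from a position where O has already made a mistake.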
I tested variants starting from the 4 different rotations of this position, where the LLM had to play against a quick-and-dirty MCTS implementation that played approximately perfectly.
Again, the LLM should always win if it plays perfectly. But note that a player starting from this position also wins ~11% of games even with totally random moves.
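As a sketch of how cheaply one can get approximately perfect tic-tac-toe play, here is a flat Monte Carlo move chooser in the spirit of that quick-and-dirty opponent (my own illustration, not the actual opponent used; boards are assumed to be 9-character strings with `' '` for empty squares):

```python
# A sketch of a flat Monte Carlo player: score each legal move by random
# playouts and pick the best. Not the author's implementation.
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def playout(board, player, rng):
    """Play uniformly random moves to the end; return the winning mark or None."""
    board = list(board)
    while True:
        w = winner(board)
        if w or ' ' not in board:
            return w
        i = rng.choice([i for i, c in enumerate(board) if c == ' '])
        board[i] = player
        player = 'O' if player == 'X' else 'X'

def flat_monte_carlo_move(board, player, playouts=500, rng=None):
    """Return the legal move with the best random-playout score for `player`."""
    rng = rng or random.Random()
    other = 'O' if player == 'X' else 'X'

    def score(i):
        nxt = board[:i] + player + board[i + 1:]
        total = 0
        for _ in range(playouts):
            w = playout(nxt, other, rng)
            total += (w == player) - (w == other)
        return total

    legal = [i for i, c in enumerate(board) if c == ' ']
    return max(legal, key=score)
```

For example, with X on squares 0 and 1 it takes the immediate win at square 2: `flat_monte_carlo_move('XX O  O  ', 'X')` returns `2`. This plays well but not provably perfectly; a real harness would want the exhaustive solver or a proper MCTS with a tree.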
Tic-tac-toe is hard to represent for an LLM, so I tested out three different representations of the tic-tac-toe board.
```
// "grid"
   1   2   3
 ───────────
1│   │   │   │
 ───────────
2│   │ X │   │
 ───────────
3│   │ O │   │
 ───────────

// "numbered"
 1 | 2 | 3
---+---+---
 4 | X | 5
---+---+---
 6 | O | 7

// "minimal"
1 2 3
4 X 5
6 O 7
```
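As an illustration of how such strings can be produced, here is a sketch (my own, again assuming 9-character board strings with `' '` for empty squares) of renderers for the minimal and numbered formats, where the empty squares are numbered 1..k in reading order:

```python
# A sketch of board renderers, not the author's code.

def render_minimal(board):
    """Render a 9-char board string in the "minimal" format."""
    n, rows = 1, []
    for r in range(3):
        row = []
        for c in range(3):
            cell = board[3 * r + c]
            if cell == ' ':
                row.append(str(n))  # number the empty squares in reading order
                n += 1
            else:
                row.append(cell)
        rows.append(' '.join(row))
    return '\n'.join(rows)

def render_numbered(board):
    """Render the same labelling with | and ---+--- separators."""
    cells = [row.split(' ') for row in render_minimal(board).split('\n')]
    lines = [' ' + ' | '.join(row) for row in cells]
    return '\n---+---+---\n'.join(lines)

print(render_minimal('    X  O '))
# → 1 2 3
#   4 X 5
#   6 O 7
```

The grid format would need box-drawing characters but follows the same pattern.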
I tested this with 4-shot prompting, where each prior example showed the assistant making an ideal move, in the correct format, in response to a game state in one of the 3 formats. To win, the LLM needed to make (one of) the "right" moves several times in a row, judging each move from the game state alone.
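A shuffled few-shot prompt of this kind might be assembled like this (the helper name, message shapes, system prompt, and example moves are my own illustration, not the actual harness):

```python
# A sketch of assembling a shuffled k-shot chat prompt, not the author's harness.
import random

def build_fewshot_messages(examples, current_board, rng=None):
    """Each example is a (board_text, move_text) pair shown as a user/assistant
    turn, in shuffled order, followed by the live game state as the final user turn."""
    rng = rng or random.Random()
    examples = list(examples)
    rng.shuffle(examples)  # the shots were shuffled between games
    messages = [{"role": "system",
                 "content": "You are playing tic-tac-toe as X. Reply with your move only."}]
    for board_text, move_text in examples:
        messages.append({"role": "user", "content": board_text})
        messages.append({"role": "assistant", "content": move_text})
    messages.append({"role": "user", "content": current_board})
    return messages

# Hypothetical 4-shot examples in the "minimal" format; the moves here are
# placeholders, not verified ideal moves.
shots = [("1 2 3\n4 X 5\n6 O 7", "3"),
         ("X 1 2\n3 O 4\n5 6 7", "7"),
         ("1 X 2\nO 3 4\n5 6 7", "3"),
         ("1 2 X\n3 4 O\n5 6 7", "5")]
msgs = build_fewshot_messages(shots, "1 2 3\n4 X 5\n6 O 7")
```

The resulting `msgs` list is what would go to a chat-completions style API, with the model's reply parsed as its move.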
In general, everyone's LLMs did pretty poorly. There's some randomness in the numbers below, because I shuffled the 4-shot prompts, but the (extremely approximate) scores are as follows:
gpt-3.5-turbo
: Wins ~25% of games on minimal, ~0% otherwise.

gpt-4-0125
: Wins ~37% of games on minimal and numbered, and only ~12.5% on grid.

claude-3-opus-20240229
: ~12.5% on grid and minimal.
Given that entirely random moves win ~11% of games from this starting position, this is not a very good showing. Some models are probably better than random, some are not, and none comes anywhere near perfect play.
So, without chain-of-thought prompting -- or, say, hundreds of in-context examples acting like a lookup table -- LLMs currently do pretty badly at tic-tac-toe.