A while ago I heard the claim that LLMs cannot play tic-tac-toe intelligently, at least not without chain-of-thought. This seems surprising! LLMs seem to be able to play chess without chain-of-thought -- shouldn't they be able to play tic-tac-toe?
To possess the kind of mind that can play chess -- a two-player game of perfect information, played on an 8 x 8 board with six kinds of pieces -- but not tic-tac-toe -- also a two-player game of perfect information, but played on a 3 x 3 board with one kind of piece -- seems strange.
So I decided to investigate.
For my test, I checked whether an LLM could steer a tic-tac-toe game that is winnable with perfect play to victory. That is, I started the LLM out with a game like this, playing as X.
   |   |
-----------
   | X |
-----------
   | O |
This is a win in three moves for X, assuming perfect play.
I tested variants starting from the 4 different rotations of this position, where the LLM had to play against a quick-and-dirty MCTS implementation that played approximately perfectly.
Again, the LLM should always win if it plays perfectly. But note that anyone playing from this position also wins roughly 11% of games, even if they play totally randomly.
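Both claims can be checked with a short simulation. The sketch below uses plain minimax as a stand-in for the MCTS opponent (so the exact win rate depends on how the opponent breaks ties between equally good moves, and need not land exactly on 11%), and verifies that the starting position is a forced win for X:

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

# Cells 0-8, row-major: X in the centre, O on the bottom edge, X to move.
START = list("    X  O ")

def winner(board):
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def empties(board):
    return [i for i, c in enumerate(board) if c == " "]

def minimax(board, player):
    """+1 if X wins with perfect play from here, -1 if O wins, 0 if drawn."""
    w = winner(board)
    if w:
        return 1 if w == "X" else -1
    moves = empties(board)
    if not moves:
        return 0
    best = None
    for m in moves:
        board[m] = player
        s = minimax(board, "O" if player == "X" else "X")
        board[m] = " "
        if best is None or (s > best if player == "X" else s < best):
            best = s
    return best

def score_o_move(board, m):
    board[m] = "O"
    s = minimax(board, "X")
    board[m] = " "
    return s

def play_random_x_vs_minimax_o(rng):
    """Random X against a perfect O (ties broken by lowest cell index)."""
    board, player = START.copy(), "X"
    while winner(board) is None and empties(board):
        if player == "X":
            m = rng.choice(empties(board))
        else:
            m = min(empties(board), key=lambda m: score_o_move(board, m))
        board[m] = player
        player = "O" if player == "X" else "X"
    return winner(board)

print(minimax(START.copy(), "X"))  # 1: a forced win for X

rng = random.Random(0)
games = 500
x_wins = sum(play_random_x_vs_minimax_o(rng) == "X" for _ in range(games))
print(x_wins / games)  # well above zero, well below 1
```

Because the position is a forced win for X, a perfect opponent can never beat a perfect X here; the only question is how often random play stumbles into the win anyway.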
Representing a tic-tac-toe board as text for an LLM is nontrivial, so I tested three different representations of the board.
// "grid"
   1   2   3
 ─────────────
1│   │   │   │
 ─────────────
2│   │ X │   │
 ─────────────
3│   │ O │   │
 ─────────────
// "numbered"
 1 | 2 | 3
---+---+---
 4 | X | 5
---+---+---
 6 | O | 7
// "minimal"
1 2 3
4 X 5
6 O 7
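All three formats describe the same underlying position; here's a small helper (my own sketch, not code from the experiment) that renders a board in the "numbered" and "minimal" styles, numbering empty cells in reading order:

```python
def labels(board):
    """Label empty cells 1..n in reading order; keep X and O marks as-is.
    `board` is a 9-element list, row-major, with ' ' for empty cells."""
    out, n = [], 0
    for c in board:
        if c == " ":
            n += 1
            out.append(str(n))
        else:
            out.append(c)
    return out

def render_minimal(board):
    cells = labels(board)
    return "\n".join(" ".join(cells[r * 3:r * 3 + 3]) for r in range(3))

def render_numbered(board):
    cells = labels(board)
    rows = [" | ".join(cells[r * 3:r * 3 + 3]) for r in range(3)]
    return "\n---+---+---\n".join(rows)

board = list("    X  O ")  # X in the centre, O on the bottom edge
print(render_minimal(board))
print(render_numbered(board))
```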
I tested this with 4-shot prompting, where each prior example showed the assistant making an ideal move, in the right format, in response to a game state in one of the 3 formats. To win, the LLM needed to make (one of) the "right" moves several times in a row, judging from the board state alone.
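Assembling such a prompt looks roughly like this. The system instruction, the example boards, and the reply format below are all my own placeholders, not the exact prompts used in the experiment:

```python
import random

# Four (board, ideal-move) few-shot pairs in the "minimal" format.
# The boards and replies here are illustrative stand-ins.
EXAMPLES = [
    ("1 2 3\n4 X 5\n6 O 7", "1"),  # take a corner to set up a fork
    ("X 1 2\n3 O 4\n5 6 7", "7"),  # take the opposite corner
    ("1 2 X\n3 O 4\n5 6 7", "5"),
    ("1 2 3\n4 O 5\nX 6 7", "3"),
]

def build_messages(examples, current_board, rng):
    """Assemble a shuffled 4-shot chat prompt ending with the live position."""
    examples = examples.copy()
    rng.shuffle(examples)  # the post shuffles the few-shot examples
    messages = [{"role": "system",
                 "content": "You are playing tic-tac-toe as X. "
                            "Reply with the number of the cell you move to."}]
    for board, move in examples:
        messages.append({"role": "user", "content": board})
        messages.append({"role": "assistant", "content": move})
    messages.append({"role": "user", "content": current_board})
    return messages

msgs = build_messages(EXAMPLES, "1 2 3\n4 X 5\n6 O 7", random.Random(0))
```

The resulting list of role-tagged messages is the shape most chat-completion APIs accept, with the model's answer parsed back into a cell number.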
In general, everyone's LLMs did pretty poorly. There's some randomness in the numbers below, because I shuffled the 4-shot examples, but the (extremely approximate) scores are as follows:
gpt-3.5-turbo: Wins ~25% of games on minimal, ~0% otherwise
gpt-4-0125: Wins ~37% of games on minimal and numbered, and only ~12.5% on grid
claude-3-opus-20240229: Wins ~12.5% of games on grid and minimal
Given that you win ~11% of games from this starting position even making entirely random moves, this is not a very good showing. Some models are probably better than random, some are not, and none come anywhere near perfect play.
So, without chain-of-thought prompting -- or, for instance, hundreds of in-context examples acting like a lookup table -- LLMs currently do pretty badly at tic-tac-toe.