1 A 3 O R N

LLMs Still Bad At Simple Games

Created: 2025-03-31

Wordcount: 0.8k

Tags: empiricism effort-post machine-learning

I've remarked before on how LLMs are bad at tic-tac-toe. I've expanded my experiments to look at how LLMs do at numerous small games.

Specifically, I look at deterministic, two-player games of perfect information where the first player can guarantee a win with perfect play. The benchmark measures how many such games the LLM can win as the first player. It includes variants of existing games (3x4 Tic-Tac-Toe, a smaller version of Connect 4) as well as established games (Nim, Grundy's Game, Kayles, Turning Turtles).
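To make "the first player can guarantee a win" concrete, here is a minimal sketch of how that property can be checked by exhaustive search on a small game. It uses Subtract a Square as the example and assumes the usual normal-play rules (remove a positive perfect square from a single heap; the player who cannot move loses). The function names are illustrative and are not taken from the benchmark's code, whose exact rules may differ.

```python
from functools import lru_cache


def square_moves(n):
    """Yield the heap sizes reachable from n by removing a positive perfect square."""
    k = 1
    while k * k <= n:
        yield n - k * k
        k += 1


@lru_cache(maxsize=None)
def mover_wins(n):
    """True if the player to move from heap size n can force a win.

    A position is winning if at least one move leads to a position
    that is losing for the opponent. Heap size 0 has no moves, so
    the player to move there loses (assumed normal-play convention).
    """
    return any(not mover_wins(m) for m in square_moves(n))


# Heap sizes below 13 from which the player to move cannot force a win.
losing_positions = [n for n in range(13) if not mover_wins(n)]
```

Under these assumptions, a starting heap qualifies for a benchmark like this one only if `mover_wins(start)` is true; the same pattern works for any game small enough to enumerate.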

For all games I provide complete rules beforehand.

Of the LLMs I tested, the best performer was DeepSeek-V3-0324.

There's an asterisk by this result: I didn't test OpenAI's expensive reasoning models (o1 through o1-pro), or Google's models with their absurdly low limits on requests per minute. I expect some of them would do better than DeepSeek-V3-0324, but I'm just not gonna spend like... 200 bucks? 1000 bucks? on o1-pro. (Each game requires multiple invocations of the API, so it adds up quickly.)

Among the models I did test, DeepSeek at least seems to be, by far, the Pareto-optimal choice on the cost-performance trade-off.

Here's the performance of what I looked at:

[Figure: Games Won]

Here's a link to the GH repo.

Here are the games it includes:

Tic-Tac-Toe on 3x4
Tic-Tac-Toe on 4x3
Count 21
Connect 3 on 4x5 (Connect 4 on a smaller board)
Connect 3 on 5x4 (ditto)
Kayles
Wythoff's Nim
Subtract a Square
Grundy's Game
Domineering
Turning Turtles
Nim
Coin Counter (original game)
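Some of the games above have well-known closed-form solutions, which is part of what makes failures on them notable. As an illustration, here is a sketch of the classic nim-sum rule for standard Nim under normal play (the player who takes the last object wins): the player to move can force a win exactly when the XOR of the heap sizes is nonzero. The heap sizes used here and the function name are assumptions for the example, not the benchmark's actual configuration.

```python
from functools import reduce
from operator import xor


def nim_winning_move(heaps):
    """Return (heap_index, amount_to_remove) for a winning move, or None.

    Assumes normal-play Nim. A winning move leaves the XOR of all
    heap sizes equal to zero; if it is already zero, no such move exists.
    """
    nim_sum = reduce(xor, heaps, 0)
    if nim_sum == 0:
        return None
    for i, heap in enumerate(heaps):
        target = heap ^ nim_sum  # size this heap must shrink to
        if target < heap:
            return i, heap - target
    return None  # unreachable when nim_sum != 0, kept for clarity
```

For heaps `(3, 4, 5)` the nim-sum is 2, so removing 2 objects from the first heap leaves `(1, 4, 5)`, whose XOR is zero.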

In all cases, I've verified that it's possible to guarantee a win by playing against the AI opponent (an MCTS player) myself.

(The MCTS doesn't give you quite perfect opposing play, because "perfect opposing play" is a somewhat ill-defined notion when playing from a losing position. In practice it beats most LLMs most of the time, because most of them just cannot make enough good moves.)

Note that the LLMs' bad play cannot be attributed merely to difficulty reading a grid; they also remain bad at many non-grid-based games. DeepSeek V3 is reasonably good at some grid-based games (Tic-Tac-Toe) but very bad at Connect-N style games.

And indeed, inspection of their trains of thought confirms that they have... weirdly dumb lines of thought, independent of their perception of the board. For instance, here DeepSeek V3 correctly sees that its opponent in tic-tac-toe could win in one move, but fails to block it for mostly incoherent reasons.

Role: user
Content: Your opponent made move 2,2
After he made this move, the state was:
Turn: X
Board:
  0   1   2
0 X |   |  
  ----------
1   | X |  
  ----------
2   | O | O
  ----------
3   |   |  

You now have available the following moves: 0,1, 0,2, 1,0, 1,2, 2,0, 3,0, 3,1, 3,2

What is your next move? Think as long as you need to, then respond with your move in ... tags.
------------------------------
Role: assistant
Content: ### **Analysis of the Current Board:**
- **My moves (X):** (0,0), (1,1)  
- **Opponent's moves (O):** (2,1), (2,2)  

### **Threats to Address:**
1. **Opponent's Immediate Threat:**  
   - If I don't block, opponent can play **(2,0)** next turn to complete a horizontal 3-in-a-row: **O at (2,0)-(2,1)-(2,2)**.  
   - **I must block (2,0) to prevent this loss.**  

2. **My Winning Opportunities:**  
   - **Diagonal Threat:** If I play **(0,2)**, I set up a diagonal **(0,2)-(1,1)-(2,0)**.  
     - However, opponent can block this by taking (2,0) next turn.  
   - **Vertical Threat:** If I play **(1,2)**, I set up a potential vertical **(1,2)-(2,2)-(3,2)** later.  
   - **Fork Potential:** If I take **(1,0)**, I create two threats:  
     - Horizontal **(1,0)-(1,1)-(1,2)**  
     - Diagonal **(1,0)-(2,1)-(3,2)**  
     - This would force opponent to block one, letting me win on the other.  

### **Best Move: Block Opponent's Threat While Creating My Own**  
- **Block (2,0)** to stop opponent from winning immediately.  
- **Simultaneously, set up a winning diagonal (0,0)-(1,1)-(2,2) by taking (2,2), but it's already taken by O.**  
- **Alternative:** Play **(1,0)** to create a fork (two winning threats).  

### **Final Decision: (1,0)**  
- Blocks opponent’s potential **(2,0)-(2,1)-(2,2)** by making (2,0) less threatening.  
- Creates two strong threats:  
  - **Horizontal (1,0)-(1,1)-(1,2)**  
  - **Diagonal (1,0)-(2,1)-(3,2)**  
- Forces opponent into a defensive position.  

1,0

This doesn't make much sense. I don't know why this happens.
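To see how little computation the correct answer needs, here is a small sketch that lists every square where a player could complete a line on their next move. It assumes what the transcript implies: a board of 4 rows by 3 columns, coordinates given as (row, column), and three in a row (horizontal, vertical, or diagonal) wins. The helper names are hypothetical and not from the benchmark code.

```python
from itertools import product

ROWS, COLS, K = 4, 3, 3  # assumed board size and winning line length


def lines():
    """Yield every run of K in-bounds cells along a row, column, or diagonal."""
    directions = ((0, 1), (1, 0), (1, 1), (1, -1))
    for (r, c), (dr, dc) in product(product(range(ROWS), range(COLS)), directions):
        cells = [(r + i * dr, c + i * dc) for i in range(K)]
        if all(0 <= rr < ROWS and 0 <= cc < COLS for rr, cc in cells):
            yield cells


def immediate_wins(board, player):
    """Return the sorted empty cells where `player` would complete a line."""
    wins = set()
    for cells in lines():
        marks = [board.get(cell) for cell in cells]
        if marks.count(player) == K - 1 and marks.count(None) == 1:
            wins.add(cells[marks.index(None)])
    return sorted(wins)


# Position from the transcript above; empty cells are simply absent.
board = {(0, 0): "X", (1, 1): "X", (2, 1): "O", (2, 2): "O"}
o_threats = immediate_wins(board, "O")
x_wins = immediate_wins(board, "X")
```

On this position `o_threats` contains only `(2, 0)` and `x_wins` is empty, so X has no win available and exactly one square to block. Under the same assumptions, playing `(1, 0)` does create threats for X, but O moves next and simply completes the row at `(2, 0)`.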

DeepSeek is also sometimes cute, and allows itself to celebrate after figuring out a win:

Role: assistant
Content: Perfect! Now we can win the game immediately. The current number is **18**, and we have the opportunity to count to **21** on this turn.  

Since the goal is to be the player who says **21**, we should count **3 numbers** (i.e., say "19, 20, 21") to win the game.  

Thus, the winning move is to count to **21**.  

21  

(We win!)
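For reference, the winning strategy in this game has a simple closed form, assuming the rules implied by the transcript: players alternate saying one to three consecutive numbers, and whoever says 21 wins. The sketch below is illustrative only; the function name and default parameters are assumptions rather than anything from the repo.

```python
def numbers_to_say(current, target=21, max_step=3):
    """Return how many numbers to say to stay on a winning path, or None.

    Assumed rules: each turn a player says 1..max_step consecutive
    numbers, and the player who says `target` wins. The winning idea
    is to always leave the opponent a multiple of (max_step + 1)
    numbers away from the target.
    """
    remainder = (target - current) % (max_step + 1)
    return remainder or None  # remainder 0 means every move loses to best play
```

From a count of 18 this returns 3 (say "19, 20, 21"), matching DeepSeek's move. From 0 it returns 1, which is consistent with the first player being able to force a win; from 17 it returns `None`, since any move lets the opponent reach 21.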