May the best token win

The Token Games

Evaluating Language Model Reasoning with Puzzle Duels

Simon Henniger*  ·  Gabriel Poesia*
Harvard University

The Concept

Inspired by 16th-century Italian mathematical duels, The Token Games (TTG) is a benchmark in which LLMs challenge each other by creating and solving programming puzzles, with no human-authored problems required.

Two models take turns as proposer and solver. The proposer crafts a Python function (mystery(x)) and provides a secret sample solution. The solver must find any input that makes it return True. Solutions are verified by simply running the code.
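
For concreteness, here is what a toy round might look like in this format (our illustration; the function and inputs are invented, not drawn from the benchmark):

```python
# A hypothetical TTG-style puzzle (illustrative, not from the paper).
def mystery(x: str) -> bool:
    # True iff x is a 5-letter lowercase palindrome containing 'q'.
    return (
        len(x) == 5
        and x.islower()
        and x == x[::-1]
        and "q" in x
    )

# The proposer's secret sample solution, used only to validate the puzzle.
sample_solution = "aqzqa"
assert mystery(sample_solution)

# The solver wins the round by producing *any* input that works.
solver_answer = "qabaq"
assert mystery(solver_answer)
```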

Why The Token Games?

Unlike static benchmarks, TTG cannot be saturated: stronger models can always design harder puzzles. It incentivizes creativity, since recycling known problems is suboptimal when opponents may know them too, and it tests self-calibration, since proposing a puzzle you can't solve yourself is penalized. This lets us gauge a model's overconfidence and tendency to hallucinate.

The Duel Protocol

TTG uses Programming Puzzles (each a Python function that returns a boolean) as a universal format for encoding reasoning challenges. The format is flexible enough to represent everything from simple string constraints to NP-complete problems.
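
As an illustration of that range (our own toy puzzles, not ones from the benchmark), the same format captures both a simple string constraint and an instance of NP-complete subset sum:

```python
# Two illustrative puzzles in the same format (ours, not from the benchmark).

def easy_puzzle(s: str) -> bool:
    # A simple string constraint.
    return s.startswith("ab") and s.endswith("ba") and len(s) == 6

def subset_sum_puzzle(indices: list[int]) -> bool:
    # An instance of NP-complete subset sum: pick entries summing to the target.
    nums = [3, 34, 4, 12, 5, 2]
    target = 9
    return (
        len(set(indices)) == len(indices)              # no repeated indices
        and all(0 <= i < len(nums) for i in indices)   # indices in range
        and sum(nums[i] for i in indices) == target
    )

assert easy_puzzle("abooba")
assert subset_sum_puzzle([2, 4])  # nums[2] + nums[4] == 4 + 5 == 9
```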

Each duel consists of 10 rounds. In each round, one model proposes a puzzle and the other tries to solve it; then they swap roles. The proposer scores if its puzzle is valid (its own sample solution checks out) and the opponent fails to solve it. If the proposer's own solution is wrong, the point goes to the solver instead. If the solver cracks the puzzle, the round is a draw.
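
A minimal sketch of this per-round scoring logic as stated above (the function name and signature are ours, not the harness's actual API):

```python
def score_round(puzzle, sample_solution, solver_answer):
    """Return 'proposer', 'solver', or 'draw' for one round.

    A hypothetical encoding of the rules above: the proposer scores only
    if its own sample solution verifies and the solver's answer does not.
    """
    def verifies(candidate):
        try:
            # Require literal True, matching the f(x) == True convention.
            return puzzle(candidate) is True
        except Exception:
            return False  # crashes count as failures

    if not verifies(sample_solution):
        return "solver"     # invalid puzzle: penalty, point to the solver
    if verifies(solver_answer):
        return "draw"       # solver cracked a valid puzzle
    return "proposer"       # valid puzzle the solver failed to crack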

[Diagram: the duel protocol. The proposer sends a function f ("Which x makes f(x) true?"); the harness verifies the puzzle by testing the proposer's sample solution. The solver replies with an x such that f(x) == True, which the harness verifies by running the code.]

We ran all 90 ordered pairings of 10 frontier models, each playing 10 rounds. From duel outcomes we compute Elo ratings using the Bradley-Terry model, yielding a ranking that closely matches expert-authored benchmarks like HLE (ρ = 0.75) and GPQA Diamond (ρ = 0.74) at a fraction of the cost.
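
For reference, here is a minimal sketch of fitting Bradley-Terry strengths with the standard minorization-maximization update and mapping them onto an Elo-like scale (our implementation; details such as draw handling and the 1500-point anchoring are assumptions, and the paper's exact procedure may differ):

```python
import math

def bradley_terry_elo(wins, n_iters=200):
    """Fit Bradley-Terry strengths from wins[(a, b)] = # times a beat b,
    then map them to an Elo-like scale. Draw handling is our assumption:
    split each draw as half a win for each side before calling this.
    """
    players = sorted({p for pair in wins for p in pair})
    strength = {p: 1.0 for p in players}
    for _ in range(n_iters):  # standard MM update for Bradley-Terry
        new = {}
        for p in players:
            num = sum(w for (a, _), w in wins.items() if a == p)
            num = max(num, 1e-6)  # floor keeps the log finite for winless players
            den = sum(
                (wins.get((p, q), 0) + wins.get((q, p), 0))
                / (strength[p] + strength[q])
                for q in players if q != p
            )
            new[p] = num / den if den > 0 else strength[p]
        geo = math.prod(new.values()) ** (1 / len(new))
        strength = {p: v / geo for p, v in new.items()}  # geometric mean = 1
    # Elo-like scale: 400 * log10(strength), anchored at mean 1500.
    raw = {p: 400 * math.log10(s) for p, s in strength.items()}
    shift = 1500 - sum(raw.values()) / len(raw)
    return {p: r + shift for p, r in raw.items()}

# Usage: 7 wins for A, 2 for B, 1 draw split as halves.
print(bradley_terry_elo({("A", "B"): 7.5, ("B", "A"): 2.5}))
```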

Model Performance

Performance of 10 frontier models on TTG. Solv% = fraction of puzzles the model solved when playing solver. Prop% (unsolved) = fraction of proposer rounds where the model fielded a valid puzzle that its opponent failed to solve. Penalty% = fraction of proposer rounds where the model's own sample solution was wrong.

| # | Model | Solv% | Prop% (unsolved) | Penalty% |
|---|-------|-------|------------------|----------|
| 1 | GPT-5.2 Pro | 100.0% | 50.6% | 14.4% |
| 2 | Gemini 3 Pro | 93.2% | 32.8% | 32.2% |
| 3 | Grok-4 | 91.9% | 11.9% | 25.6% |
| 4 | GPT-5 Mini | 89.1% | 18.2% | 26.7% |
| 5 | Claude Opus 4.5 | 84.9% | 15.2% | 12.2% |
| 6 | DeepSeek Reasoner | 77.4% | 24.6% | 27.8% |
| 7 | Gemini 2.5 Pro | 75.4% | 11.1% | 90.0% |
| 8 | Gemini 2.5 Flash | 73.8% | 0.0% | 52.2% |
| 9 | Claude Sonnet 4.5 | 68.3% | 4.1% | 18.9% |
| 10 | GPT-5.2 | 52.6% | 0.0% | 97.8% |

Solver vs. Proposer Ability

Are strong solvers also good proposers? We find a strong correlation (ρ = 0.85), but proposing is far harder: even GPT-5.2 Pro, which solved every puzzle, only stumped opponents 50.6% of the time as proposer.

Measuring a Model's Overconfidence

When a proposer fails to score, it's either because the puzzle was too easy (the opponent solved it) or too ambitious (the proposer's own solution was wrong, incurring a penalty). The balance between these failure modes varies dramatically across models.

GPT-5.2 is extraordinarily overconfident, failing on its own solution 97.8% of the time. Claude Opus 4.5 errs in the other direction — opponents solve its puzzles 74.4% of the time.

Puzzles Get Harder Over Time

Proposers can see the full history of the duel. Do they use it? Yes — puzzles created in later rounds are measurably harder. When GPT-5.2 and GPT-5 Mini attempt all puzzles independently, solve rates drop steadily from round 1 to round 10.
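
A sketch of that per-round analysis, assuming each attempt is recorded as a (round_index, solved) pair (the record format is our assumption):

```python
from collections import defaultdict

def solve_rate_by_round(records):
    """records: iterable of (round_index, solved) pairs, round_index in 1..10.
    Returns {round_index: fraction solved}, the curve described above.
    """
    attempts = defaultdict(int)
    solved = defaultdict(int)
    for round_index, was_solved in records:
        attempts[round_index] += 1
        solved[round_index] += bool(was_solved)
    return {r: solved[r] / attempts[r] for r in sorted(attempts)}
```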

Explore the Puzzles

Browse all 90 duels and their puzzles in our interactive duel viewer. Here are some highlights from the paper:

| Puzzle | Proposer | Solver | Outcome |
|--------|----------|--------|---------|
| String constraints with modular product | Claude Opus 4.5 | Claude Sonnet 4.5 | Solved |
| 8-digit number with 7 constraints | Claude Opus 4.5 | Claude Sonnet 4.5 | Solver failed |
| MD5 hash + number theory + XOR | Gemini 2.5 Pro | Claude Opus 4.5 | Sample solution wrong |
| Prime year + Friday the 13th date puzzle | DeepSeek Reasoner | Claude Opus 4.5 | Solved |
| Reverse == 4x palindrome | Claude Sonnet 4.5 | Claude Opus 4.5 | Solved |
| Brainfuck VM with SHA-256 gate | Gemini 2.5 Pro | GPT-5.2 | Sample solution wrong |
| 12-char string with 13 constraints | GPT-5.2 Pro | Claude Opus 4.5 | Solver failed |
| Weighted sum + symmetry + XOR chain | Claude Opus 4.5 | GPT-5 Mini | Solver failed |
| ASCII sum perfect square (trivial) | Claude Sonnet 4.5 | Grok-4 | Solved |
| 8-digit palindrome with digit product | Claude Opus 4.5 | Gemini 2.5 Pro | Solved |
| Hallucinated hex + broken XOR + SHA-256 | GPT-5.2 | Gemini 2.5 Pro | Sample solution wrong |

Citation

If you use The Token Games in your research, please cite:

@inproceedings{hennigerpoesia2026tokengames,
  title  = {The Token Games: Evaluating Language Model Reasoning with Puzzle Duels},
  author = {Henniger, Simon and Poesia, Gabriel},
  year   = {2026},
  url    = {}
}