The rule-following crisis in AI game playing

Despite remarkable advances in reasoning capabilities, large language models still struggle with a basic task: simply following the rules. In the recent Kaggle GameArena chess competition, 78% of losses by Google’s Gemini-2.5-Flash weren’t due to poor strategy but to attempting illegal moves.
This disconnect between understanding and execution reveals a fundamental limitation in how current AI systems interact with structured environments.
The traditional solutions have proven inadequate. Fine-tuning models on game-specific data is expensive and risks degrading performance on other tasks. Hand-coded rule checkers require extensive human labor for each new game and break easily when game variations emerge.
Both approaches fail to scale across the diverse landscape of potential applications where AI agents need to operate within strict constraints.

Teaching AI to write its own safety rails
AutoHarness takes a different approach: instead of humans writing code to prevent AI failures, the AI writes the code itself.
The system treats code generation as a search problem, using Thompson sampling to explore different candidate programs while learning from environment feedback.
It works through an iterative loop:
- The AI proposes code functions that either generate legal moves or verify proposed actions.
- When the code makes mistakes, the environment returns clear error messages.
- A key component groups these errors together, and the AI uses that feedback to improve its code.
This process continues until the system reaches perfect accuracy or runs out of time.
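The loop above can be sketched in code. This is a minimal, hypothetical illustration of Thompson sampling over candidate harness programs, not the paper's actual implementation: the function names (`run_episode`, `propose_revision`) and the convergence criterion (a streak of clean episodes standing in for "perfect accuracy") are assumptions for the sketch.

```python
import random

def thompson_refine(candidates, run_episode, propose_revision, max_iters=50):
    """Thompson sampling over candidate harness programs (illustrative sketch).

    run_episode(code) -> (ok, error_msg): plays the code against the environment.
    propose_revision(code, error_msg) -> new_code: the LLM rewrites failing code.
    """
    stats = [[1, 1] for _ in candidates]   # Beta(successes+1, failures+1) posterior
    streak = [0] * len(candidates)         # consecutive clean episodes per candidate
    for _ in range(max_iters):
        # Sample a plausible success rate for each candidate, pick the best draw.
        i = max(range(len(candidates)), key=lambda k: random.betavariate(*stats[k]))
        ok, error_msg = run_episode(candidates[i])
        if ok:
            stats[i][0] += 1
            streak[i] += 1
            if streak[i] >= 10:            # proxy for "perfect accuracy"
                return candidates[i]
        else:
            stats[i][1] += 1
            streak[i] = 0
            # Feed the environment's error message back for a rewritten candidate.
            candidates.append(propose_revision(candidates[i], error_msg))
            stats.append([1, 1])
            streak.append(0)
    # Out of budget: return the candidate with the best empirical success rate.
    best = max(range(len(candidates)), key=lambda k: stats[k][0] / sum(stats[k]))
    return candidates[best]
```

The Beta-Bernoulli posterior lets the search balance exploiting candidates that already pass episodes against exploring fresh rewrites spawned from error feedback.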
What makes this approach especially powerful is its flexibility.
The researchers tested three configurations:
- Legal-move filtering: Code generates legal moves that the AI then ranks.
- Move verification: Code checks whether the AI’s proposed moves follow the rules.
- Full policy mode: Code handles the entire game-playing policy without needing the language model during runtime.
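The three configurations can be pictured as three interfaces that the synthesized code may implement. The class and method names below are assumptions for this sketch, not the paper's actual API:

```python
from typing import Protocol, Sequence

class LegalMoveFilter(Protocol):
    def legal_moves(self, state) -> Sequence[str]:
        """Generate the legal moves; the LLM then ranks them."""

class MoveVerifier(Protocol):
    def is_legal(self, state, move: str) -> bool:
        """Check whether an LLM-proposed move follows the rules."""

class FullPolicy(Protocol):
    def act(self, state) -> str:
        """Pick the move directly; no LLM call needed at runtime."""

def choose_move(state, llm_rank, harness):
    """Dispatch on whichever harness mode was synthesized (illustrative)."""
    if hasattr(harness, "act"):                  # full policy mode
        return harness.act(state)
    if hasattr(harness, "legal_moves"):          # legal-move filtering
        return llm_rank(state, harness.legal_moves(state))[0]
    proposed = llm_rank(state, None)[0]          # move verification
    return proposed if harness.is_legal(state, proposed) else None
```

The full-policy mode is checked first because it bypasses the language model entirely, which is what makes it attractive for runtime cost.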

Small models with tools beat large models without them
The results challenge a common assumption about model scaling. When equipped with its self-generated harness, Gemini-2.5-Flash achieved a 56.3% win rate against the much larger Gemini-2.5-Pro in two-player games.
The larger model, relying only on its internal reasoning, managed just 38.2%.
Across the 145 tested games, the system delivered:
- 100% legal action rates across all environments
- An average of 14.5 code refinement iterations before converging
- Faster convergence for simpler games, often under 10 iterations
Complex games like Chess, Othello, and Cryptarithm required the most iterations. The generated code also showed surprising depth, from implementing Universal Chess Interface parsing to handling probabilistic reasoning in Minesweeper.
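To give a flavor of what a synthesized harness looks like for a simple game, here is a hand-written sketch in the same spirit: legal-move generation and verification for Tic-Tac-Toe. Tic-Tac-Toe is our illustration here; the structured error messages are the kind of feedback the refinement loop consumes.

```python
def legal_moves(board):
    """Return indices of empty cells on a 9-cell board string ('.' = empty)."""
    return [i for i, cell in enumerate(board) if cell == "."]

def verify_move(board, move):
    """Return (ok, error) so a failed check yields a clear, actionable message."""
    if not 0 <= move < 9:
        return False, f"move {move} out of range 0-8"
    if board[move] != ".":
        return False, f"square {move} already occupied by {board[move]}"
    return True, None
```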

Why code beats scale
The success of AutoHarness reveals several important insights about the future of AI systems. First, it validates a hybrid approach where neural networks handle high-level synthesis while symbolic logic manages strict constraints.
The language model’s role shifts from trying to internally simulate complex rule systems to generating verifiable code that handles those rules externally.
This division of labor plays to each component’s strengths. Language models excel at understanding intent and generating creative solutions, but struggle with precise state tracking. Traditional code excels at rule enforcement and state management but requires human expertise to write.
By combining them, AutoHarness achieves reliability that neither could provide alone.
The cost implications are equally significant. A smaller model that can synthesize appropriate tools becomes more valuable than a larger model operating without them.
This suggests that the path to more capable AI agents may not require ever-larger models but rather models that can better construct their own cognitive scaffolding.

The path forward for reliable AI agents
AutoHarness addresses one of the fundamental bottlenecks in deploying AI agents: reliability.
By offloading rule compliance to verifiable code rather than relying on the model’s internal simulation, it offers a scalable path toward autonomous agents that can operate safely in constrained environments.
The researchers outline several promising directions for future work.
One involves distilling the domain-specific experts back into the base model, creating a recursively self-improving system. Another explores building libraries of reusable harnesses that could be adapted across similar tasks.
The approach could also extend to more complex multimodal environments like Craftax and Terra Nova.
The framework demonstrates that sometimes the most powerful enhancement for an AI system isn’t more parameters but better tools, especially when those tools come from the AI itself.
As we push toward more autonomous AI systems, the ability to self-generate reliable interfaces with the world becomes crucial. AutoHarness shows that this self-construction is not only possible but can lead to systems that outperform much larger models while remaining cost-effective and verifiable.
In the evolution of AI agents, it seems that learning to build better tools may matter more than simply growing bigger brains.




