When AI Plays Dungeons & Dragons: Researchers Test Agents at the Gaming Table

25 January 2026

Researchers at the University of California San Diego have had large language models play Dungeons & Dragons – not for fun, but as a benchmark for long-term planning, team coordination and rule compliance. The result: large models like Claude 3.5 Haiku play surprisingly well, but falter in long scenarios. That also shows where business agents reach their limits.

It sounds like a thought experiment: you take the world’s most powerful language models and seat them at a virtual gaming table. They take on characters, roll dice, fight, plan strategies and interact with a Dungeon Master. Not because the researchers are nerds – though they probably are – but because Dungeons & Dragons is a perfect testing ground for agentic AI. It’s dialogue-based, has strict rules, requires long-term strategy and demands role and character consistency. In short: it’s complex, multi-stage and unpredictable. Exactly like the real world, in which AI agents will soon conduct negotiations, control processes and make decisions.

The Idea: D&D as Benchmark for Agents

The researchers’ goal was to test how well LLMs function as agents in complex, long-running scenarios – planning consistently over many turns, following rules and acting as a team. D&D was chosen because it combines all of this: it’s not just a game, it’s a system of rules, states, resources, goals and social interaction. A perfect representation of what business agents will later need to do.

How the AI Played D&D

The researchers built a D&D simulation environment – “D&D Agents” – with a game engine and tools through which the LLMs could query states and execute actions such as movement, attacks and spells. The models took on various roles: Dungeon Master (rule and world management), player characters and monsters in tactically complex combat scenes. Twenty-seven known combat scenarios were simulated, including “Goblin Ambush” and “Klarg’s Cave”, and the LLMs played against each other and against around 2,000 experienced human players who served as a benchmark.

A typical sequence: the game engine provides maps, resources, permitted actions and the current state, and functions as a “guardrail” to reduce rule-violating hallucinations. The LLM describes thoughts and considerations, selects an action – “Move behind cover and attack with longbow” – and calls the corresponding move via a tool. The new state – positions, hit points, effects – flows back as context into the model, so it must plan over many rounds.
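
The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' actual "D&D Agents" code: the `Engine` class, its `move` and `attack` tools, the grid distances, the fixed damage value and the `scripted_policy` function (a stand-in for the LLM) are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Combatant:
    hp: int
    position: tuple


@dataclass
class Engine:
    """Hypothetical rules engine: owns the state and validates every tool call."""
    combatants: dict
    rejected: list = field(default_factory=list)

    def observe(self):
        # Snapshot that is fed back to the model as context after each action.
        return {name: {"hp": c.hp, "position": c.position}
                for name, c in self.combatants.items()}

    @staticmethod
    def _distance(a, b):
        # Grid (Manhattan) distance; an assumption made for simplicity.
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    def move(self, actor, to, speed=6):
        me = self.combatants[actor]
        if self._distance(me.position, to) > speed:
            self.rejected.append(("move", actor, to))  # guardrail: illegal move
            return False
        me.position = to
        return True

    def attack(self, actor, target, damage, reach=1):
        me, foe = self.combatants[actor], self.combatants.get(target)
        if foe is None or foe.hp <= 0 or self._distance(me.position, foe.position) > reach:
            self.rejected.append(("attack", actor, target))  # guardrail: illegal attack
            return False
        foe.hp = max(0, foe.hp - damage)
        return True


def scripted_policy(state, actor="fighter", target="goblin"):
    """Stand-in for the LLM: picks one tool call from the observed state."""
    me, foe = state[actor]["position"], state[target]["position"]
    if abs(me[0] - foe[0]) + abs(me[1] - foe[1]) > 1:
        return {"tool": "move", "args": {"actor": actor, "to": (foe[0] - 1, foe[1])}}
    return {"tool": "attack", "args": {"actor": actor, "target": target, "damage": 4}}


def run_encounter(engine, policy, target="goblin", max_rounds=10):
    tools = {"move": engine.move, "attack": engine.attack}
    state = engine.observe()
    history = []
    for _ in range(max_rounds):
        if state[target]["hp"] == 0:
            break
        call = policy(state)
        ok = tools[call["tool"]](**call["args"])
        state = engine.observe()  # the new state flows back into the next decision
        history.append((call["tool"], ok))
    return history


engine = Engine({"fighter": Combatant(12, (0, 0)), "goblin": Combatant(7, (4, 0))})
history = run_encounter(engine, scripted_policy)
print(history)  # one move, then two attacks
```

In this sketch the policy never changes the state itself. It can only propose tool calls, and the engine refuses anything the rules do not allow, which is the "guardrail" role described above.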

Which Models, Which Metrics

Several large models were tested – including Claude variants, GPT-4-class models and DeepSeek-V3. Smaller open-source models served as a reference and performed significantly worse. Performance was evaluated along several axes:

Function usage: Does the model use the available tools correctly and efficiently?
Parameter fidelity: Do parameters like target, range, objectives align with D&D rules?
Action quality and tactical optimality: Are the moves sensible, do they use cover, focus on targets?
State tracking: Do the agents correctly keep track of resources, states and positions over many rounds?
Acting quality: Do they stay in role, behave in character and remain narratively consistent?
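
To make the first two axes concrete, here is a minimal sketch of how they could be scored from a log of tool calls. The formulas are illustrative assumptions, not the definitions used in the study: the `ToolCall` record, its two boolean flags and the `score` function are hypothetical, and the sample log is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str
    known_tool: bool    # assumption: did the called tool exist in the engine's API?
    params_legal: bool  # assumption: were target, range etc. legal under the rules?


def rate(flags):
    """Share of True values; 0.0 for an empty log."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def score(calls):
    # Parameter fidelity is only measured on calls that hit a real tool.
    valid = [c for c in calls if c.known_tool]
    return {
        "function_usage": rate(c.known_tool for c in calls),
        "parameter_fidelity": rate(c.params_legal for c in valid),
    }


# Made-up log: three calls to real tools (one with illegal parameters)
# and one call to a tool that does not exist.
calls = [
    ToolCall("attack", True, True),
    ToolCall("move", True, True),
    ToolCall("cast_spell", True, False),
    ToolCall("teleport", False, False),
]
scores = score(calls)
print(scores)
```

Separating the two rates keeps a hallucinated tool name from being counted a second time as a parameter error.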

Popular summaries report that Claude 3.5 Haiku performed best, closely followed by GPT-4, whilst DeepSeek-V3 clearly lagged behind.

The Results: Good, But Not Good Enough

Large closed models show surprisingly high competence in D&D-like, rule-based dialogue and game situations. They can follow rules, choose tactically sensible moves and remain convincingly “in character”. A goblin taunts its opponents, a paladin delivers heroic speeches. The models express their characters’ traits consistently in the way they speak. That’s impressive.

But: smaller open-source models had clear problems delivering stable, consistent simulations. They hallucinated actions, ignored rules, lost track. Over longer play time, all models’ performance visibly declined. The longer and more complex a scenario, the more frequently errors occurred in state tracking, resource management and consistent strategy. That’s the central problem: short scenarios – no problem. Long scenarios – chaos.
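
One simple way to make such a decline visible is to group state-tracking checks by round and compare error rates across windows. The sketch below assumes a hypothetical log of `(round_number, had_error)` pairs; the function name, the window size and the sample data are invented for illustration and are not results from the study.

```python
from collections import defaultdict


def error_rate_by_window(records, window=10):
    """records: iterable of (round_number, had_error), rounds starting at 1."""
    totals = defaultdict(lambda: [0, 0])  # window start -> [errors, checks]
    for round_no, had_error in records:
        start = ((round_no - 1) // window) * window + 1
        totals[start][0] += int(had_error)
        totals[start][1] += 1
    return {start: errs / checks for start, (errs, checks) in sorted(totals.items())}


# Synthetic toy log: 30 rounds, with errors clustering in later rounds.
error_rounds = {9, 14, 18, 22, 25, 27, 29}
records = [(r, r in error_rounds) for r in range(1, 31)]

rates = error_rate_by_window(records)
print(rates)  # keys are the first round of each window
```

A rising rate from one window to the next would correspond to the pattern described here: the agent still acts, but its picture of the game drifts further from the engine's state the longer the session runs.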

Iterative prompting and explicit shared goals – instead of individual goals – improved collaboration and narrative coherence between multiple LLM “players”. But even that helped only to a limited extent. The models can plan, but not over 50 turns. They can cooperate, but not when the context becomes too long. They can follow rules, but only as long as they keep them in view.

Why D&D as AI Benchmark Is Important

D&D forces agents into long-horizon planning, team coordination and strict rule compliance in an open, language-driven space – similar to many real, multi-stage business or negotiation scenarios. The “D&D Agents” benchmark with clearly defined scenarios, tools and metrics creates a reproducible environment in which one can compare prompting strategies, tool policies, memory mechanisms or new agent algorithms.

The researchers see the method as a blueprint to realistically test and improve multi-stage negotiations, cooperation games or business strategies with LLM agents in future. Because what fails at D&D will also fail in real life: if an agent forgets what the goal was after 20 interactions, it’s not suitable for a three-month negotiation process. If it doesn’t understand when to cooperate and when to compete, it’s not suitable for a complex project.

What Remains

The AI plays D&D surprisingly well – for a few turns. Then it loses the thread. That’s the pattern: short tasks – brilliant. Long tasks – problematic. Simple scenarios – no problem. Complex scenarios with many states, actors, goals – errors accumulate.

That’s important because the tasks agentic AI promises to handle are long, complex and multi-stage. An agent that can only think ten steps ahead isn’t an agent, but a chatbot with tools. An agent that forgets what the goal was after 30 minutes isn’t a partner, but a risk.

D&D shows where we stand: impressively far, but not far enough yet. The models can play. But they can’t yet win – not when the game lasts longer. And real life always lasts longer.
