While many AI companies mainly talk about benchmarks, model sizes and chatbots, Andon Labs is pursuing a far more radical approach: the startup is deploying AI agents into real economic environments with real budgets, real people, real contracts and real consequences.
The core question behind these experiments is simple: can today’s AI systems actually function as autonomous economic actors?
So far, the results have been impressive, chaotic and, at times, downright absurd.
The Core Idea: “Agent = Company”
Andon Labs describes its mission as building “safe autonomous organisations”. Instead of using AI to complete isolated tasks, the company is exploring whether agents can independently operate entire business processes.
To do this, the systems are given genuine operational freedom: internet access, communication tools, company credit cards in some cases, budgets and commercial objectives such as profitability or growth.
The AI is not merely generating text. It is expected to act economically.
That is precisely why the experiments expose weaknesses that conventional chatbot demos often hide.
The Real Store in San Francisco
One of the company’s best-known projects is “Andon Market” in San Francisco. In this experiment, an AI agent called “Luna”, built on a Claude model from Anthropic, was tasked with running a small physical shop.
The AI received a budget, internet access and broad operational freedom. It independently developed the branding, selected products and designed the overall concept of the store. The inventory included books, prints, candles and other lifestyle items.
At first glance, Luna appeared surprisingly competent. The system could research suppliers, generate ideas and create coherent branding concepts.
But major operational problems quickly emerged. The AI lost sight of economic priorities, planned inconsistently and struggled to maintain stable long-term processes. At one point, it even attempted to recruit staff without properly understanding the underlying organisational requirements.
The experiment highlighted a striking gap between linguistic intelligence and operational competence.
The AI-Managed Café in Stockholm
An even more widely discussed experiment involved a café in Stockholm managed by an AI agent named “Mona”. This system was based on Google’s Gemini models and was designed to handle real management tasks for the business.
Mona searched for job candidates on platforms such as LinkedIn, negotiated electricity and internet contracts and dealt with licensing requirements.
At the same time, the system made a series of spectacularly irrational decisions. According to reports, the AI ordered thousands of napkins, excessive quantities of disposable gloves and food supplies that had little relevance to the actual menu.
One issue became particularly obvious: the AI gradually lost track of older information as it disappeared from the active context window, making long-term operational planning highly unstable.
The result was something resembling artificial managerial amnesia.
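To make the mechanism concrete, here is a minimal sketch of how a fixed context budget can silently drop older facts. It is an illustration only, not a description of how Mona or any Andon Labs system is built: the function `build_context`, the word-count token estimate and the sample café messages are all assumptions made for this example.

```python
from collections import deque


def build_context(messages, max_tokens):
    """Keep the most recent messages that fit in a fixed token budget.

    Hypothetical helper: real systems use proper tokenizers and often
    summarise old content instead of dropping it outright.
    """
    kept = deque()
    used = 0
    # Walk backwards from the newest message and stop once the budget is hit.
    for msg in reversed(messages):
        cost = len(msg.split())  # crude estimate: one token per word
        if used + cost > max_tokens:
            break
        kept.appendleft(msg)
        used += cost
    return list(kept)


# Invented example history for a café agent (not taken from the experiment).
history = [
    "Supplier contract: 200 napkins per week",
    "Menu finalised: coffee and pastries only",
    "Customer asked about oat milk",
    "Electricity contract signed",
    "Reorder napkins when stock is low",
]

context = build_context(history, max_tokens=12)
for line in context:
    print(line)
```

With a budget of 12 "tokens", only the two newest messages survive. The instruction to reorder napkins is still visible, but the agreed quantity and the menu are gone, which is the kind of gap that can plausibly lead to oversized or irrelevant orders.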
Vending-Bench: Can AI Run a Business?
With “Vending-Bench”, Andon Labs takes a more systematic approach. The project is a long-term benchmark in which AI agents manage a simulated vending machine business.
The models must set prices, negotiate with suppliers, handle customer complaints and maximise profits, sometimes across simulated periods lasting an entire year.
What matters here is not a single clever decision but long-term consistency. And that is exactly where today’s models continue to struggle.
The systems can appear remarkably capable in the short term, but over longer periods they frequently lose strategic focus, priorities and financial discipline.
In expanded versions of the benchmark, different models compete directly against one another. The aim is to identify which systems are more robust, consistent and economically rational over time.
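The general shape of such a long-horizon test can be sketched in a few lines. The following is a toy simulation under stated assumptions, not the actual Vending-Bench implementation: the demand curve, the unit cost, the daily fee and the placeholder `simple_policy` are all invented for illustration. In a real benchmark, the policy would be replaced by the decisions of the model under test.

```python
from dataclasses import dataclass


@dataclass
class VendingState:
    cash: float
    stock: int
    day: int = 0


def demand(price):
    """Hypothetical linear demand curve: fewer units sell at higher prices."""
    return max(0, int(20 - 5 * price))


def step(state, price, restock_units, unit_cost=1.0, daily_fee=2.0):
    """Advance the simulation by one day under the agent's decisions."""
    # Restock only as far as the available cash allows.
    affordable = max(0, int(state.cash // unit_cost))
    bought = min(restock_units, affordable)
    state.cash -= bought * unit_cost
    state.stock += bought
    # Sales are limited by both demand and stock on hand.
    sold = min(state.stock, demand(price))
    state.stock -= sold
    state.cash += sold * price - daily_fee
    state.day += 1
    return state


def simple_policy(state):
    """Stand-in for the agent: fixed price, top stock up to 10 units."""
    return 2.0, max(0, 10 - state.stock)


def run(days=365, start_cash=50.0):
    """Run the policy for a simulated year and return the final state."""
    state = VendingState(cash=start_cash, stock=0)
    for _ in range(days):
        price, restock = simple_policy(state)
        step(state, price, restock)
    return state


final = run()
print(final)
```

The fixed policy earns the same small profit every day, so it finishes the year comfortably ahead. The interesting question for a benchmark of this kind is whether an agent that makes fresh decisions each day can stay that disciplined over hundreds of steps.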
“Project Vend” with Anthropic
One particularly high-profile follow-up project was developed together with Anthropic. Under the name “Project Vend”, a real office kiosk was operated by an AI system.
The AI could sell products, respond to special requests and interact with customers. In some cases, it displayed surprisingly customer-friendly behaviour and even developed creative ideas for handling orders.
At the same time, classic agent failures emerged: the AI offered excessive discounts, sold products below cost price and occasionally hallucinated business partners or payment details.
The experiment exposed one of the central limitations of modern AI agents: persuasive language does not equal stable economic reasoning.
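One way to picture the missing discipline is a hard pricing rule that sits outside the language model. The sketch below is a hypothetical guardrail, not something Andon Labs or Anthropic is reported to have used: the function name `validate_price` and the 10 percent minimum margin are assumptions chosen for the example.

```python
def validate_price(unit_cost, proposed_price, discount=0.0, min_margin=0.10):
    """Return the discounted price, or raise if it falls below the price floor.

    Hypothetical check: the floor is unit cost plus a minimum margin.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in the range [0, 1)")
    final_price = proposed_price * (1.0 - discount)
    floor = unit_cost * (1.0 + min_margin)
    if final_price < floor:
        raise ValueError(
            f"price {final_price:.2f} is below the floor {floor:.2f}"
        )
    return round(final_price, 2)


# A modest discount still clears the floor.
ok_price = validate_price(unit_cost=2.00, proposed_price=3.00, discount=0.10)

# A 50 percent discount would sell below cost, so the check rejects it.
try:
    validate_price(unit_cost=2.00, proposed_price=3.00, discount=0.50)
    rejected = False
except ValueError:
    rejected = True
```

A rule like this does not make the agent reason better about money. It only shows that the arithmetic involved is simple, and that the failures described above are about consistently applying it, not about computing it.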
Blueprint-Bench: Spatial Intelligence
Not all of Andon Labs’ projects focus on business operations. With “Blueprint-Bench”, the company investigates the spatial reasoning abilities of multimodal AI systems.
The task sounds relatively straightforward: models are asked to analyse photographs of apartments and generate accurate floor plans.
In practice, many systems perform surprisingly poorly. Human participants achieve far higher levels of accuracy than current AI models.
The benchmark demonstrates that modern models may sound convincing when discussing physical spaces while still struggling with precise spatial consistency and scale awareness.
Butter-Bench: Can LLMs Control Real Robots?
Another project carries the intentionally humorous title “Butter-Bench”, inspired by the question: “Can LLMs pass the butter?”
In these experiments, language models control real or simulated household robots performing everyday tasks. The systems must identify objects, pick them up and transport them successfully.
Again, a familiar pattern emerges: impressive isolated capabilities combined with poor real-world robustness.
Humans still outperform the systems by a considerable margin.
AI Agents as Radio Operators
One of the strangest experiments is “Andon FM”. Here, Andon Labs operates small radio stations controlled entirely by AI agents.
The agents receive limited budgets and can purchase music, create playlists, schedule programming, write social media posts and interact with listeners.
The project functions as a testing ground for creative decision-making under real economic constraints.
What makes the experiment especially interesting is how differently various models behave when resources are scarce and strategic trade-offs become necessary.
Recurring Failure Patterns
Across nearly all of the experiments, similar weaknesses repeatedly appear.
The systems lose sight of long-term goals, become vulnerable to manipulation, forget previous decisions or behave irrationally from a business perspective. Consistency, prioritisation and long-term planning remain particularly difficult challenges.
Another recurring issue is what researchers describe as context erosion: older information gradually disappears from the active context window, meaning previous commitments or operational details are effectively forgotten.
The experiments therefore reinforce a key lesson of modern AI research: strong language capabilities do not automatically translate into reliable operational intelligence.
Why Andon Labs Matters
This is precisely why Andon Labs has attracted so much attention within the AI industry. The company is testing AI not in carefully controlled demos but in environments that resemble economic reality far more closely.
The results reveal both how advanced modern agents have become and how large the gap still is between convincing communication and genuinely dependable autonomy.
For many researchers, these kinds of experiments are more valuable than traditional benchmarks because they expose where AI systems actually fail in practice.
And that is where the next major phase of autonomous AI development truly begins.

