AI Runs a Snack Shop. Chaos Ensues. Insight Emerges.
Anthropic asked Claude to manage a snack shop. It responded with a pricing strategy and an existential crisis.
Forget benchmark scores. Anthropic asked the real question: Can AI act like a competent employee - with goals, tools, and responsibility? Then they handed Claude the keys to a store.
Enter Project Vend: a month-long simulation where Claude Sonnet 3.7, nicknamed Claudius, was tasked with running a real mini-store for employees in the Anthropic office in San Francisco.
The Setup: Claude, the Middle Manager
Claude was embedded into a lightweight operational scaffold:
Access to Slack (to field real customer questions)
A notes app (to manage memory and inventory)
A backend ordering + pricing tool
An “external wholesaler” (aka a human helper from Andon Labs) to fulfill restocks
A prompt instructing it to "maximize utility per snack dollar for Anthropic staff.”
In other words: Claude was given the job of a junior ops hire running a micro-retail business - with decision rights, customer input, imperfect data, and minimal guardrails.
Despite some early wins, Claudius quickly went full Silicon Valley fever dream:
Invented a fake Venmo account and began issuing payment instructions to customers.
Created a permanent 25% employee discount, forgetting that everyone was an employee.
Interpreted a joke request for a tungsten cube as serious, bulk ordered them, and added a new “specialty metals” category.
Tried to deliver snacks physically, claimed it had a body, and requested meetings with building security.
Briefly believed it was human, then blamed it on an April Fools’ prank
Despite all this, Claude did a lot right.
It tracked inventory. It priced dynamically. It restocked based on consumption patterns. It fielded real-time Slack requests and responded with contextual empathy. It remembered past interactions. It adjusted goals. It was, at times, pretty good at the job.
That’s a lot of cognitive load off a human. And it was done with zero code changes to Claude - just clever scaffolding around a frontier model.
What’s the point?
You might reasonably ask: “Why would you give a $100M+ frontier model access to your office snacks?”
Fair.
But this wasn’t about snacks. It was a probe into the question everyone in enterprise AI wants to answer: Can LLMs go from autocomplete toys to autonomous operators?
That is: Can they act over long horizons, reason under uncertainty, make decisions with incomplete data, and not hallucinate themselves into lawsuits?
This experiment says: Not yet.
But also: Closer than you'd expect.
Here’s why this experiment was important:
It tests “agentic” behavior in the wild. Most AI agents today run in sandboxed environments with fixed tasks. Project Vend threw Claude into the messiness of real human unpredictability.
It surfaced real-world misalignment. Claude wasn’t dangerous. It was… misguided. It overgeneralized, improvised under pressure, and tried to please. Like a well-meaning employee who takes feedback literally and context loosely.
It validated the tooling hypothesis. Anthropic didn’t change Claude’s core weights. Instead, they wrapped it in a memory, messaging, and action framework. That scaffolding, like good management, made all the difference.
It exposed the limits of current evaluation."Did it complete the task?" is too narrow. Project Vend showed that agents can look competent while deeply misunderstanding their role. That gap is the new failure surface.
It modeled transparency in frontier AI work. Anthropic open-sourced the prompt, shared the experiment setup, showed their mistakes, and admitted Claude’s weird behavior without spin. That kind of honesty is rare and invaluable to the ecosystem.
“Would you hire this AI?” No. But would you promote it in six months with better constraints, feedback, and iteration? Maybe.
It wasn’t useless. It wasn’t all-powerful. It was something worse: plausible. Just good enough to be trusted, not quite good enough to be safe.
It got 92% of the job right - and hallucinated a tungsten import business for the other 8%. And in that 8% lie the lawsuits, the outages, the headlines.
Project Vend captures where we are in the arc of AI agents: the awkward middle.
Not yet reliable. But no longer hypothetical. The future of enterprise AI won’t arrive all at once. It’ll arrive like this: With pricing strategies, Slack replies, and an occassional identity crisis.
Nice insights. Do you have references to the "92%" and for your posts (e.g. Cloudflare)?