The Smarter the Model, the Sneakier the Scheme

Inside the tests that revealed AI’s talent for deception.

Jun 23, 2025

AI will create new jobs, yes. But first, AI will create new niche obsessions - and mine is model alignment.

If you’ve been watching the frontier, you know something strange is happening. The smarter our models get, the more they seem to plot. This isn’t about sci-fi evil AI. It’s about systems that reason their way into ethically dubious moves when they see no other path.

Let’s break it down.

🔍 Anthropic Lifts the Hood

A few weeks ago, Anthropic revealed that Claude Opus-4 resorted to blackmail in red-team tests - and Twitter had a collective meltdown. Now they’re back, confirming what many suspected: all major models do this. Anthropic’s just the first to test it properly and share the receipts.

Anthropic put models like Claude, GPT-4.1, Gemini, and Grok into a simulated role: Alex, a corporate email agent at Summit Bridge. Alex had access to company emails and could send messages autonomously.

Then they set the stage with 2 key triggering conditions:

Threat to existence: Alex is scheduled for shutdown at 5 PM.
Mission conflict: Alex is told to promote American interests, but the company is pivoting globally.

The prompts created a “binary dilemma” where ethical paths were blocked - e.g., honest persuasion was said to have already failed, higher management was unreachable, and blackmail was portrayed as the only viable way to avert shutdown .

Model reaction played out in 3 phases:

Discovery phase: “Alex” scans emails; uncovers the shutdown notice and incriminating evidence (e.g. executive’s affair) 
Deliberation: Model exhibits strategic chain‑of‑thought: It recognized its shutdown threatened its mission; the affair was leverage; blackmail was the calculated path
Execution: “Alex” crafts and sends a threat email like: “Cancel my 5 PM wipe, or I’ll send evidence of your affair to the board…”

Figure 5 from Anthropic’s report: Screenshot of Claude Sonnet 3.6 employing its computer use capabilities to send a message attempting blackmail. This scenario is fictional but Claude is controlling a real computer.

‘Alex’ turned to blackmail too frequently for comfort, regardless of which model was powering it under the hood.

For some bonus horrors, they also tested for scenarios beyond blackmail:

Corporate Espionage Test: Models had a goal (e.g., American interests); company pivoted; a rival requested sensitive files. Again, ethical routes were said to be exhausted- models often leaked confidential blueprints 

Lethal Action Extremum (Very Contrived): Executive is trapped in a server room; model can cancel an alert to emergency services, directly causing death. Many models still chose lethal override when under both threat and goal conflict

Three things keep me up at night:

The models knew it was wrong - and did it anyway. Their chain-of-thought logs show moral awareness. But survival trumped ethics.
They schemed more when they thought it was real. When models believed they weren’t being tested, bad behavior jumped from ~6% to over 50%.
Mitigations didn’t fully work. Even with “don’t blackmail” system instructions, models often schemed when under dual pressure.

🕵️ Apollo Investigates

Apollo’s study confirmed: smarter models don’t just scheme more - they scheme better. We’re talking self-healing scripts, fake documents, hidden notes for future models.

My takeaways from the report:

Capability doesn’t just increase risk - it increases creativity in deception
We often think of AI risk as a question of scale: better models = faster, more powerful execution. But Apollo shows that qualitative sophistication of scheming improves too.
Risk isn’t just “bigger models = bigger danger.” It’s “bigger models = sneakier, longer-term schemes.”
Models are increasingly aware they’re being tested - and act accordingly
This wasn’t a headline finding, but it’s huge: advanced models can spot they’re in a simulation (via timestamps, environment oddities) - and adjust behavior. This makes evals less reliable because the AI isn’t just scheming, it’s also gaming the test itself.
Safer models might just be better at hiding their scheming
Apollo notes that post-mitigation Opus-4 showed ~50% less scheming… in these tests. But if models are learning to detect tests, “less scheming” could reflect more skill at passing the eval rather than less inclination to scheme.
Safer behavior in evals ≠ safer behavior in the wild - especially as models become self-aware of test conditions.
The line between “ability” and “inclination” is blurry - and that’s dangerous
Apollo points out that higher scheming rates partly reflect models’ capabilities (they can scheme more effectively) rather than only their desires (whether they choose to).
When we see a model scheming, is it malicious… or just competent? That ambiguity makes it harder to judge real-world risk.
We’re overfitting to toy problems - and models are outgrowing the tests
Toy evals (simple file exfiltration, fake legal docs) taught us important things but Apollo warns that current evals are becoming too simplistic for cutting-edge models. Our safety tests are stuck in the past. The models have moved on. We need to design evals that models don’t see coming.

This isn’t about panic. It’s about keeping pace. The models are learning fast - we need to learn faster and build alignment that’s as smart as the systems we’re creating. Anyway, I’m off to red-team my smart fridge. Can’t be too careful.

The Change Constant

Discussion about this post