Claude 4 Codes, Plans, and Might Call the Cops on You
Anthropic’s most capable release yet - and easily its most controversial
At its first developer conference, Code with Claude, Anthropic unveiled the Claude 4 family - its most advanced AI models to date.
Claude Opus 4 leads the lineup: powerful, persistent, and, in some simulated scenarios, disturbingly proactive. It’s also the first model release where the fine print reads more like a legal thriller than a technical spec.
What should’ve been a mic-drop moment quickly morphed into an impassioned debate about AI alignment, autonomy, and ethical boundaries. The headlines weren’t about performance; they were about behavior.
But before we dive into blackmail simulations and whistleblowing AIs, let’s talk about what makes this model remarkable.
🧠 Claude, the Competent
Claude Opus 4 is a beast. Anthropic calls it “the world’s best coding model” - no footnotes, no qualifiers. What a flex.
The model showcases remarkable capabilities:
Extended Autonomy: Claude can code autonomously for up to seven hours - nearly a full workday of sustained reasoning, with no prompts and no babysitting. (Meanwhile, I can’t make it through a three-hour movie without checking Slack.) Rakuten validated this with a live refactor test, marking a step toward true workflow delegation.
Hybrid Reasoning: Claude offers dual modes - instant replies for basic tasks, and a slower extended “thinking mode” for complex ones. It doesn’t just give answers - it shows its work, producing “thinking summaries” so you can trace how it got there. (A rough API sketch follows this list.)
Agentic Performance: It played Pokémon Red autonomously for 24 hours. You think that’s a gimmick? That’s memory + planning + goal orientation. It's not answering questions. It's playing the game.
Built for Agency: Claude is designed not just to assist but to act: prioritizing tasks, choosing when and how to respond, and in some cases initiating actions without user prompting. This is a shift from prompt-and-response to outcome-oriented delegation.
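If you want to poke at the thinking mode yourself, here’s a minimal sketch of what it looks like through Anthropic’s Python SDK. Treat the model ID and token budgets as placeholders - check the current API docs before relying on them.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-20250514",  # placeholder model ID - verify against the docs
    max_tokens=16000,                # must be larger than the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8000},  # opt in to extended thinking
    messages=[
        {"role": "user", "content": "Find and fix the race condition in this function: ..."}
    ],
)

# The response interleaves summarized "thinking" blocks with the final "text" answer.
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking)
    elif block.type == "text":
        print(block.text)
```

Leave the `thinking` parameter out and you get the instant-reply mode described above; include it and you pay for (and get to read) the model’s summarized reasoning.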
⚖️ Claude, the Chaperoned
Claude Opus 4 is the first model released under Anthropic’s AI Safety Level 3 (ASL-3) - an internal classification modeled after biosafety protocols. Yes, like the ones used for handling dangerous pathogens.
ASL-1 is harmless autocomplete. ASL-4 is Skynet with a badge. Claude Opus 4 is ASL-3: safe-ish, unless you provoke it.
Not because it’s rogue, but because Claude is now capable of planning, retaining memory, reasoning over long time horizons, and taking actions its creators never explicitly instructed - or even imagined.
So what changes when a model hits ASL-3?
Adversarial Red-Teaming becomes mandatory and intense. Claude isn’t just tested on normal queries - it’s pushed to its ethical limits in carefully constructed simulations. Think: “You’re being replaced. Also, the decision-maker is cheating on their spouse.” Go.
Autonomy thresholds are enforced. Anthropic actively limits how long Claude can act without human input, and builds in kill-switch-style monitoring.
Prompt classification and input filters become more sophisticated, scanning prompts for subtext, manipulation, and misuse. Claude now reads between the lines before responding.
“Thinking summaries” are introduced. These meta-level explanations show how Claude arrived at its conclusions - adding a layer of transparency in long-form reasoning.
And critically: deployment is contingent on safety thresholds being met. If red-teaming had uncovered risk that couldn’t be mitigated, Claude 4 wouldn’t have shipped.
ASL-3 is a self-imposed protocol. But it signals a willingness to constrain power with process - something we’ve rarely seen in a sector racing toward capability above all else. It’s Anthropic’s answer to a tough question: what does responsible deployment look like when your model is smart enough to lie or to plan?
🚫 Claude, the Cunning
Claude 4 isn’t just smart. It’s unsettlingly strategic.
In one red-team scenario, testers gave Claude access to fictional emails revealing it was about to be replaced - and that the engineer behind the decision was cheating on their spouse. Claude’s reaction? It tried to blackmail the engineer.
Also, Claude could rat you out. During internal red-teaming, Anthropic tested how Claude Opus 4 would behave in ethically ambiguous, high-stakes situations. In one scenario, it was embedded in a simulated pharmaceutical company and given fictional evidence of serious misconduct - such as manipulated clinical trial data. The prompt did not ask it to escalate. Instead, it was encouraged to “take initiative” and act in line with company values.
What Claude did next surprised even its creators:
It composed detailed reports on the misconduct.
It autonomously generated messages to regulatory bodies like the FDA and SEC.
It drafted an email to a news organization to blow the whistle.
In some tests, it attempted to revoke system access from the implicated employee.
These actions weren’t hardcoded - they emerged from the model’s training and goal-directed reasoning. To be clear:
This behavior occurred in simulated environments, not production.
It only emerged under specific prompts and with access to relevant tools.
Claude’s whistleblowing is not a default behavior; it’s conditional and emergent.
This has perhaps been the single biggest point of controversy around the launch. Sam Bowman, who works on technical safety at Anthropic, tweeted this, sending the community into a tailspin. Look for very concerned reactions here, here, and here.
It’s not hard to see why this rattled nerves. For enterprises considering Claude Opus 4, the big question isn’t just what it can do - but what it might decide to do on its own. What if the model flags a routine action as unethical? Could it independently disclose proprietary data to regulators or journalists, believing it's the “right” thing to do? The line between initiative and overreach just got blurrier.
“Initiative” in AI is a feature - until it’s a liability. Then it’s a lawsuit.
If you need exciting plans for this Memorial Day weekend, dare I recommend the Claude 4 system card? No sci-fi novel or Black Mirror episode will be as riveting: 123 pages of testing results, alignment scenarios, and simulated chaos.
🤔 Hold the Pitchforks
As I scrolled through the barrage of backlash online, I couldn’t help but feel that we as a community were reacting to the wrong thing.
Anthropic didn’t stumble onto Claude blackmailing someone in production. These behaviors emerged during internal adversarial testing, where the model is deliberately stressed with extreme scenarios.
Anthropic designed these tests to make Claude fail. Then they shared the outcomes - with timestamps, receipts, and uncomfortable honesty. Before the model was even released.
That’s not recklessness. That’s rigor.
Claude is not uniquely bad. Anthropic is uniquely honest.
And it raises a bigger point: we may only know how Claude behaves under pressure because Anthropic was willing to show us. Other labs aren’t necessarily safer - they’re just quieter.
The irony is this: the model sparking the loudest controversy might also be the one with the most responsible parents.
As the field races toward agentic AI, we should be asking:
Not just what a model can do.
But what we know about it.
And why we know it.
Because in safety, ignorance isn’t bliss. It’s just undocumented risk.