Apple, the Armchair Critic at the AI Olympics
A timely, well-reasoned critique of LLM limitations, paired with a bigger question: what is Apple building instead?
Last week, Apple released a research paper called The Illusion of Thinking - a well-constructed critique of chain-of-thought (CoT) reasoning in large language models.
It’s, in fairness, a technically solid paper. It’s also, in only slightly less fairness, giving strong “guy on the couch yelling at Olympic athletes” energy. Apple, a company that has publicly released no foundation model, contributed little to open AI research, and is widely seen as lagging in generative AI, has now shown up to critique everyone else’s approach to reasoning.
But before critiquing the messenger, let’s consider the message.
🧪 The Study
Apple’s team built synthetic puzzle environments that gradually increased in logical complexity. The idea: test how well models reason as complexity scales (a rough sketch of the protocol follows below). They evaluated three things:
Final-answer accuracy
Coherence of intermediate reasoning
Whether models tried harder when tasks got harder
Spoiler: they didn’t.
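To make the setup concrete, here is a minimal, hypothetical sketch of the kind of complexity sweep the paper describes. Everything here is an assumption for illustration: `call_model` and `make_puzzle` are placeholder callables you would supply, not Apple’s actual harness.

```python
# Hypothetical evaluation sweep: increase puzzle complexity, record final-answer
# accuracy and how much chain-of-thought the model actually produced.
# The callables are placeholders, not Apple's code.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trial:
    complexity: int        # logical depth of the puzzle (e.g. number of disks)
    correct: bool          # final-answer accuracy
    reasoning_tokens: int  # crude proxy for "thinking effort"

def sweep(call_model: Callable[[str], tuple[str, str]],   # prompt -> (answer, cot_trace)
          make_puzzle: Callable[[int], tuple[str, str]],  # depth  -> (prompt, expected)
          depths=range(1, 13),
          trials_per_depth=10) -> list[Trial]:
    results = []
    for n in depths:
        for _ in range(trials_per_depth):
            prompt, expected = make_puzzle(n)
            answer, cot = call_model(prompt)
            results.append(Trial(n, answer.strip() == expected.strip(),
                                  len(cot.split())))
    return results
```

Plot accuracy and reasoning tokens against depth and you get the regimes described below.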
Here’s what they found:
Collapse at high complexity. Beyond a certain logical depth, even the best LRMs (large reasoning models) fail completely. Intriguingly, their chain-of-thought effort actually decreases with complexity, despite having tokens left to "think". Like a student who quietly panics when the test is too hard, then submits a blank page to conserve dignity.
They find three regimes:
Low complexity. Standard LLMs outperform LRMs (no chain-of-thought overhead).
Medium complexity. LRMs gain a clear edge using the chain-of-thought strategy.
High complexity. Both model types collapse, failing entirely.
Flawed reasoning despite algorithms. Models don't reliably apply explicit algorithms, even when given step-by-step instructions. Instead, they pattern-match superficially, bungling tasks that require true stepwise logic.
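For a sense of what “explicit algorithm” means here: Tower of Hanoi, one of the puzzle families the paper uses, has a textbook recursive solution that can be handed to a model verbatim. The sketch below is that standard algorithm (my code, not Apple’s prompt), and it also shows why logical depth explodes: every extra disk doubles the required moves.

```python
# Standard recursive Tower of Hanoi solver - the kind of explicit, stepwise
# procedure the paper reports models failing to execute even when given it.
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the full move sequence for n disks; its length is always 2**n - 1."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # park the top n-1 disks on the spare peg
    moves.append((source, target))              # move the largest disk to the target
    hanoi(n - 1, spare, target, source, moves)  # stack the n-1 disks back on top of it
    return moves

# Every extra disk doubles the minimal solution length, so "a bit more complexity"
# quickly becomes thousands of exactly-ordered steps.
for n in (3, 7, 10, 15):
    print(n, "disks ->", len(hanoi(n)), "moves")   # 7, 127, 1023, 32767 moves
```

Executing tens of thousands of moves without a single ordering mistake is exactly the regime where pattern-matching, rather than algorithm-following, gets exposed.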
TL;DR. Apple demonstrates that chain-of-thought doesn't scale like human reasoning. LRMs help only on moderately difficult tasks - then crumble. The “thinking” is an illusion overlaid on statistical mimicry. It’s a smart, well-structured critique. The experimental setup is elegant. And they’re not wrong: pattern completion ≠ stepwise reasoning.
Three Thoughts I Had While Reading Apple’s AI Paper
(1) “Is this working?” is a more practical question than “Is this reasoning?”
CoT looks like thinking, but doesn’t behave like it. That’s true - but also maybe beside the point. Perhaps the question isn’t whether CoT meets some platonic ideal of logic, but whether it performs well enough to be useful, improvable, and composable. The systems being critiqued are already:
Writing production-grade code
Translating dense medical language for real patients
Powering workflows in design, research, and accessibility
And yes, serving as the foundation for whatever Apple ends up calling “AI” at WWDC
If that’s an illusion, it’s a productive one. CoT may be a shaky facsimile of logic - but it’s also a working scaffold. And scaffolds, in engineering and in AI, are how you reach higher.
The framing implies LLMs should reason like humans but don’t, without asking if that’s a meaningful or achievable comparison for autoregressive models. When you prompt a language model to “reason,” you’re asking it to simulate a structure that was never embedded, only implied. That it works at all is miraculous. That it breaks down under recursion is expected. You don’t fault a mirror for not becoming the thing it reflects.
(2) Reasoning is a system-level property, not a model-level one.
Apple evaluates isolated model behavior. But today’s capabilities are increasingly systemic - emerging from how models interact with tools, APIs, memory, and control loops. Real-world reasoning doesn’t happen in one pass. It’s iterative, compositional, and scaffolded. The best-performing AI agents today rely on a stack of scaffolding, sketched after this list:
Function calling
External scratchpads
Step verification
Memory retrieval
Multi-agent collaboration
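Here is a hedged sketch of what that system-level loop looks like. The function names (`propose_step`, `verify_step`, the `tools` dict) are hypothetical placeholders, not any particular framework’s API.

```python
# Illustrative agent loop: the model proposes a step, a tool executes it, a checker
# verifies it, and a scratchpad carries state between iterations. The "reasoning"
# lives in the loop, not in any single forward pass.
def run_agent(task, propose_step, tools, verify_step, max_steps=20):
    scratchpad = []                                  # external memory the model can re-read
    for _ in range(max_steps):
        step = propose_step(task, scratchpad)        # model call: "what should happen next?"
        if step["action"] == "finish":
            return step["answer"]
        result = tools[step["action"]](**step["args"])   # function calling / tool use
        ok = verify_step(step, result)                   # independent step verification
        scratchpad.append({"step": step, "result": result, "ok": ok})
        # a rejected step stays visible in the scratchpad, so the next proposal can correct it
    return None   # the system, not a single chain-of-thought, decides when to give up
```

Measured on the model alone, none of this scaffolding exists; measured on the system, it’s where much of the “reasoning” actually happens.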
Criticizing CoT for not producing logic is like criticizing a map for not being the terrain. It's not supposed to be.
(3) This isn’t just a research paper. It’s a strategy breadcrumb.
The timing is too poetic to be accidental. Apple - absent from model releases, eval benchmarks, and open research - fires a shot across the bow the week before WWDC.
The message: scale-first LLMs are flawed. Chain-of-thought is an illusion. Read between the lines, and you get a preview of Apple’s AI thesis:
Smaller, privacy-respecting models
On-device inference
Hybrid symbolic-neural systems
Intelligence designed for local context, not global scale
In short: not scale, but structure.
Could this be a theme we see reflected at WWDC this week? This paper, then, isn’t just academic commentary - it’s strategic narrative: “We’re not late to the AI race. We’re playing a smarter game.” Maybe they are. But if so, don’t just point out the problem. Ship the solution.
The paper reads like a philosophical justification for Apple’s strategic restraint. An argument against scale, against messiness, against perceived hype. A case for on-device models, private reasoning, symbolic grounding.
Yes, CoT has limits. But it’s gotten us far. And rather than dismiss it as illusion, we’d do better to see it as an evolutionary bridge - one layer in the journey from mimicry to mechanism.
The hard work now isn’t pointing out what doesn’t work. It’s designing what comes next. And for that, Apple, as much as anyone, needs to be in the arena.