Humanoid Robot Summer is here
Amazon’s delivery bots are a sign of things to come: as robots learn like language models, labor begins to scale like code.
Amazon is building a humanoid robot obstacle course.
This is not a metaphor. The Information reports that they’ve literally installed a kind of jungle gym - a “humanoid park” - at one of their offices in San Francisco, where humanoid robots will rehearse the art of stepping out of an electric Rivian van and delivering a package to your front door. Eventually, these robots will be sent out into the world, where they will - if all goes well - not fall down a flight of stairs or mistake your dog for a mailbox. If that sounds like sci-fi cosplay, that’s because all tech revolutions do - right until they go mainstream.
We’re used to AI evolving in the cloud - thinking faster, writing better, coding longer. But something bigger is happening: the intelligence is stepping out of the server rack and into the world.
The Big Problem with Robotics
Historically, the big problem with robots wasn’t hardware or imagination - it was data. More specifically, the lack of scalable ways to teach robots to do anything useful.
If you wanted a robot to pick up a box or screw in a bolt, you didn’t train it. You programmed it. You gave it a precise set of instructions, in a tightly controlled environment, with sensors and fail-safes and conveyor belts, and you prayed that nobody moved the wrench two inches to the left.
This worked fine for factories, which are designed to be predictable. But if you asked that same robot to, say, open a fridge or pour a cup of coffee, it would break your kitchen. Or itself. Or both.
The problem was: there’s no web-scale corpus for touch. There’s no dataset of “how hard to grip an egg” or “how to angle your wrist to tie a shoelace.” The only way to teach robots these things was through painstaking demonstration - teleoperating robot arms for thousands of hours while they dropped shirts and crumpled sandwich bags. It was slow. It was fragile. It didn’t generalize. Every robot was a special case, and every skill was a small crisis. This is not a great recipe for scale.
Then Came The Foundation Models
Now, however, we’re in what you might call the “foundation model for movement” era of robotics. Or, to use the technical term: the “we think the robots might be figuring it out” era.
The idea is that if you train a big enough model on a wide enough variety of tasks - folding laundry, opening jars, playing ping pong, etc. - across a wide enough variety of robots, that model will stop needing specific instructions. It will just… know what to do. You can tell it: “put the red cup next to the apple” and it will figure out how to move its fingers and arms and body to make that happen, even if the table is slightly different, or the cup is plastic, or the robot has never used that particular arm before.
This is called generalization, and it is more or less the holy grail of robotics. Because once you have it, you don’t need to train a new robot from scratch every time it has to do a new thing. You just fine-tune a shared model. And once robots can generalize, they can learn from each other. And once they can do that, you can start copy-pasting physical skill like it’s software.
Which, again, is weird and terrifying and deeply impressive. Also, of course Amazon wants in on this.
Understanding the Unlock
Here’s the real innovation:
Instead of programming motion, we show it (via human teleoperation or demos) and let robots learn the policy.
Instead of training from scratch, we fine-tune a general model that’s seen thousands of tasks.
Instead of learning one task on one robot, we start to build models that generalize across tasks, forms, and contexts.
This means:
A robot trained to fold shirts can help another robot learn to bag groceries.
A grasping policy learned in simulation can bootstrap real-world dexterity.
A humanoid that learns to open a door in one house can adapt to new doorknobs it’s never seen.
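To make the “show it, don’t program it” idea concrete, here’s a toy sketch of behavior cloning, the simplest form of imitation learning: record (state, action) pairs from a demonstrator, then fit a policy to predict the demonstrator’s action. Everything below is invented for illustration - a one-dimensional “gripper” and a synthetic demonstrator, fit with a linear model. Real labs fit large neural networks to camera images and joint torques, not two numbers.

```python
import numpy as np

# Toy behavior cloning: instead of programming motion, we record
# demonstrations and fit a policy that imitates them.
# Here the "demonstrator" nudges a 1-D gripper halfway toward a target.
rng = np.random.default_rng(0)

# Synthetic demos: state = [position, target], action = demonstrator's step
states = rng.uniform(-1, 1, size=(500, 2))
actions = 0.5 * (states[:, 1] - states[:, 0])   # the demonstrator's (hidden) rule

# "Learn the policy": least-squares fit of action = states @ w
w, *_ = np.linalg.lstsq(states, actions, rcond=None)

def policy(position, target):
    """Cloned policy: predicts the demonstrator's action for a new state."""
    return np.array([position, target]) @ w

# The cloned policy handles states that never appeared in the demos
step = policy(position=0.2, target=0.8)
print(round(float(step), 3))  # prints 0.3 - half the remaining distance
```

The point of the toy: nobody wrote “move halfway toward the target” into the robot. The rule was recovered from demonstrations, which is exactly the shift from programming motion to learning it.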
Every major player is chasing generalization - but with different strategies:
Imitation Learning (human teleoperation at scale). Projects like Google DeepMind’s ALOHA collect thousands of successful demos (e.g., folding shirts, unzipping bags) from humans operating robot arms. These are distilled into reusable policies.
Pros: high-quality data, fast learning
Cons: labor-intensive, doesn’t scale easily
Reinforcement Learning (trial and error in simulation). NVIDIA, OpenAI, and others simulate robots doing billions of micro-movements to discover what works (like spinning a pen or solving a Rubik’s cube).
Pros: scales with compute, fast iteration
Cons: the real-world gap - sim ≠ real, and physical quirks don’t transfer perfectly
Embodiment-Agnostic Models (train once, deploy everywhere). Some startups are building models that control multiple robot forms - hands, humanoids, arms - with one shared policy.
Pros: highly transferable
Cons: still early, needs massive multimodal training data
Multimodal Foundation Models. Pairing robot control with language and vision models (like GPT-4 plus cameras) so robots can reason like humans - “Pick up the red cup next to the apple.”
Pros: flexible reasoning, real-world instruction following
Cons: hard to ground in real-time motion; latency and coordination are tricky
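For flavor, the trial-and-error loop at the heart of the reinforcement learning approach can be caricatured in a few lines: perturb, score, keep the best. The “egg” physics below are made up, and real systems run massively parallel simulators over billions of rollouts rather than a 1-D random search, but the shape of the loop is the same.

```python
import random

def reward(force):
    # Simulated "egg": the ideal grip is about 3.5 N.
    # Too little and it slips; too much and it cracks. (Numbers invented.)
    return -abs(force - 3.5)

random.seed(0)
force = 0.0              # start with no idea how hard to squeeze
best = reward(force)

for _ in range(1000):    # billions of rollouts, in spirit
    candidate = force + random.gauss(0, 0.5)  # try a small variation
    r = reward(candidate)
    if r > best:                              # keep whatever scored better
        force, best = candidate, r

# Converges near the ideal 3.5 N grip without anyone specifying it
print(f"learned grip force ~ {force:.2f} N")
```

Nobody told the loop what an egg is - it discovered the right force from the reward signal alone. That’s the appeal (scales with compute) and the catch (the simulated egg is not your egg).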
Why Delivery?
So why humanoid robots, and why delivery?
Well, because delivery is a great stress test for general-purpose robotics. It involves locomotion, manipulation, decision-making, object recognition, terrain adaptation, and timing. Also because Amazon happens to have 100,000 electric vans and hundreds of thousands of human drivers - a very expensive, very repetitive physical workflow that is both lucrative and deeply annoying to scale.
It’s worth pointing out that humanoid robots are, from a pure efficiency standpoint, kind of a weird choice. Wheels are better than legs. Claws are cheaper than fingers. But humanoids aren’t about optimality - they’re about compatibility. They can use the world as it is: climb stairs, turn knobs, press buttons, carry a box across a front yard with a sprinkler going. This makes them the least elegant but most plug-and-play form factor, because the world is designed for humans.
Okay But Then What?
Once the robots can deliver packages, they can do other things too: stock shelves, restock hotel minibars, bus tables, assist in elder care, build houses, hang drywall, fold towels, and eventually (one assumes) make a halfway decent omelet.
The shift from task-specific automation to general physical intelligence is a huge unlock. What GPT did for reasoning, these models may do for movement.
And once you can copy-paste a physical skill - laundry folding, warehouse sorting, house cleaning - you can scale labor like software.
This is why everyone is betting on embodied AI right now - Amazon, Google, NVIDIA, OpenAI, Tesla, Figure, Sanctuary, SKILD, Physical Intelligence. Because if you believe robots can learn general physical intelligence - not just motion, but adaptation - then a robot becomes a substitute for labor and that is one of the biggest market opportunities in the world.
As of early 2025, the U.S. labor force stands at approximately 170.7 million people. With median annual earnings of around $62,088, the U.S. labor market is now valued at roughly $10.6 trillion per year. Globally, about 3.4 billion people are employed. Assuming an average annual salary of $9,000, this implies a global labor market of ~$30.6 trillion, or roughly 30% of global GDP. Of course, not all of this is addressable, but it illustrates the scale of the opportunity.
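The arithmetic behind those figures, for anyone who wants to check it (the ~$105 trillion world-GDP denominator is my rough assumption, not from the sources above):

```python
# Back-of-the-envelope check of the labor-market figures above
us_workers = 170.7e6
us_median_pay = 62_088
print(f"US labor market: ${us_workers * us_median_pay / 1e12:.1f}T")  # $10.6T

global_workers = 3.4e9
avg_salary = 9_000
global_market = global_workers * avg_salary
print(f"Global labor market: ${global_market / 1e12:.1f}T")           # $30.6T

global_gdp = 105e12  # rough assumption: ~$105T world GDP
print(f"Share of global GDP: {global_market / global_gdp:.0%}")       # ~29%
```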
Once the cost of training a new behavior approaches zero, the economics start to look eerily like software. Or SaaS. Except instead of deploying code, you’re deploying movement. At scale. In the real world.
Some bonus content you may find interesting:
Check out this clip from a factory in Shanghai where humanoid robots are in mass production and use.
Elon Musk has recently been sharing increasingly mind-bending clips of Tesla’s bipedal humanoid robot, Optimus, dancing.
The future is coming at us fast!