“Why don’t we have robots in our homes yet?” It’s a question we’ve been asked countless times, and for good reason. We’ve had decades of impressive demos - robots walking, hopping, even flipping - but real, useful robots still feel a little out of reach. We have come a long way since the Unimate, but despite the advances in data-driven robotics, we still have a long way to go.
This post is a journey through that transformation: from legged robots in labs to transformer-driven manipulation systems - and what we still need to solve.
What Is a Robot, Really?
Before we dive in, let’s define our terms. A robot is a machine with sensors, actuators, and compute designed to perform a task in the physical world. It doesn’t have to look like a person, talk like a person, or even move like one. At its core, a robot senses the world, processes that data, and acts on its environment.
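That sense-process-act loop is worth making concrete. Below is a minimal sketch of the idea in Python; read_sensors, compute_action, and send_command are hypothetical placeholders for illustration, not any particular robot’s API.

```python
import time

def read_sensors():
    # Placeholder: a real robot would return camera frames,
    # joint encoder readings, IMU data, and so on.
    return {"joint_angles": [0.0, 0.0], "imu": (0.0, 0.0, 9.81)}

def compute_action(observation):
    # Placeholder: anything from a hand-tuned controller
    # to a learned policy could live here.
    return {"joint_torques": [0.0, 0.0]}

def send_command(action):
    # Placeholder: write the action out to the motor drivers.
    pass

# The core of almost every robot: sense, process, act - on repeat.
for _ in range(1000):
    obs = read_sensors()
    action = compute_action(obs)
    send_command(action)
    time.sleep(0.01)  # roughly a 100 Hz control loop
```

Everything interesting in the rest of this post is about what goes inside compute_action.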
What it doesn’t do - at least not yet - is think multimodally about the world. Real-world reasoning requires agents that can process and reason across visual, spatial, and temporal dimensions simultaneously. While Physical Intelligence’s Pi0.5 model demonstrates promising chain-of-thought reasoning in text, the future demands systems that reason natively in multimodal space - integrating vision, language, and physical understanding into coherent decision-making. Building such systems means moving beyond text-based reasoning chains toward agents that can observe, plan, and act with the same integrated intelligence humans bring to manipulating the world. We have defined the Grounded Turing Test to measure intelligence in embodied systems here.
The Early Days: Pure Physics
Rewind to the 1980s and 90s. At MIT’s Leg Lab, robots like one-legged hoppers and two-legged walkers were controlled using beautifully engineered physics models. These machines had no cameras, no memory—just orientation sensors and code that reacted to basic environmental cues.
They were brilliant pieces of engineering, but also deeply limited. Everything had to be modeled manually - down to the torque in each joint. The moment something unexpected happened, the robot failed.
This wasn’t because roboticists lacked imagination - it was because compute and sensors were nowhere near where they needed to be. Vision? Infeasible. Memory? Forget it.
Teleoperation to the Rescue… Sort Of
By the late 2000s, we saw robots like the PR1 - built at Stanford by the team that went on to found Willow Garage - manipulating objects and even “doing dishes.” But here’s the catch: all of it was remote-controlled. A human operator was pulling the strings, using the robot’s sensors to guide each motion.
These robots were better, with cameras and more degrees of freedom, but they still didn’t “understand” their environment. The complexity was in the person behind the screen - not the machine.
Enter Machine Learning
Starting in the mid-2010s, we saw a shift: what if robots could learn instead of being programmed for every situation?
There are two broad approaches here:
Specialized models: Break the problem into chunks—detect object geometry, plan a grasp, execute the motion.
End-to-end learning: Feed in raw sensor data, and have the model output the right action directly.
Specialized models are easier to debug but don’t scale as well. End-to-end learning is harder to train but more general (a sketch of the contrast follows below). Around 2019–2020, the field began leaning more heavily toward the latter, especially with the rise of deep neural networks and reinforcement learning.
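Here is a minimal sketch of both approaches, assuming toy placeholder functions; detect_geometry, plan_grasp, and the rest are made up for illustration, not a real library.

```python
import numpy as np

# --- Approach 1: a specialized, modular pipeline ---
def detect_geometry(rgb_image):
    # Placeholder perception module: estimate the object's pose.
    return {"position": np.zeros(3)}

def plan_grasp(object_pose):
    # Placeholder planner: choose a grasp pose from the geometry.
    return {"gripper_pose": object_pose["position"] + np.array([0.0, 0.0, 0.1])}

def execute_motion(grasp):
    # Placeholder controller: move the arm to the grasp pose.
    return "done"

def modular_pipeline(rgb_image):
    # Each stage is interpretable and debuggable in isolation.
    return execute_motion(plan_grasp(detect_geometry(rgb_image)))

# --- Approach 2: an end-to-end learned policy ---
def end_to_end_policy(rgb_image, policy_weights):
    # A single learned mapping from raw pixels to motor commands.
    # (Here just a stand-in linear map; in practice a deep network.)
    features = rgb_image.flatten() / 255.0
    return policy_weights @ features  # e.g. 7 joint velocity commands

image = np.zeros((64, 64, 3))
weights = np.zeros((7, 64 * 64 * 3))
print(modular_pipeline(image))
print(end_to_end_policy(image, weights))
```

The trade-off is visible in the structure itself: the pipeline exposes every intermediate result, while the end-to-end policy hides everything inside one learned mapping.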
The Bitter Lesson
We are firm subscribers to Richard Sutton’s “bitter lesson”: general methods that scale with compute and data tend to outperform bespoke, hand-engineered solutions in the long run.
You can get short-term wins by handcrafting your pipeline. However, given enough data and compute, a generalist model will usually win. That lesson has started to reshape robotics.
The Transformer Revolution
Transformers - the architecture behind language models like GPT, originally introduced for machine translation - changed the way we think about processing sequence data. They allowed us to train on unlabeled data (called self-supervised learning), and they scaled better than anything we’d had before.
In natural language processing, transformers replaced decades of research on grammar, syntax, and hand-crafted rules. In robotics, they’re enabling systems that map vision and language directly to robotic actions. The same scaling mindset has also shown that simpler architectures like CNNs and MLPs can be just as effective when scaled up for specific problems.
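To show what “mapping vision and language directly to actions” looks like at the interface level, here is a schematic sketch. The class and method names are hypothetical and heavily simplified - this is not the API of any of the systems mentioned below - but the shape of the input and output is representative: an image plus an instruction in, a short chunk of future actions out.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    rgb_image: np.ndarray        # e.g. (H, W, 3) camera frame
    joint_positions: np.ndarray  # current proprioceptive state

class VisionLanguageActionPolicy:
    """Schematic vision-language-action (VLA) policy.

    A real system would tokenize the image and the instruction, run a
    large transformer, and decode a chunk of future actions. Here the
    forward pass is a stub so the structure stays readable.
    """

    def __init__(self, action_dim: int = 7, horizon: int = 16):
        self.action_dim = action_dim
        self.horizon = horizon

    def act(self, obs: Observation, instruction: str) -> np.ndarray:
        # Stub: return a (horizon, action_dim) chunk of actions.
        # In a trained model this would come from the transformer decoder.
        return np.zeros((self.horizon, self.action_dim))

policy = VisionLanguageActionPolicy()
obs = Observation(rgb_image=np.zeros((224, 224, 3)), joint_positions=np.zeros(7))
actions = policy.act(obs, "fold the towel and place it on the shelf")
print(actions.shape)  # (16, 7): a short trajectory the robot executes
```

Everything else in these systems - tokenization, the transformer backbone, the action decoder - lives behind that single act call.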
We’re seeing companies like Google, Tesla, TRI, Physical Intelligence and Figure use these methods to create surprisingly capable robots:
Figure’s Helix platform folds laundry using imitation learning.
Google’s Gemini Robotics model runs on-device to interpret vision and language in real time.
Toyota’s robots can score an apple, slice it, and interact with kitchen tools—all from human demonstrations.
Physical Intelligence’s robots can enter an Airbnb unseen at training time and perform long-horizon tasks such as making a bed.
These systems are trained not by hand-coding every behavior, but by learning from data at scale. The art of engineering your data is critical for success: simply collecting whatever data you can is not enough for an end-to-end system. Data will become a bottleneck as we look to scale up end-to-end systems, so we need to be smarter about how we provide it.
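One concrete facet of engineering your data is deciding how much of each source a model actually sees during training. The sketch below shows a simple weighted-mixture sampler; the dataset names and weights are invented for illustration.

```python
import random

# Hypothetical data sources and mixture weights: more weight on
# high-quality teleoperation data, less on noisy scripted episodes.
DATA_MIX = {
    "teleop_kitchen": 0.5,
    "teleop_laundry": 0.3,
    "scripted_pick_place": 0.15,
    "web_video_pretraining": 0.05,
}

def sample_source(mix):
    """Pick a data source in proportion to its mixture weight."""
    sources = list(mix.keys())
    weights = list(mix.values())
    return random.choices(sources, weights=weights, k=1)[0]

# During training, each batch is drawn according to the mixture,
# rather than according to whatever happens to be on disk.
counts = {name: 0 for name in DATA_MIX}
for _ in range(10_000):
    counts[sample_source(DATA_MIX)] += 1
print(counts)
```

Choosing that mixture - and re-weighting it as the model's failure modes become clear - is part of what separates a useful dataset from a large one.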
Why These Robots Are Still Limited
Despite the progress, we’re not in the Jetsons era yet. Here’s why:
Power constraints: Most robots last just 2–3 hours before needing a recharge.
Data bottlenecks: Collecting real-world robot data is expensive and slow.
Sim2Real gap: What works in simulation often fails in the real world.
Limited reasoning: These models don’t “think,” they predict based on pattern matching.
Limits of hardware: Precise tasks like inserting a key in a lock and turning it are at the limits of our hardware and sensing stack.
Embodiment bias: We keep building humanoids because we’re human, but that may not be the best shape for solving the problem.
What’s Next?
We’re entering the age of embodied intelligence: AI systems embedded in the physical world.
Three big directions to watch:
Multimodal perception: Vision is great, but we also need force, tactile, depth, and sound to fully understand the world.
Real-world reasoning: Not consciousness, but predictable and trustworthy behavior.
Flexible embodiment: Not all robots should look like us. We need machines built for the task, not for the aesthetic.
How We Talk About This Matters
Words shape perception. If you’re in the business of building or explaining these systems, be careful with language. Robots don’t “understand” or “decide.” They infer, react, and execute.
Calling them sentient or implying agency leads to confusion and misaligned expectations. They’re machines. Fascinating, powerful, and increasingly useful, but machines nonetheless.
Final Thoughts
Robotics is at an inflection point. We’ve gone from handcrafted control to data-driven generalization. And while there’s still a long road ahead, the journey is accelerating.
If you’re just getting into the field: don’t ignore any part of the stack. Robotics is inherently multidisciplinary. Mechanical, electrical, software, systems - all of it matters.
And if you want to work in robotics? Apply broadly. Build something. Stay humble. Be persistent.
We’re building the future. One imperfect, data-hungry robot at a time.