What does it really mean for a machine to be intelligent?
In 1950, Alan Turing proposed his famous Imitation Game as a test for machine intelligence. The test was simple: could a computer, through text alone, fool a human into believing it was also human? For decades, this remained a distant goal. Today, large language models (LLMs) have, by many measures, passed this test. ChatGPT can effortlessly write a sonnet on demand, a task Turing himself proposed in his original paper.
But does that make it intelligent?
We present our take on this question in a forthcoming chapter of the book Designing an Intelligence, edited by George Konidaris. We argue that despite this remarkable linguistic fluency, something profound is still missing. While LLMs can manipulate language with superhuman skill, we believe that they are not truly intelligent. They are more like incredibly sophisticated "calculators" for words. Turing's core idea, that disembodied language use is a sufficient test for intelligence, is false.
The Elephant in the Room: Embodiment
As robotics pioneer Rodney Brooks once noted, elephants don't play chess. They don’t write sonnets, either. Yet, we all agree that an elephant is an intelligent creature. It has goals, it makes plans, and it engages in complex, goal-directed behavior in the physical world.
Consider the crow that painstakingly drops rocks into a tube of water to raise the level high enough to drink. This is a clear display of intelligence—understanding cause and effect, interacting with the world, and taking action to achieve a goal, all without a word of human language.
This is the missing piece: embodiment. A truly intelligent agent must be a computational system embedded in space and time, with high-dimensional sensors to perceive the world and high-dimensional motors to act within it. Intelligence can't just be about processing the internet; it has to be about processing the world.
A New Benchmark: The Grounded Turing Test
If the original Turing Test is no longer sufficient, what should replace it? We propose a new benchmark: the Grounded Turing Test.
To pass this test, an AI must be embodied in a robot that can use language in a way that is fundamentally grounded in the physical world. It’s not enough to just talk. The robot’s success or failure is defined by its physical and behavioral response to language. The test requires a fluid, collaborative dialogue where the robot demonstrates a deep connection between words, perception, and action.
What would this look like in practice? The Grounded Turing Test is made up of a whole suite of linguistic capabilities. Here are just a few examples, followed by a toy sketch in code:
Interpreting Instructions: You could tell the robot, "Pick up the red block," and it would need to perceive the block and physically pick it up.
Understanding the World: You could state, "The red block is on the table," and the robot should update its internal model of the world, so that it can use that information later to find the block.
Asking for Help: If the robot can't reach the block, it should be able to ask you, "Can you give me the red block?"
Explaining its Actions: If you ask, "Why did you drive to the table?" it should be able to explain its reasoning: "You want me to pick up the red block, and you told me that the red block is on the table."
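To make the first few capabilities concrete, here is a minimal sketch in Python of a grounded interpreter. Everything in it is an illustrative assumption on our part, not code from the chapter: a real robot would ground these symbols in high-dimensional perception and motor control rather than in string matching.

```python
import re

class ToyGroundedRobot:
    """Toy interpreter: grounds utterances in a symbolic world model.

    Illustrative sketch only. Real grounding would connect words to
    perception and action, not to regular expressions.
    """

    def __init__(self):
        self.locations = {}  # stated facts, e.g. {"red block": "table"}

    def handle(self, utterance):
        text = utterance.lower().rstrip(".")
        # "The red block is on the table" -> update the world model.
        m = re.match(r"the (.+) is on the (.+)", text)
        if m:
            obj, place = m.groups()
            self.locations[obj] = place
            return f"Noted: the {obj} is on the {place}."
        # "Pick up the red block" -> act, or ask for help if we can't.
        m = re.match(r"pick up the (.+)", text)
        if m:
            obj = m.group(1)
            place = self.locations.get(obj)
            if place is None:
                return f"Can you tell me where the {obj} is?"
            return f"Driving to the {place} and picking up the {obj}."
        return "Sorry, I didn't understand that."

robot = ToyGroundedRobot()
print(robot.handle("Pick up the red block."))          # asks for help
print(robot.handle("The red block is on the table."))  # updates world model
print(robot.handle("Pick up the red block."))          # now acts
```

The point of the toy is the loop, not the string matching: a statement changes the robot's model of the world, and that changed model changes how a later instruction is carried out.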
The Path to Truly Intelligent Robots
Building a system that can pass the Grounded Turing Test is the grand research challenge for our field. It requires us to move beyond static, pre-collected datasets and develop AI that can learn continuously from a real-time stream of high-dimensional sensory input. In the chapter, we outline a technical roadmap for achieving this goal, proposing a unified framework that we call the Human-Robot Collaborative POMDP. This framework models the physical world, the human’s mental state, and the robot’s actions within a single decision-theoretic model.
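For readers who want a concrete picture of the formalism, the sketch below writes out a standard POMDP tuple (S, A, Ω, T, O, R, γ) and its belief update in Python. The factoring of the state into world, human mental state, and robot components follows the description above, but the names and structure are our own illustrative assumptions, not the chapter's actual definitions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CollaborativePOMDP:
    """Standard POMDP tuple; the state factoring is an assumption.

    Each state is imagined as (world_state, human_mental_state, robot_state),
    actions include both physical motions and speech acts, and observations
    include both sensor readings and human utterances.
    """
    states: list
    actions: list
    observations: list
    transition: Callable    # T(s, a, s') -> probability
    observe: Callable       # O(s', a, o) -> probability
    reward: Callable        # R(s, a) -> float
    discount: float = 0.95  # gamma

def belief_update(belief, action, obs, pomdp):
    """Bayes filter: fold an action and an observation into the belief.

    `belief` maps states to probabilities; the robot never sees the true
    state (e.g., the human's goal), only evidence about it.
    """
    new_belief = {}
    for s2 in pomdp.states:
        predicted = sum(pomdp.transition(s, action, s2) * p
                        for s, p in belief.items())
        new_belief[s2] = pomdp.observe(s2, action, obs) * predicted
    total = sum(new_belief.values())
    return ({s: p / total for s, p in new_belief.items()}
            if total > 0 else belief)
```

Seen this way, a human statement like "The red block is on the table" is simply an observation that shifts the robot's belief, and a question like "Can you give me the red block?" is an action the robot chooses because it expects the answer, or the help, to improve its situation.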
Ultimately, the quest for AI is not about creating a better chatbot. It's about understanding the nature of intelligence itself. The beauty of language isn't in the words themselves, but in what those words represent: high-dimensional sensory inputs, collected over time through active interaction with the physical world.
We imagine a future not where robots replace us, but where they become our collaborators, augmenting our own abilities and making us more productive as a species. This journey starts with setting the right goal—a benchmark that captures the rich, embodied, and interactive nature of true intelligence.
We hope you'll join the conversation and look for Designing an Intelligence when it arrives in 2026.
Your post reminded me of an Oliver Brock paper, "The Work Turing Test," which adds some ideas about organizing the test to address a breadth of tasks. (It isn't easy to find, but Oliver has promised to post it to arXiv.)
One difference is the connection to language that you propose. Here is an issue I am curious about. I would say that our conscious processes have only a tenuous grasp of what our subconscious physical intelligence (the "inner robot"?) is doing. If you ask somebody how they do something, they are likely to expose a limited understanding. I am surprised that, when somebody gives a simplistic answer, the inner robot doesn't smack its forehead in frustration. But I guess the inner robot doesn't understand the answer. Anyway, since your proposal straddles language and physical intelligence, maybe it would shed some light on the gap between them.
The argument you are making, that disembodied intelligent use of language is not 'true' intelligence, has also been made repeatedly by Yann LeCun, the former head of Meta's AI effort. As it happens, in two recent posts on my blog, I asked a couple of AI chatbots what they thought of this argument.
Both Kimi K2 and ChatGPT-5 gave a pretty good demonstration that they understood the issues, and I think both provided a nuanced answer: physical grounding of language may be nice when discussing block worlds or cats on mats, but it is not strictly necessary when discussing ideas. I report the responses of these two chatbots verbatim; Claude, Gemini, DeepSeek, and ChatGPT 4.5 also gave quite good responses to this question, as well as to some others.