Your post reminded me of an Oliver Brock paper, "The Work Turing Test," which has some additional ideas about organizing the test to address a breadth of tasks. (It isn't easy to find, but Oliver is promising to put it on arXiv.)
One difference is the connection to language that you propose. Here is an issue I am curious about. I would say that our conscious processes have only a tenuous grasp of what our subconscious physical intelligence (the "inner robot"?) is doing. If you ask somebody how they do something, they are likely to reveal only a limited understanding. I am surprised that, when somebody gives a simplistic answer, the inner robot doesn't smack its forehead in frustration. But I guess the inner robot doesn't understand the answer. Anyway, since your proposal straddles language and physical intelligence, maybe it would shed some light on the gap between them.
Would love to read his paper! We cited other variants of the Turing Test in our chapter, but if he hasn't posted his, then I feel less bad about missing it. I agree that some of what our inner robot is doing is not accessible to our conscious language mind. But I believe a surprising amount is. As an example, you recently coined the term "ulnar grasp" and now you see ulnar grasps everywhere. I think many (but not all) of the things our inner robot is doing are like that - we can create and define a language to talk about them and bring them into our consciousness.
Great points, both! Adding to this... why not use our bodies to teach robots instead of language? That would solve the non-linguistic inner robot issue ;-]
Thanks for starting the blog, glad to see embodiment make a comeback!
For sure, we imagine learning from all kinds of different inputs. There is a larger embodiment gap from human bodies -> robot bodies, which makes it harder to transfer but easier to collect lots of data (hello YouTube!). But at the end of the day you need to close the loop - inputs -> outputs. I think there is a ceiling to 'off-policy' watching of videos or demonstrations or whatever, and we'll need on-policy closed-loop data to make on-policy closed-loop behavior work really well.
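To make the off-policy vs. on-policy distinction a bit more concrete, here is a minimal Python sketch. All of the names (Policy, learn_off_policy, learn_on_policy, environment) are illustrative assumptions, not from any particular robotics or RL library:

```python
# Toy sketch of off-policy vs. on-policy closed-loop learning.
# Everything here is illustrative; the learning rule itself is omitted.
import random


class Policy:
    """A stand-in policy: maps an observation to an action."""

    def act(self, observation):
        return random.choice(["left", "right", "grasp"])

    def update(self, observation, action, next_observation):
        pass  # the actual learning step is omitted in this sketch


def learn_off_policy(policy, demonstrations):
    # Off-policy: the data was generated by someone else (e.g. YouTube videos),
    # so the policy never observes the consequences of its *own* actions.
    for observation, action, next_observation in demonstrations:
        policy.update(observation, action, next_observation)


def learn_on_policy(policy, environment, steps=100):
    # On-policy, closed loop: the policy's own outputs determine its next inputs.
    observation = environment.reset()
    for _ in range(steps):
        action = policy.act(observation)
        next_observation = environment.step(action)
        policy.update(observation, action, next_observation)
        observation = next_observation
```

The point is purely structural: in the off-policy case the data arrives pre-recorded, while in the on-policy case the policy's own action determines the next observation it has to deal with.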
>This is the missing piece: embodiment. True intelligence must be a computational agent that is embedded in space and time, with high-dimensional sensors to perceive the world and high-dimensional motors to act within it. Intelligence can't just be about processing the internet; it has to be about processing the world.
I don't really get why. A crow or an elephant can't write code; ChatGPT 3.5 could, and GPT-3 could already kinda write cover letters. If a crow can't ever get an IMO gold medal but is considered "intelligent", while an LLM can but is never considered "intelligent", is that the same intelligence that we usually think about, or are we just substituting embodiment for intelligence? Something that is embodied is better at being embodied, sure, that is true, but does that make it intelligent?
And also, most LLMs "grew up on the internet". Kinda like some people who have not-so-great social skills IRL but can write at 150wpm and organize a raid because they played MMOs a lot - they're maybe less embodied than regular humans in the physical world but more embodied on the internet. Does that mean they are less intelligent? Or maybe we should think in terms of specialisation?
I think one reason that "intelligence" has been a moving target for the past 50 years is that it is a multi-faceted thing, so people figure out one facet (like Chess), and then other people point to a different facet and say "what about computer vision?". It doesn't mean that Chess wasn't part of what it means to be intelligent just because it didn't solve computer vision; we want "all of the above."
From that perspective, there is a facet that is missing from ChatGPT, and I think embodiment is a good word to describe that facet. We are pointing to crows and elephants because these creatures have some of the behaviors and capabilities that go along with this facet of intelligence, without having language at all. From a capabilities perspective, we mean processing high-dimensional, high-framerate sensor input and producing high-dimensional, high-framerate actuator output.
I think you are correct to say there is a spectrum of embodiment (maybe "missing" was too strong a word above), and LLMs have some (certainly processing images is a step in the right direction), but LLMs clearly have a lot less than crows and elephants. There are things crows and elephants can do that LLMs definitely cannot do (yet), and our Grounded Turing Test was trying to get at some of them.
Thanks, that clarifies things! I think this is a better way to explain things than what's in the article.
Starting from "ChatGPT is better than me at code but also if I put it in a robot it'll be worse than a well-trained dog, way worse. What is missing and how can we fix that" feels to me way more constructive than "LLMs are not intelligent but crows are". I feel like the way it's currently put, it begs the question of "so what is intelligence exactly if ChatGPT is not intelligent but a crow is?", in a way that distracts from the problem of "we want better robots, what is missing, how do we achieve that?".
I also feel, and correct me if I'm wrong, that in that reply you're okay with saying that intelligence is highly contextual, but the main article is more about "we're trying to crack what Intelligence is like, and for that we need Embodiment", which feels like two opposing perspectives.
I would argue that even embodiment is contextual: an LLM agent with lots of API keys can interact a lot with cyberspace, in a way an elephant or a crow never could, especially in a purposeful way. LLMs (well, LLMs with scaffolds, so "agents"?) can now read GitHub issues, make pull requests, and react to reviews the way a remote contractor can. I don't think they're yet at the point where they can watch a YouTube video and comment on it (I think they use the autogenerated subs?), but that's progressing too. They're very well embodied for a "cyberspace-native """animal"""". In a way this supports your theory: good LLMs with good scaffolding are more useful than those constrained to a chat interface, and what we call "scaffolding" could be "cyberspace embodiment". That way they are able to display a surprising amount of intelligence to people who were used to chat-interface LLMs.

(Aside from that, there was a viral video on Twitter of a turtle on a small skateboard. I've never been a big fan of turtles because they're so slow. But in that video the turtle was speeding around a cat and playing with it, displaying some "social"/"playing" skills that I've never seen a turtle display, and I think that's a good metaphor for what embodiment can be about.)
Yup, I think there is a knob around dimensionality/frame rate of the input and output. The turtle example is cool because it's kind of turning up the speed of the output, and suddenly it looks more intelligent. Similarly it's amazing how much cooler robot videos look when the robot is moving more quickly. Makes me think about the Chinese Room argument: I think it is wrong because imagining a person looking things up in a book makes you imagine something running at human speeds instead of a computational process running at computer speeds in order to generate *behavior* at human speeds. There are some theory-of-computation things to do here to formalize this intuition.
Your feedback on the story is great - my co-author has pointed out that this blogging thing is a process, so we will plan a post more from this perspective. Section 1.3 in our Chapter is already all about what is missing, and 1.4 on how to achieve it. Stay tuned!
Does this tag a person?
https://substack.com/@davidjwatkins
Yeah, very good point about framerate/speed/Chinese Room. Someone talking slowly appears less intelligent than someone talking quickly, even if they say the same thing. Here's the turtle: https://www.youtube.com/watch?v=bVbtAYPSapw
And yeah it's a process! It's super good that you're choosing to share this.
Super cool - the speeds and motion of moving on the skateboard are a lot like what he has to do when swimming in the water. But still pretty impressive out-of-distribution behavior!
I hadn't noticed that, interesting!
The argument you are making - that disembodied intelligent use of language is not 'true' intelligence - has also been made repeatedly by Yann LeCun, the former head of Meta's AI effort. As it happens, in two recent posts on my blog, I asked a couple of AI chatbots what they thought of this argument.
Both Kimi K2 and ChatGPT-5 gave a pretty good demonstration that they understood the issues. And I think that they both provided a nuanced answer that physical grounding of language may be nice when discussing block worlds or cats on mats, but that it is not strictly necessary when discussing ideas. I report verbatim the responses of these two chatbots to the question, but Claude, Gemini, DeepSeek, and ChatGPT 4.5 all also gave quite good responses to this same question, as well as some others.
Which post? I couldn't tell from the titles. I agree that it's not strictly necessary when discussing ideas. And that's something that really surprised me - I would have bet against it, and I would have been wrong. But I think there is a HUGE gap between discussing ideas and going out and building them in the physical world. Maybe we could even quantify it in a theory-of-computation sense, in terms of the dimensionality of the input space: words coming in as tokens vs. images coming in at camera frame rates; words coming out as tokens vs. actions coming out at 100 Hz for minutes or hours or days.
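Here's a rough back-of-the-envelope version of that quantification; every number below is an assumption chosen only to show the orders of magnitude, not a measurement:

```python
# Back-of-the-envelope bandwidth comparison; all numbers are illustrative assumptions.

# Language channel: tokens in and out of a chat interface.
tokens_per_second = 10                            # assumed reading/writing rate
language_values_per_second = tokens_per_second    # one discrete symbol per token

# Embodied input channel: camera frames.
frames_per_second = 30
values_per_frame = 640 * 480 * 3                  # RGB pixels in one VGA frame
camera_values_per_second = frames_per_second * values_per_frame

# Embodied output channel: motor commands.
control_rate_hz = 100
action_dimensions = 30                            # assumed, e.g. joint targets for a humanoid
action_values_per_second = control_rate_hz * action_dimensions

print(f"language: ~{language_values_per_second:,} values/sec")
print(f"camera:   ~{camera_values_per_second:,} values/sec")   # ~27,648,000
print(f"actions:  ~{action_values_per_second:,} values/sec")   # 3,000
# The camera stream alone carries roughly six orders of magnitude more
# values per second than the token stream.
```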
Sorry. The two posts I was referring to were titled "Grok 4 and Kimi K2", and "ChatGPT 5 Drops". Search for the string "LeCun" in each to find the passages I meant.
I think you are right, though, to point out that embodied human intelligence routinely deals with input bandwidths far higher than what a chatbot has to deal with. But, ironically, that is precisely why it is the white-collar jobs that AI is currently threatening, rather than the higher bandwidth blue-collar jobs!
Very cool, I took a look. So I think there is a difference between text about physical grounding of language, and actually physically grounding language to lead to goal-directed behavior in the physical world. To riff on one of the examples Kimi gave, if someone says "this coffee is cold" to a waiter, the waiter is supposed to infer that coffee is supposed to be hot, and then go to the kitchen to get hot coffee. Carrying out these actions in the physical world requires processing high-dimensional, high-framerate perceptual input, such as camera images, and producing high-dimensional, high-framerate movements to go to the kitchen, fill a mug, and bring it back to the customer full of hot coffee.
AI would certainly be more useful if it could use language and reasoning in a way that's grounded in the physical world. But I'm not sure I agree that something like the red block test you propose is essential to establish its being intelligent in a way that really does justice to what we mean by the word. I think I might be satisfied that it was an intelligent being if it were possible to teach it something in more or less the same way that one would teach a person.
For instance, suppose there were a game somewhat like chess that the AI had never played, and that was not mentioned at all in its training data. Would it be possible to teach it to be a good player just by talking to it? You tell it the rules, ask if it has any questions about how the game works, then you play a couple of games. Then you start talking about strategy. So you tell it, "your blue pieces give you power, but the red pieces give you flexibility, ways of surprising your opponent. It's not too bad to have a bit fewer red pieces than your opponent, but it's very important to have at least as many blue pieces as he does." So then you set up the board in a way that gives the AI a choice to become one down in red or one down in blue, and ask it what its next move would be. Maybe a while later you show it one clever trick you can play with blue, then ask if it can think of another one. Later, you tell it about styles of play: some opponents move in slow, subtle ways towards dominating the board, others push grimly towards the endgame. You give it an example of each style of play. Then you ask it to play a game against you playing in the slow, subtle style, and another in the grim-charge-towards-the-endgame style. And so on.
Current AI's intelligence is mostly a giant mnemonic trick: it's learned by seeing a million examples of everything. And of course it could learn to excel at the imaginary game with red and blue pieces in the same way. But if it could learn by instruction, of the kind that works with people, I think that would also satisfy me that it was genuinely intelligent.
The red block test is meant as an illustrative example; I don't think any "point solution" that handled just those examples would pass our test. We just used it to give a flavor of the sorts of language use we were getting at.
Your idea about teaching a game with language reminds me of Branavan's work on learning to play Civilization 2 better by reading the instruction manual (Branavan, S. R. K., David Silver, and Regina Barzilay. "Learning to win by reading manuals in a monte-carlo framework." Journal of Artificial Intelligence Research 43 (2012): 661-704.) and even older work by Chapman and Agre (Agre, Philip E., and David Chapman. "What are plans for?." Robotics and autonomous systems 6.1-2 (1990): 17-34.). I suspect that LLMs can do this or nearly do this now in "text space", and for me it doesn't pass our bar. We're planning a post that gets at this using chess as an example. There are many different perceptual ways a chess game state can look but they all boil down to the same low-dimensional state representation.
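As a rough sketch of what I mean by "boil down to the same low-dimensional state", here is the chess version with made-up but representative numbers (the pixel counts are assumptions; only the 64-squares part is inherent to chess):

```python
# Sketch: many perceptual renderings of a chess position collapse to the same
# small state representation. Pixel numbers are assumptions for illustration.

# Perceptual side: one photo of a physical board.
pixels_in_photo = 1920 * 1080 * 3         # RGB values in a single camera frame

# State side: 64 squares, each holding one of 13 possibilities
# (6 white piece types, 6 black piece types, or empty), plus a little
# metadata (side to move, castling rights, en passant square).
squares = 64
metadata = 5

print(f"values in one photo of the board: {pixels_in_photo:,}")      # 6,220,800
print(f"values in the underlying game state: {squares + metadata}")  # 69
# A photo, a cartoon diagram, or a text description each carry millions of
# perceptual values, but they all map to the same ~70-value state.
```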
I understand that your red ball examples are just a stand-in for a full test — just a sample of the things you would ask the AI to do.
You are right that chess and similar games are low-dimensional worlds. There are only a few things to know — which pieces are on the board, and what row and column each is in. But it seems like it would be possible to teach an instructable AI all kinds of things by starting the instruction with low-dimensional worlds. For instance, you could teach it to recognize and move a red ball by showing it a low-dimensional version of the situation — a cartoon where the ball is a red circle, the surface it's on is a brown rectangle, etc. Then, once it understands that, you add more dimensions to the information by using photos, explaining that they show the real-world equivalents of what the cartoon showed. So instructability seems like a route by which AI could become intelligent in your sense as well.
Whether it is possible to make AI more instructable I do not know. Somehow implanting a big-picture understanding of things and the rules that govern them is the approach that was originally used in attempts to build AI, and it failed. (Is that what the articles you cite are about?) But whether it is possible or not, I continue both to think and to feel intuitively that being instructable in my sense is the best test of what we mean by intelligence.
> I suspect that LLMs can do this or nearly do this now in "text space"
Really? I asked GPT-4o whether current AI could do this, and this is what it said.
> Your thought experiment about teaching an AI a complex game like chess purely through human-like instruction (rules, examples, strategic advice, styles of play) gets at a central question in current AI capabilities: Can modern AI models learn abstract structure and develop competence through explanation and interaction, rather than brute-force data exposure?

Bottom Line:
No current AI can become a highly skilled player of a complex game solely through human-style instruction and feedback.
But:
• It can simulate understanding well enough to follow rules and play passably.
• It can mimic strategic language and improve short-term behavior within a session.
• It gives a convincing illusion of learning — but it doesn’t actually develop skill in the way even a serious amateur human does.
But this is simulation of learning, not true cumulative learning:
• There's no stable model being updated under the hood.
• Any "learning" fades if you start a new session, unless you re-teach it.
• It doesn't self-correct in the way a human would ("Ah — I lost blue pieces early, and that always goes badly").
I understand your bias towards focusing on "Chess" as the "goal", but what we are talking about is creating AI that learns from high-dimensional information to build base abstractions on which it can build further. We would not suggest that computer scientists start with these games, which are complex in the base-abstraction sense, but instead build a curriculum that naturally results in these abstractions being built. We need to understand that a child builds this translation over years of experience, and we can build systems that can create the same abstractions too.
I agree with this and would add sociability and mutual attention as ongoing in-the-world challenges.
Thank you for commenting! Welcome! Totally agree with both, especially mutual attention. We are doing some work in my lab with human-dog interaction as a model for human-robot interaction in collaboration with Daphna Buchsbaum (https://sites.brown.edu/cocodevlab/). One thing we observed from watching people and their dogs is that it's *all* about establishing joint attention.
Very cool! 🐶