What Babies Know That Robots Don’t
On tokenization, biological Fourier transforms, and becoming data engineers
I watched Episode 4 of Netflix’s Babies documentary, expecting cute footage, and ended up thinking about representation learning for three days.
The episode follows researchers studying how infants crack language. Babies sit in labs, headphones on, listening to streams of made-up syllables. “Pabiku, golatu, tibudo, daropi, pabiku…” No pauses. No visual cues. Just sound. And after two minutes, these eight-month-olds can tell which three-syllable chunks belong together.
They’re tokenizing raw audio before they can speak words.
The Saffran Experiment
In 1996, Jenny Saffran and colleagues at the University of Rochester (where Stefie grew up!) ran a now-famous experiment. They created four nonsense “words”: pabiku, tibudo, golatu, and daropi. They concatenated them into a continuous stream. Within each word, the transitional probability between syllables was 1.0: pa always leads to bi, bi always leads to ku. Across word boundaries, the probability dropped to 0.33: ku could be followed by ti, go, or da.
After just two minutes of exposure (about 45 repetitions of each word), infants could distinguish the “words” from “part-words” like tudaro (spanning the boundary between golatu and daropi). The only information available was the statistical structure of syllable co-occurrence.
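To make the statistics concrete, here is a minimal Python sketch (my own construction, not Saffran’s actual stimuli) that builds a continuous syllable stream from the four nonsense words and measures the transitional probabilities directly from it:

```python
# Build a Saffran-style stream: four nonsense words concatenated in random
# order with no pauses, then estimate P(next syllable | current syllable).
import random
from collections import defaultdict

WORDS = [("pa", "bi", "ku"), ("ti", "bu", "do"), ("go", "la", "tu"), ("da", "ro", "pi")]

def make_stream(n_words=180, seed=0):
    """Concatenate randomly ordered words into one continuous syllable stream."""
    rng = random.Random(seed)
    stream, prev = [], None
    for _ in range(n_words):
        word = rng.choice([w for w in WORDS if w is not prev])  # avoid immediate repeats
        stream.extend(word)
        prev = word
    return stream

def transitional_probabilities(stream):
    """P(next syllable | current syllable), estimated from bigram counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(stream, stream[1:]):
        counts[a][b] += 1
    return {a: {b: n / sum(nexts.values()) for b, n in nexts.items()}
            for a, nexts in counts.items()}

tp = transitional_probabilities(make_stream())
print(tp["pa"]["bi"])   # 1.0: within-word transition
print(tp["ku"])         # roughly 0.33 each across the word boundary
```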
The researchers describe this as babies finding correlations in sounds. I immediately thought of tokenization and attention. It is the same statistical structure that transformers exploit, discovered twenty years earlier in infant cognition.
What’s Actually Happening
By the time a baby sits in Saffran’s lab, that child has already spent months building powerful abstractions about sound. The cochlea has been decomposing pressure waves into frequency bands. The auditory cortex has been learning what similar sounds mean. The two minutes of nonsense syllables aren’t learning from scratch, but instead applying the already-developed representations to a new domain.
Patricia Kuhl, a researcher at the University of Washington, showed that babies are taking statistics on the sounds around them from the moment they can hear. This includes time spent in the womb. Newborns show a preference for their native language within days of birth (Moon et al. 1993), and by six months, infants in Seattle and Stockholm already perceive vowels differently, tuned to the distributions in their respective languages. The infrastructure for statistical word learning is built before word learning happens.
This is closer to test-time fine-tuning than to training from scratch, or maybe to in-context learning, if you prefer the LLM framing. The baby arrives at the experiment with a foundation model of auditory processing, and the two minutes of pabiku golatu are just a prompt.
The Inner Ear is a Fourier Transform
One detail from the documentary stuck with me: the researchers mention that babies hear the ongoing melodies of speech as part of the flow of their environment. Before understanding any words, they’re sensitive to prosody, the pitch contours and rhythms that segment speech into phrases.
This detail points to something important. The cochlea, that snail-shaped organ in the inner ear, is a biological Fourier transform. Different positions along its length resonate to different frequencies. When sound enters, it’s physically decomposed into frequency bands by the structure of the organ itself. Babies don’t hear raw air-pressure fluctuations; they hear something closer to a spectrogram, the representation an engineer would reach for to analyze how a sound’s frequency content changes over time.
[FIGURE: Diagram of cochlear tonotopy—the base responds to high frequencies, the apex to low frequencies] Figure from Smimite, A. (2014). Immersive 3D sound optimization, transport, and quality assessment (Doctoral thesis). Université Sorbonne Paris Nord, France.
The cochlea’s frequency decomposition is a bias built into our collective wetware, a structure shaped by evolutionary processes rather than something trained.
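To see what that cochlea-like warping looks like in numbers, here is a tiny sketch (mine, using the standard mel-scale formula rather than an actual cochlear model) showing that bands spaced evenly on the mel scale get progressively wider in Hz:

```python
# Equal steps on the mel scale correspond to small Hz steps at low frequencies
# and large Hz steps at high frequencies, mirroring the cochlea's roughly
# logarithmic frequency map.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# 10 bands equally spaced in mel between 0 Hz and 8 kHz
mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 11)
hz_edges = mel_to_hz(mel_edges)
print(np.round(np.diff(hz_edges)))  # band widths grow from ~200 Hz to ~1.9 kHz
```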
An Experiment
I wanted to see if this matters, so I ran a simple experiment replicating the Saffran structure. I generated synthetic syllables as combinations of two frequencies, concatenated them into “words” and “part-words” following the same transitional probabilities, and trained simple classifiers to distinguish them. The factor I wanted to test was how well the classifiers learned depending on how the audio signal was represented.
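Here is a minimal sketch of the stimulus generation, under my own assumptions about the details (each syllable rendered as a short two-tone chord, with hypothetical frequency pairs):

```python
# Synthesize "words" and "part-words" from two-tone syllables.
import numpy as np

SR = 16000                      # sample rate (Hz)
SYLLABLE_MS = 150               # 3 syllables -> a 450 ms "word"
# hypothetical frequency pairs, one pair per syllable
SYLLABLE_FREQS = {
    "pa": (220, 660), "bi": (260, 780), "ku": (300, 900),
    "ti": (340, 1020), "bu": (380, 1140), "do": (420, 1260),
    "go": (460, 1380), "la": (500, 1500), "tu": (540, 1620),
    "da": (580, 1740), "ro": (620, 1860), "pi": (660, 1980),
}

def synth_syllable(name, sr=SR, ms=SYLLABLE_MS):
    """Render one syllable as the sum of its two component tones."""
    t = np.arange(int(sr * ms / 1000)) / sr
    f1, f2 = SYLLABLE_FREQS[name]
    return 0.5 * (np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t))

def synth_word(syllables):
    """Concatenate syllables into a word or part-word, e.g. ("pa", "bi", "ku")."""
    return np.concatenate([synth_syllable(s) for s in syllables])

word = synth_word(("pa", "bi", "ku"))        # a "word"
part_word = synth_word(("tu", "da", "ro"))   # a "part-word" spanning a boundary
print(word.shape)                            # (7200,) samples = 450 ms at 16 kHz
```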
Here are five ways (far from exhaustive) we could represent the audio; a code sketch showing how to compute each one follows the figure below:
Raw Waveform: The audio signal as-is—amplitude over time. A 450ms word at 16kHz is 7,200 numbers representing air pressure fluctuations. The problem: phase shifts destroy structure. The same syllable starting at a different point in time looks completely different, even though it sounds identical.
Spectrogram (STFT): Short-Time Fourier Transform. We slide a window across the signal and compute the frequency content at each position. This gives us a 2D image: time on one axis, frequency on the other, intensity as brightness. Because we keep only the magnitude, phase no longer matters; we simply see which frequencies are present at each moment.
Mel Spectrogram: Same as spectrogram, but frequencies are warped to match human perception. We hear the difference between 100Hz and 200Hz more easily than between 8000Hz and 8100Hz. The mel scale compresses high frequencies, mimicking the cochlea’s logarithmic frequency response.
MFCC (Mel-Frequency Cepstral Coefficients): Take the mel spectrogram, apply a log transform, then take the DCT of each frame. This captures the “shape” of the spectrum, roughly corresponding to vocal tract configuration, while discarding fine spectral detail. This has been standard in speech recognition for decades.
DCT of Waveform: Apply the Discrete Cosine Transform directly to the raw waveform. This decomposes the signal into frequency components, but globally across the entire duration. Unlike the spectrogram, there’s no time localization. A syllable at the start vs. the end of the word produces very different coefficients. It’s the wrong tool for a sequential structure. We include it to show that a decomposition which discards temporal order doesn’t give the network enough information to learn the task.
[FIGURE: Side-by-side visualization of “pabiku” in all five representations]
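For concreteness, here is a sketch of how each of the five representations can be computed with librosa and scipy (the parameters are my choices, not necessarily the ones used in the experiment), reusing the word waveform synthesized above:

```python
# Compute the five representations for one synthetic word.
import numpy as np
import librosa
from scipy.fft import dct

SR = 16000
y = word                                  # raw waveform from the previous sketch

# 1. Raw waveform: just the samples, shape (7200,)
raw = y

# 2. Spectrogram (STFT magnitude): frequency x time image
spec = np.abs(librosa.stft(y, n_fft=512, hop_length=128))

# 3. Mel spectrogram: same, with cochlea-like frequency warping
mel = librosa.feature.melspectrogram(y=y, sr=SR, n_fft=512, hop_length=128, n_mels=40)

# 4. MFCC: log-mel followed by a DCT across frequency, per frame
mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=13, n_fft=512, hop_length=128)

# 5. DCT of the whole waveform: global frequency content, no time axis
dct_global = dct(y, norm="ortho")

for name, x in [("raw", raw), ("stft", spec), ("mel", mel),
                ("mfcc", mfcc), ("dct", dct_global)]:
    print(name, x.shape)
```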
Results: Synthetic Tones
Here are the results for the synthetic tones:
Using the raw waveform doesn’t even beat chance. The classifier is trying to find statistical structure in a representation that doesn’t make that structure visible. The mel spectrogram, with frequencies weighted the way the cochlea weights them, trivializes the task.
The DCT destroys the exact information we need. The Saffran task is about the sequential structure of which syllable follows which. But the DCT treats the entire word as a single unit and asks, “What frequencies are present overall?” without preserving temporal order. It’s the wrong decomposition for a task that depends on sequence. At least the raw waveform preserves time, even if it encodes it poorly.
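To make the phase-shift point concrete, here is a quick check (my own, not part of the original setup): shift a synthetic syllable by a couple of milliseconds and compare how similar it stays to itself in each representation.

```python
# Shift a syllable by 2.5 ms and compare raw-waveform vs. mel-spectrogram
# similarity; the sound is the same, but the raw samples look very different.
import numpy as np
import librosa

shift = 40                                # 2.5 ms at 16 kHz
a = synth_syllable("pa")                  # from the earlier sketch
b = np.roll(a, shift)                     # same sound, different phase/offset

def cos_sim(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

mel_a = librosa.feature.melspectrogram(y=a, sr=SR, n_mels=40).ravel()
mel_b = librosa.feature.melspectrogram(y=b, sr=SR, n_mels=40).ravel()

print(cos_sim(a, b))          # raw waveform: near zero or even negative
print(cos_sim(mel_a, mel_b))  # mel spectrogram: close to 1
```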
Verification with Text-to-Speech
To confirm these results weren’t an artifact of my synthetic tone generation, I ran the same experiment using Google’s text-to-speech engine to produce actual spoken syllables.
Example word "pabiku" - synthetic tones
Example word “pabiku” - TTS
Example part-word “tudaro” - synthetic tones
Example part-word “tudaro” - TTS
Training stream sample (10 words concatenated)
The pattern holds with the TTS-generated speech:
Interestingly, MFCCs perform slightly worse with real speech than with synthetic tones. This makes sense because MFCCs were designed to capture phonetic identity while being invariant to speaker characteristics. That invariance throws away some of the information that distinguishes our artificial words. The mel spectrogram, which preserves more raw spectral detail, handles both cases perfectly.
The Implication for Robotics
The cochlea is evolution’s answer to the audio representation problem. We spend considerable effort in robotics designing clever representations: RGB feature extraction for SLAM, disparity maps for depth, and grid environments for A* search. These encodings make specific algorithms tractable, but they carry structural biases that may not transfer to learning.
End-to-end learning advocates argue for letting the network discover its own representations. There is some merit to this: sensor designers have already done significant work making these sensors reliable. Camera sensors, for example, are tuned for what the human eye perceives, and our eyes are good enough for the tasks we do every day. But it is not obvious that this tuning transfers to things we don’t do well, such as handling extremely hot materials. Similarly, it is not obvious that we should pass a spectrogram into a CNN for audio processing; indeed, wav2vec 2.0 and similar models bypass the spectrogram entirely, learning directly from raw waveforms. The translational invariance that makes CNNs powerful for images doesn’t hold when your axes (time and frequency) have fundamentally different semantics. The representation should match the structure of the problem.
Infants have an advantage over our synthetic sensors: evolution produced a cochlear design that makes statistical learning over acoustic sequences efficient. The representation is not neutral and is designed to make specific statistical structures, such as the sound of a mother’s voice, discoverable. If we want robots to learn as efficiently as infants, we need to think harder about whether our sensor outputs are the right substrate for learning.
Diversity Over Volume
Here’s a thought experiment. A thousand hours of warehouse piece-picking, or a thousand one-hour experiences across radically different contexts? Which is going to result in better performance for that warehouse piece-picking task?
Obviously, the warehouse-only data will win if you evaluate narrowly on the warehouse distribution it was trained on. The interesting test is what happens when the boxes are slightly different, the lighting changes, or a novel object appears. That’s where the diverse-experience system should pull ahead, because it has been forced to learn what’s invariant. The paper tears like fabric. The sand flows like rice. These cross-domain regularities are precisely what make cognition transferable.
We should be collecting data that forces this kind of abstraction. We should collect diverse experiences that require the model to discover what’s invariant across contexts.
Curriculum and Abstraction
Kathy Hirsh-Pasek’s research, featured in the documentary, shows that passive exposure to language is not enough. Children who engage in back-and-forth interaction show stronger language development than those who simply hear language spoken around them. Children actively test hypotheses and need feedback to refine them.
Infant-directed speech illustrates this scaffolding: the elongated vowels and exaggerated pitch contours make statistical structure more salient, helping infants discover word boundaries and phonetic categories.
The same logic applies to how experience is sequenced. Current policies are largely memoryless, but many tasks require temporal context and, more importantly, benefit from scaffolding that gradually builds complexity. The documentary shows bilingual infants separating languages based solely on their characteristic statistical patterns, but this works because they’ve already built an infrastructure for discovering statistical structure. We should present robots with increasingly difficult concepts, each just beyond their current representational capacity. Vygotsky called this the zone of proximal development, and the same principle applies to machine learning.
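As a toy illustration of what that could look like in code, here is a minimal sketch (my own, with hypothetical level names and thresholds, not a claim about any particular robot stack) of a curriculum sampler that advances only once the learner is reliably succeeding at the current difficulty:

```python
# A zone-of-proximal-development-style curriculum: sample tasks mostly from the
# current difficulty level and advance once recent success clears a threshold.
import random
from collections import deque

class Curriculum:
    def __init__(self, levels, advance_at=0.8, window=50):
        self.levels = levels            # e.g. ["1 object", "3 objects, clutter", ...]
        self.advance_at = advance_at    # success rate required to move on
        self.recent = deque(maxlen=window)
        self.current = 0

    def sample_task(self):
        # mostly the current level, occasionally an easier one for retention
        level = self.current if random.random() < 0.8 else random.randint(0, self.current)
        return self.levels[level]

    def report(self, success):
        self.recent.append(1.0 if success else 0.0)
        full_window = len(self.recent) == self.recent.maxlen
        if full_window and sum(self.recent) / len(self.recent) >= self.advance_at:
            if self.current < len(self.levels) - 1:
                self.current += 1
                self.recent.clear()
```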
The Real Work
The babies in the documentary are doing something remarkable, but they’re not magic (well, actually, they are magical, but the computations the speech-processing parts of their brains are doing aren’t magical). They arrive with hardware optimized for certain kinds of statistical learning. They receive input scaffolded by caregivers who unconsciously adjust their speech to be learnable. They actively explore their environment rather than passively receiving demonstrations.
We don’t have millions of years of evolution to design our sensor suites. But we can carefully consider the representations we’re providing, how we’re sequencing experience, and whether our data-collection methods actually capture the information needed to learn the abstractions we want.
The documentary left me with a simple conclusion: we should all become data engineers. We need to understand that the representation is part of the problem. The curriculum is part of the problem. The diversity of experience is part of the problem.
We must do the hard work of producing data, not just the cool work of consuming it.
References
Saffran, J.R., Aslin, R.N., & Newport, E.L. (1996). Statistical learning by 8-month-old infants. Science, 274(5294), 1926-1928.
Kuhl, P.K. (2004). Early language acquisition: cracking the speech code. Nature Reviews Neuroscience, 5(11), 831-843.
Aslin, R.N., Saffran, J.R., & Newport, E.L. (1998). Computation of conditional probability statistics by 8-month-old infants. Psychological Science, 9(4), 321-324.
Warstadt, A., et al. (2023). Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora. Proceedings of the BabyLM Challenge at CoNLL.
Hirsh-Pasek, K., & Golinkoff, R.M. (1996). The Origins of Grammar: Evidence from Early Language Comprehension. MIT Press.
Fernald, A. (1989). Intonation and communicative intent in mothers’ speech to infants: Is the melody the message? Child Development, 60(6), 1497-1510.
Smimite, A. (2014). Immersive 3D sound optimization, transport and quality assessment (Doctoral thesis). Université Sorbonne Paris Nord, France.
Moon, C., Cooper, R.P., & Fifer, W.P. (1993). Two-day-olds prefer their native language. Infant Behavior and Development, 16(4), 495-500.
Appendix
Representation Experiment with Like Words
https://drive.google.com/file/d/1S_YvsgEDBASTBIir1symxnlbvTNxH1X_/view?usp=drive_link
Representation Experiment with Like Words (TTS)
https://drive.google.com/file/d/1gHHaimWV6Qi5Zlh2u-lDpCCeeHTvNiGo/view?usp=drive_link