Andrej Karpathy weighs in on whether LLMs are truly 'bitter lesson' compliant
Oct 2, 2025
Key Points
- Andrej Karpathy agrees with Richard Sutton's critique that large language models lack a clean, single algorithm compliant with the 'bitter lesson'—instead layering human bias through data curation, fine-tuning, and reward signal design.
- Karpathy rejects common defenses of pure reinforcement learning, noting that animal brains arrive with billions of evolution-trained parameters and human infants carry powerful inborn priors that cannot emerge from random interaction alone.
- The open question remains whether a pure RL system learning only from world interaction without imitation or supervised fine-tuning can actually be built, warranting serious exploration with substantial resources and time.
Summary
Andrej Karpathy pushes back on the claim that large language models comply with the bitter lesson, Richard Sutton's principle that simple algorithms paired with massive computation outperform systems built on encoded domain knowledge.
Sutton raised this critique during a recent podcast with Dylan Patel of SemiAnalysis. LLMs train on finite human-generated data, then undergo fine-tuning with human-curated examples and reinforcement learning signals designed by human engineers. The result is not a clean, computation-scaling algorithm. It is a complex artifact laden with human bias at every stage: pre-training data, fine-tuning, reward signal design. Sutton, who calls himself a classicist, envisions something closer to animal learning: direct interaction with the environment through reinforcement learning, intrinsic motivations like curiosity, and continuous learning at test time rather than a train-once-then-deploy pipeline.
Karpathy agrees the criticism has merit. Frontier LLMs, he concedes, lack a single clean bitter-lesson algorithm you could simply unleash on the world. The field may be too deep in exploit mode and not enough in explore mode.
Karpathy is skeptical of two common counterarguments: AlphaZero learning Go from scratch, and animals as proof that pure reinforcement learning works. Go is a closed, rule-bound game. Animal brains, he contends, are not blank slates. A baby zebra runs within minutes of birth, a task of stunning sensory-motor complexity that cannot emerge from random muscle twitching. Animal brains arrive with billions of parameters pre-trained by evolution. Human infants carry powerful inborn priors encoded in DNA.
The practical question: can a pure RL algorithm that learns only from world interaction, without imitation or supervised fine-tuning, actually be built? Karpathy does not claim to know. He frames it as an open question worth exploring with serious budget and time.
Today's frontier models split compute roughly equally between pre-training and RL. Sutton's vision would collapse that to pure RL. OpenAI's Sora strategy hints at this direction: generating video assets, then immediately capturing user feedback as a reward signal rather than relying on offline pre-training. That feedback loop tightens the RL cycle.
A secondary tension runs through the segment: how much learning is maturation versus experience? A baby in the womb moves and twitches, perhaps a form of pre-birth reinforcement learning. A child raised in isolation, without anyone to imitate, might still figure out walking, or might not. Animal learning is not a clean proof of concept but a complex mixture of inborn structure and adaptive refinement.