Pre-training isn't dead: GPT-4.5 followed scaling laws and RL is amplifying, not replacing, larger base models
Apr 22, 2025 with Jack Whitaker
Key Points
- GPT-4.5 followed scaling law predictions; critics misread the math, confusing logarithmic improvement curves with model failure.
- Reinforcement learning amplifies larger base models rather than replacing pre-training, with no visible diminishing returns across training axes.
- Algorithmic efficiency and chip improvements will make significantly larger models feasible within years, enabling GPT-5 to combine RL techniques with substantially larger base models than current o-series.
Summary
The central question Jack Clark and his collaborator Trevor set out to answer was simple: did GPT-4.5 actually underperform, or did critics misread how scaling laws work?
When they plotted GPT-4.5's benchmark results against a log-linear scaling curve, the model landed roughly where the math predicted — a little better in some areas, a little worse in others. The reaction calling it "the beginning of the end for scaling laws" missed a basic point: doubling compute doesn't double benchmark scores. That's not a failure of the scaling law; it's how the law works.
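The point about doubling compute can be made concrete with a minimal sketch of a log-linear scaling curve. The coefficients below are invented for illustration, not fit to any real GPT-4.5 benchmark data:

```python
import math

# Illustrative log-linear scaling curve: score improves by a fixed
# increment each time compute doubles. A and B are made-up values,
# not fit to any real model's results.
A = 40.0   # hypothetical score at the reference compute budget
B = 3.0    # hypothetical points gained per doubling of compute

def predicted_score(compute_multiple: float) -> float:
    """Score predicted at `compute_multiple` times the reference budget."""
    return A + B * math.log2(compute_multiple)

# Doubling compute adds a constant B points; it does not double the score.
for c in [1, 2, 4, 8]:
    print(f"{c}x compute -> predicted score {predicted_score(c):.1f}")
```

Under this toy curve, going from 1x to 2x compute buys the same three points as going from 4x to 8x, which is exactly why a model landing "roughly where the math predicted" looks underwhelming to anyone expecting linear returns.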
Pre-training isn't dead — it's just expensive
Clark's argument isn't that the pre-trained base model is the best tool available today. It isn't. OpenAI's o3, which Clark and Trevor estimate is smaller than GPT-4 but larger than GPT-4 Mini based on token output speed, currently outperforms everything — because it layers reinforcement learning across chain-of-thought, tool use, and every other axis on which the model operates. The more important claim is that none of those axes are showing clear diminishing returns, which means the ceiling isn't visible yet.
The economic constraint is real. Training a model ten times the size of GPT-4.5 today is prohibitively expensive and likely unservable. Clark's view is that algorithmic efficiency gains — the kind Dario Amodei documented around DeepSeek — and continued chip improvements will make that scale feasible within a few years, even without proportional compute growth. The functional equivalent of compute scales up as the engineering gets smarter.
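A back-of-envelope calculation shows how hardware and algorithmic gains compound into "functional compute." The annual multipliers below are assumptions chosen for illustration, not figures from the conversation:

```python
# Effective compute compounds from both hardware and algorithmic gains.
# Both annual multipliers are illustrative assumptions.
hardware_gain_per_year = 1.35   # assumed chip / price-performance improvement
algo_gain_per_year = 2.0        # assumed algorithmic-efficiency improvement

effective_multiplier = 1.0
for year in range(1, 5):
    effective_multiplier *= hardware_gain_per_year * algo_gain_per_year
    print(f"year {year}: {effective_multiplier:.1f}x effective compute")
```

Under these assumed rates, the effective budget crosses 10x within about three years at flat spending, which is the shape of the argument: a run that is prohibitively expensive today becomes routine without proportional growth in raw compute.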
GPT-5, in Clark's read, will combine those RL training techniques with a significantly larger base model than the current o-series. He sees o3's agentic properties — tool calling, web browsing — as a preview of a trajectory, not a finished product. The next step is models that can hold a whole project, not just a single task chain.
The AGI framing
Clark finds the "10-minute AGI" framing roughly right: models are now very general and very capable within bounded time horizons, but they don't yet sustain autonomous, long-horizon work. He expects that to improve as a natural consequence of context expansion and model scale, rather than requiring some discrete breakthrough. The harder question, which he says he genuinely can't answer, is why demonstrably capable AI hasn't moved macroeconomic indicators yet. His working hypothesis is that technology diffusion into services and institutions takes much longer than the software community intuits — Stripe only crossed 1% of global GDP in transactions last year, decades after online payments became technically trivial.
Stanford's startup culture problem
Clark's read on Stanford CS is that the campus is underestimating how fast the model landscape is moving — most students assume incremental improvement within the current paradigm, not the kind of step-change that o3 to GPT-5 might represent.
On the startup question specifically, Clark argues the prestige economics have gotten distorted. Founding a company became high-status faster than actually building things did, which produced a generation of side-project founders who aren't fully committed. His advice to any Stanford student asking what to do this summer: go work at Ramp or Cursor, understand how a high-functioning organization operates, then start a company. He describes the wrapper-app era as a cautionary tale: a lot of revenue built on customers who hadn't discovered ChatGPT yet, not on durable product advantage. Cursor is the counter-example he points to, a wrapper that scaled with the model rather than against it by focusing on distribution and user experience instead of prompt engineering scaffolding.