Pre-training isn't dead: GPT-4.5 followed scaling laws and RL is amplifying, not replacing, larger base models
Apr 22, 2025 with Jack Whitaker
Key Points
- GPT-4.5 followed scaling law predictions; critics misread the math, confusing logarithmic improvement curves with model failure.
- Reinforcement learning amplifies larger base models rather than replacing pre-training, with no visible diminishing returns across training axes.
- Algorithmic efficiency and chip improvements will make significantly larger models feasible within years, enabling GPT-5 to combine RL techniques with substantially larger base models than current o-series.
Summary
The central question Jack Clark and his collaborator Trevor set out to answer was simple: did GPT-4.5 actually underperform, or did critics misread how scaling laws work?
When they plotted GPT-4.5's benchmark results against a log-linear scaling curve, the model landed roughly where the math predicted — a little better in some areas, a little worse in others. The reaction calling it "the beginning of the end for scaling laws" missed a basic point: doubling compute doesn't double benchmark scores. That's not a failure of the scaling law; it's how the law works.
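The point about doubling compute can be made concrete with a minimal sketch of a log-linear scaling curve. The coefficients below are invented for illustration, not fit to any real GPT-4.5 benchmark data:

```python
import math

# Illustrative log-linear scaling curve: score improves by a fixed
# increment each time compute doubles. A and B are made-up values,
# not fit to any real model's results.
A = 40.0   # hypothetical score at the reference compute budget
B = 3.0    # hypothetical points gained per doubling of compute

def predicted_score(compute_multiple: float) -> float:
    """Score predicted at `compute_multiple` times the reference budget."""
    return A + B * math.log2(compute_multiple)

# Doubling compute adds a constant B points; it does not double the score.
for c in [1, 2, 4, 8]:
    print(f"{c}x compute -> predicted score {predicted_score(c):.1f}")
```

Under this toy curve, going from 1x to 2x compute buys the same three points as going from 4x to 8x, which is exactly why a model landing "roughly where the math predicted" looks underwhelming to anyone expecting linear returns.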
Pre-training isn't dead — it's just expensive
Clark's argument isn't that the pre-trained base model is the best tool available today. It isn't. OpenAI's o3, which Clark and Trevor estimate is smaller than GPT-4 but larger than GPT-4 Mini based on token output speed, currently outperforms everything — because it layers reinforcement learning across chain-of-thought, tool use, and every other axis on which the model operates. The more important claim is that none of those axes are showing clear diminishing returns, which means the ceiling isn't visible yet.
The economic constraint is real. Training a model ten times the size of GPT-4.5 today is prohibitively expensive and likely unservable. Clark's view is that algorithmic efficiency gains — the kind Dario Amodei documented around DeepSeek — and continued chip improvements will make that scale feasible within a few years, even without proportional compute growth. The functional equivalent of compute scales up as the engineering gets smarter.
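A back-of-envelope calculation shows how hardware and algorithmic gains compound into "functional compute." The annual multipliers below are assumptions chosen for illustration, not figures from the conversation:

```python
# Effective compute compounds from both hardware and algorithmic gains.
# Both annual multipliers are illustrative assumptions.
hardware_gain_per_year = 1.35   # assumed chip / price-performance improvement
algo_gain_per_year = 2.0        # assumed algorithmic-efficiency improvement

effective_multiplier = 1.0
for year in range(1, 5):
    effective_multiplier *= hardware_gain_per_year * algo_gain_per_year
    print(f"year {year}: {effective_multiplier:.1f}x effective compute")
```

Under these assumed rates, the effective budget crosses 10x within about three years at flat spending, which is the shape of the argument: a run that is prohibitively expensive today becomes routine without proportional growth in raw compute.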
GPT-5, in Clark's read, will combine those RL training techniques with a significantly larger base model than the current o-series. He sees o3's agentic properties — tool calling, web browsing — as a preview of a trajectory, not a finished product. The next step is models that can hold a whole project, not just a single task chain.
The AGI framing
Clark finds the "10-minute AGI" framing roughly right: models are now very general and very capable within bounded time horizons, but they don't yet sustain autonomous, long-horizon work. He expects that to improve as a natural consequence of context expansion and model scale, rather than requiring some discrete breakthrough. The harder question, which he says he genuinely can't answer, is why demonstrably capable AI hasn't moved macroeconomic indicators yet. His working hypothesis is that technology diffusion into services and institutions takes much longer than the software community intuits — Stripe only crossed 1% of global GDP in transactions last year, decades after online payments became technically trivial.
Stanford's startup culture problem
Clark's read on Stanford CS is that the campus is underestimating how fast the model landscape is moving — most students assume incremental improvement within the current paradigm, not the kind of step-change that o3 to GPT-5 might represent.
On the startup question specifically, Clark argues the prestige economics have gotten distorted. Founding a company became high-status faster than actually building things did, which produced a generation of side-project founders who aren't fully committed. His advice to any Stanford student asking what to do this summer: go work at Ramp or Cursor, understand how a high-functioning organization operates, then start a company. He describes the wrapper-app era as a cautionary tale: a lot of revenue built on customers who hadn't discovered ChatGPT yet, not on durable product advantage. Cursor is the counter-example he points to, a wrapper that scaled with the model rather than against it by focusing on distribution and user experience instead of prompt engineering scaffolding.