GPT-4.5 reviewed: 10x more compute, but is the improvement worth it?
Feb 28, 2025
Key Points
- OpenAI trained GPT-4.5 on 10x the compute of GPT-4 but delivered improvements so incremental that casual users prefer GPT-4 in public testing, raising hard questions about returns on frontier model spending.
- OpenAI is shifting narrative toward reasoning and reinforcement learning as the next frontier after pure pre-training scale hit diminishing returns, positioning o1 as the model that unlocks math and code.
- Value capture now depends on distribution and integration—Microsoft's Office, Google's Workspace, and Musk's xAI strategy embedded in Tesla and X—not marginal capability gains on benchmarks.
Summary
GPT-4.5: A 10x Compute Bet That Doesn't Move the Needle
OpenAI trained GPT-4.5 on roughly 10x the pre-training compute of GPT-4, yet the model delivers gains so diffuse and incremental that they barely register as improvements in real-world use. The release crystallizes a harder question for the AI industry: at what point does massive capital expenditure stop justifying itself?
The version numbering itself tells the story of diminishing returns. GPT-1 barely generated coherent text. GPT-2 was a "confused toy." GPT-3 crossed into genuine capability. GPT-3.5 powered ChatGPT and sparked the AI moment—people actually felt they were talking to something human-like. GPT-4 felt subtly better across the board: "the water that rises all boats where everything gets slightly improved by 20%." Word choice was more creative. Nuance in prompts was understood better. Hallucinations declined. But concrete examples of GPT-4 outperforming 3.5 were hard to find.
GPT-4.5 lands in exactly the same place. The model improves on creative, knowledge-heavy, and analogy-making tasks—the EQ jobs rather than IQ jobs. Code and math show no meaningful lift because 4.5 relies purely on pre-training, supervised fine-tuning, and RLHF; OpenAI has not yet layered reasoning on top, the way it did with o1. That means when reasoning matters—calculus problems, complex coding tasks—o1 still owns the space.
Testing this in public revealed a crack in the narrative. A community poll on X asked users to compare GPT-4 and GPT-4.5 responses to five humorous prompts. GPT-4.5 won on only one. Sixty percent of users preferred GPT-4 on four out of five. The margins were significant: on one question, two people preferred GPT-4 for every one who picked 4.5. The poster, who tested it himself, found 4.5 better across all cases, and attributed the gap to "high taste"—he could perceive nuances (a punchier roast, more sophisticated rhyme schemes) that casual users missed. Others in the replies agreed. One comment: "The voters have poor taste."
This framing matters because it exposes the real problem. GPT-4.5 isn't broken; it's just not obviously better in ways that move behavior. For $10 billion in training spend, OpenAI gets a model that some experts prefer and most casual users do not. That's a poor return on capital in a market already saturated with capable models.
The reasoning pivot and the scaling debate
OpenAI's response, delivered through Mark Chen, is to shift the narrative away from pure pre-training scale. Chen posted that "we found a new paradigm through reasoning, which we're also scaling." The implication is clear: pure scale has hit diminishing returns; reasoning and reinforcement learning are the next frontier. Critics like Gary Marcus are taking victory laps, arguing that symbol manipulation and structured reasoning—not just bigger models—are necessary. Marcus has spent years arguing that deep learning would plateau exactly here.
But the criticism itself reveals Marcus's weakness. He has been saying this for years without building an alternative that works better. DeepSeek's team, by contrast, has demonstrated genuine algorithmic efficiency—they approached frontier capability with lower training costs. That's a different, more credible critique than saying "pre-training doesn't matter" while offering nothing in return.
OpenAI's move toward reasoning likely sidesteps the problem. If o1 shows that reasoning unlocks math and code where pure scale does not, then the company has found a real lever. The question is whether that lever will push hard enough to justify the next generation of capital spending.
The distribution and capital story
What matters more than the raw capability lift is what OpenAI does with it. A $100 million GPT-4 training run generated billions in annual revenue. If a $10 billion spend on frontier models—chips, energy, time—gets baked into silicon, integrated into Tesla vehicles, embedded into Grok, or licensed into enterprise workflows, the company's job is not to show 10x improvement per dollar spent. It's to show that the total investment generates enough margin and reach to pay for itself. Amazon spent billions on data centers without making the Amazon.com experience 10x better; it just printed money by owning the infrastructure.
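The break-even logic here can be made concrete with a back-of-envelope calculation. The $100 million training cost and $10 billion frontier spend come from the text; the $3 billion annual revenue figure is an assumed placeholder for "billions," used only for illustration:

```python
# Back-of-envelope ROI comparison. Training costs are from the article;
# the revenue figure is an ASSUMED placeholder for "billions in annual revenue".
gpt4_train_cost = 100e6   # $100M GPT-4 training run (from the text)
gpt4_revenue = 3e9        # assumed $3B annual revenue, for illustration only

# Revenue generated per training dollar under those assumptions.
gpt4_return = gpt4_revenue / gpt4_train_cost  # 30x

# To match that ratio, a $10B frontier spend would need this much revenue.
next_gen_cost = 10e9      # $10B (from the text)
breakeven_revenue = next_gen_cost * gpt4_return

print(f"${breakeven_revenue / 1e9:.0f}B per year")  # $300B per year
```

No model improvement alone plausibly clears that bar, which is the argument: the spend has to pay back through distribution and infrastructure ownership, not capability-per-dollar.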
Elon Musk's xAI strategy makes this bet explicit. Grok integrates into X, feeds into Tesla's voice assistant and autonomous systems, and potentially hooks into Neuralink. The value capture happens through distribution, not through marginal capability gains. If a user can ask Grok questions while driving, managing tasks, and accessing financial services, the utility compounds even if the underlying model is merely good enough, not revolutionary.
This also explains why incumbents with distribution advantages—Microsoft with Office, Google with Workspace, Ramp with payment rails and banking licenses—are better positioned than pure-play AI startups to harvest value from frontier models. A bank won't license Cursor to build a new banking product; it will license Claude or GPT-4.5 and wrap it in compliance, regulated payment rails, and existing customer relationships. Ramp and similar fintech platforms can swap in new frontier models in days and pass the improvement to customers as better receipt processing and accounting accuracy. DocuSign could rebuild itself with Cursor but has no way to distribute it; Google can distribute document signing to everyone already paying for Workspace with one click.
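The "swap in new frontier models in days" claim rests on a simple architectural choice: the product code talks to an adapter, not to a vendor SDK. A minimal sketch of that pattern, with stub backends standing in for real API clients (all names here are illustrative, not actual SDK calls):

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical adapter layer: product features depend on this interface,
# never on a specific vendor's SDK. Backends below are stubs.

@dataclass
class ModelBackend:
    name: str
    complete: Callable[[str], str]  # prompt -> completion

REGISTRY: Dict[str, ModelBackend] = {}

def register(backend: ModelBackend) -> None:
    REGISTRY[backend.name] = backend

def extract_receipt_total(model: str, receipt_text: str) -> str:
    """Example product feature, routed through whichever model is configured."""
    prompt = f"Extract the total amount from this receipt:\n{receipt_text}"
    return REGISTRY[model].complete(prompt)

# Stub backends; in production each would wrap a real vendor client.
register(ModelBackend("gpt-4.5", lambda p: "stub-gpt-4.5-answer"))
register(ModelBackend("claude", lambda p: "stub-claude-answer"))

# Swapping vendors becomes a one-line config change, not a rewrite.
print(extract_receipt_total("claude", "Coffee $4.50 ... TOTAL $4.50"))
```

Under this design, a better frontier model arriving upstream flows to customers as soon as a new backend is registered and configured.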
The moat question
Mike Mignano, a Lightspeed partner, posted a sharp critique: "Technology is not a moat. Never has been." The concern is that Cursor, Lovable, Bolt, and similar AI-native tools can be cloned. Windsurf AI reached a $40 million run rate in months. Lovable grew from zero to $17 million in three months. The same tooling that lets these companies move fast lets competitors move fast too. If you can use Cursor to build Cursor, or use Devin to build a variant of Devin, the revenue edge erodes quickly.
Scott Wu, CEO of Cognition (which built Devin), reframed this by arguing that Devin is not a developer tool but a team member. Devin shows "massive improvements over o1 and 4o" on Cognition's agentic coding benchmarks and spikes on architecture and cross-system interactions where GPT-4.5 excels. The implication is that Devin's defensibility comes not from proprietary tooling but from the complexity of orchestrating multiple models, managing agent state, and handling real-world software delivery. That's harder to clone than a fork and a UI layer. But it's also speculative—the field is too young to know whether that sticks.
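The claimed moat is the orchestration layer, not any single model call. A minimal sketch of what that means in practice (this is an illustrative toy, not Cognition's actual design): route each step to the model class best suited for it, and carry agent state between steps:

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch only: the defensible part is routing plus state,
# not the underlying model. Model names are hypothetical labels.

@dataclass
class AgentState:
    plan: List[str] = field(default_factory=list)      # steps still to do
    completed: List[str] = field(default_factory=list)  # audit trail

def route_model(step: str) -> str:
    # Assumed routing policy: reasoning-heavy steps go to a reasoning
    # model (o1-class), broad creative/refactor steps to a large
    # pre-trained model (GPT-4.5-class).
    if "debug" in step or "math" in step:
        return "reasoning-model"
    return "general-model"

def run_agent(plan: List[str]) -> AgentState:
    state = AgentState(plan=list(plan))
    while state.plan:
        step = state.plan.pop(0)
        model = route_model(step)
        # A real agent would call the model here and might push new steps
        # back onto state.plan based on the result.
        state.completed.append(f"{step} -> {model}")
    return state

result = run_agent(["draft architecture", "debug failing test"])
```

Even this toy shows why cloning is harder than forking a UI: the value sits in the routing policy, the state handling, and the feedback loop between them, all of which are invisible from the outside.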
The capital and narrative dance
The deeper tension is financial. OpenAI needs to justify a $500 billion valuation. Sam Altman needs to justify raising $150 billion in capital. He can't say, "Our model is slightly better at creative writing than last year's version." He has to sell superintelligence on the horizon. But if he oversells—if he claims AGI is imminent—regulators react, doomers mobilize, and protests follow. If he undersells—if GPT-4.5 lands as merely incremental—capital dries up. Investors don't write $20 billion checks for 20% improvements.
Satya Nadella (Microsoft) has found a middle ground: position AI as a co-pilot, a thought partner, not a replacement agent. That still justifies massive CapEx but does not require claiming superintelligence is arriving next quarter. OpenAI's shift toward reasoning may be an attempt at the same balance—proof that new paradigms still unlock value, without the "last chance for GPT-5" pressure that critics like Chubby are applying.
The irony is that none of this may matter. There is so much value still locked in the current technology stack that model development could pause for a decade and implementers could spend the entire time harvesting it. The dot-com boom took twenty years to cash out even though the internet's core capabilities stabilized early. Frontier models at the current level of capability are good enough to reshape software development, finance, creative work, and customer service. The question of whether they plateau or break through to reasoning is secondary to the question of whether companies will actually build around them.
GPT-4.5 is, in that sense, not a failure—just a reminder that the story has moved on from model leaderboards to implementation velocity, distribution reach, and capital efficiency. OpenAI can afford to miss the next benchmark by 2 points if its customers keep paying and its chips keep running.