Commentary

GPT-5.4 lands with strong early reviews, but comedy bench still unsolved

Mar 6, 2026

Key Points

  • Mercor's Apex Agents benchmark shows GPT-5.4 is the first model to exceed a 50% mean score on enterprise agent tasks, up 15.7 points in under three months.
  • A mathematician named Bartos says GPT-5.4 solved a research problem he had worked on for twenty years, calling it his personal 'Move 37' moment.
  • The manager gap remains: AI tools transform developers and domain experts but won't reach broad business adoption until users can describe an outcome and receive deployed software.

Summary

GPT-5.4 has landed to some of the strongest early reviews of any recent model release, but the reception surfaces a recurring gap. The models are getting dramatically better at the tasks developers throw at them, while most business managers have yet to find their moment of genuine breakthrough.

The developer community is enthusiastic. Mercor's internal benchmark, Apex Agents, shows GPT-5.4 as the first model to pass a 50% mean score on enterprise agent tasks, up 15.7 points in under three months. A year ago, frontier models scored below 5% and couldn't edit an Excel sheet. Developer and prominent livestreamer Theo says GPT-5.4 is "absurdly good" and that he doesn't want to use anything else. Ben Hylak calls it "the first model in a long time worth your time to try." In a direct speed test, GPT-5.4 High Fast completed all eight phases of a macOS coding project in roughly an hour while Claude was still on phase two. Justin, who spent a week testing it, describes it as a strong mix of Opus and Codex, fast and conversational, though he notes it lacks some of Opus's eagerness and Codex's precision.

The most striking reaction comes from a mathematician identified as Bartos, who says GPT-5.4 solved a research problem he had been working on for twenty years. He frames it as his personal "Move 37" moment, a reference to the move by DeepMind's AlphaGo against Lee Sedol that initially looked like a blunder before proving decisive.

The manager gap

That kind of breakthrough is happening for individual contributors, including developers, researchers, and domain experts, but not yet for managers. AI tooling has evolved from Stack Overflow replacement to autocomplete to agentic coding, but none of those interfaces map naturally to how a restaurant owner or a mid-level executive describes what they actually need. A manager who wants a better email system or wants to audit a $3,000 payment processing fee isn't going to find the current tools transformative. Broad diffusion will come when someone can describe a business outcome and receive finished, deployed software, not a pile of code that still requires a developer to ship.

Comedy bench

GPT-5.4 reliably produces the same hunter-and-gunshot joke when asked for the funniest joke in the world, and generates structurally identical "you're telling me..." jokes when prompted for variations on the shrimp-fried-rice format. The consistency looks less like wit and more like retrieval from a high-vote Reddit thread. RL fine-tuning appears to be optimized for economically valuable tasks, and comedy isn't one of them. Grok gets a mention as the class-clown alternative, though the consensus is that it resembles a class clown who keeps getting expelled.

Gaming and modding

Early signs point to an AI-powered modding boom. One developer one-shotted a functional Minecraft clone in 24 minutes. Another experiment pointed GPT-5.4 at the Pokémon Red ROM to autonomously edit it and replace Pokémon with AIs. The more durable opportunity may be mashups built on existing game engines: reskinning Age of Empires rather than building an RTS from scratch, or crossing Call of Duty aesthetics with The Sims. The same logic applies to AI music, where a fifties barbershop quartet cover of a 2000s rap song lands because both reference points are familiar, just never combined.

DeepMind financials

Google has never broken out DeepMind's financials. Unlike AWS, which the SEC eventually forced into a discrete reporting line, DeepMind captures value across YouTube, Search, and other surfaces in ways that are hard to attribute to a single P&L. OpenAI and Anthropic are both scaling revenue nearly vertically, with Anthropic roughly a month behind, but what Gemini is actually doing commercially remains opaque. A regulatory push similar to the one that forced AWS disclosure could change that picture significantly.