Commentary

GPT-5 hands-on: solid upgrade but not the leap people expected

Aug 11, 2025

Key Points

  • OpenAI's GPT-5 is a meaningful but incremental upgrade, succeeding at spatial reasoning tasks such as solving a Rubik's cube two moves from completion, where GPT-4.1 and o3 failed, but collapsing entirely at three-move problems.
  • GPT-5 fails at basic tool use despite web access, confidently hallucinating wrong solutions from online cube solvers, exposing a critical gap in agent-era AI training.
  • Anthropic's Claude Code generates responsive, interactive HTML UIs with embedded charts and formulas on demand, a genuine form-factor shift even as research quality remains weak.

Summary

GPT-5 is a solid incremental upgrade, not the leap many expected. Tyler tested the model extensively over the weekend and found it genuinely useful for coding, web search, and writing tasks, but not dramatically better than GPT-4.1 or o3. The model handles routine queries reliably. Agent-mode web searches work well enough—he found a Nintendo Switch 2 in stock. Writing quality surprised him on the upside.

GPT-5 shows real capability gains in spatial reasoning and multi-step logic. Tyler ran a Rubik's cube benchmark using a cube just two moves from solved. GPT-5 Thinking succeeded where o3, GPT-4.1, and every other model failed, and even base GPT-5 needed six minutes of thinking to get there. That suggests genuine reasoning improvement, not just o3 repackaged.
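The benchmark idea is easy to reproduce in spirit: scramble a solved cube with a couple of moves and check whether the model's answer undoes them. The sketch below is illustrative only, not Tyler's actual harness; the move set, prompt wording, and the conservative check (accepting only the exact inverse sequence) are assumptions for illustration.

# Minimal sketch of a "two moves from solved" cube check. Illustrative only,
# not the benchmark described above. Assumes standard face-turn notation
# (R, U, F, ... with ' for counterclockwise) and accepts only the canonical
# inverse-sequence answer, so it is a conservative, sufficient-but-not-necessary check.
import random

MOVES = ["R", "R'", "L", "L'", "U", "U'", "D", "D'", "F", "F'", "B", "B'"]

def invert(move: str) -> str:
    """Invert a single face turn: R <-> R'."""
    return move[:-1] if move.endswith("'") else move + "'"

def make_scramble(n: int = 2) -> list[str]:
    """Pick n random face turns, avoiding two consecutive turns of the same face."""
    scramble = []
    while len(scramble) < n:
        m = random.choice(MOVES)
        if scramble and m[0] == scramble[-1][0]:
            continue  # same face as the previous move; skip it
        scramble.append(m)
    return scramble

def is_canonical_solution(scramble: list[str], answer: list[str]) -> bool:
    """True if the answer undoes the scramble move-for-move in reverse order."""
    expected = [invert(m) for m in reversed(scramble)]
    return answer == expected

if __name__ == "__main__":
    scramble = make_scramble(2)
    prompt = (
        "A solved Rubik's cube was scrambled with the moves: "
        + " ".join(scramble)
        + ". List the moves that return it to solved, in order."
    )
    print(prompt)
    # Parse the model's reply into a move list, e.g. ["U'", "R'"], then verify it.
    candidate = [invert(m) for m in reversed(scramble)]  # placeholder standing in for a reply
    print("solves it:", is_canonical_solution(scramble, candidate))

A real harness would also need to accept non-canonical but valid solutions, which requires an actual cube simulator rather than this sequence comparison.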

The ceiling arrives immediately. Give GPT-5 three moves to go and it collapses entirely. Agent mode with tool use didn't help. The model found online cube solvers and hallucinated confident but wrong solutions, implying it hasn't been trained on those tools. A physics professor using GPT-5 Pro reportedly solved, in 25 minutes, a graduate-level problem that would historically take grad students months, which points to specialized domains where the model unlocks value.

The mismatch is instructive. GPT-5 can do spatial reasoning humans find hard but fails at tasks that should be trivial with web access—finding and correctly applying online solvers. This is the agent-era problem in miniature: LLMs now need training on how to use specific tools, not just raw reasoning.

Progress is real but granular. Agent capabilities lag reasoning capabilities. The exponential narrative that drove expectations doesn't match what's shipping.

Claude Code impressed with generative UI. When asked to create a research report on world models, it generated interactive HTML with embedded charts, formulas, and styled sections that were responsive on mobile. The research quality was weak and hallucinations abounded, but the form factor felt genuinely new. For the first time, an AI felt like it was generating bespoke UIs on demand rather than filling templates. That's a cleaner story for Anthropic right now than reasoning leaps.