Grok 4 launches at the AI frontier, posting record benchmark scores with equal pre- and post-training compute spend
Jul 10, 2025
Key Points
- xAI's Grok 4 is the first frontier model to spend equal compute on reinforcement learning and pre-training, scoring 44.4% on Humanity's Last Exam versus 26.9% for second place.
- A perfect 100% score on AIME 2025 signals data contamination rather than reasoning ability, undermining the benchmark's credibility as a measure of general intelligence.
- xAI shipped Grok 4 in under two years from founding, but frontier models without unique leverage like consumer adoption or vertical strength face pricing pressure toward cloud commodity rates.
Summary
xAI launched Grok 4 with equal spending on pre-training and reinforcement learning, a first in the industry. The model scores 44.4% on Humanity's Last Exam, 17.5 percentage points ahead of second place at 26.9%, and reaches 88% on GPQA, a benchmark of hard graduate-level science questions. Pricing is $3 per million input tokens and $15 per million output tokens, with a 256k context window.
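At those rates, per-request cost is simple arithmetic. A minimal sketch with illustrative token counts (the prompt and completion sizes below are assumptions, not measured values):

```python
# Back-of-envelope cost of one Grok 4 call at the published rates.
INPUT_RATE = 3.00 / 1_000_000    # dollars per input token  ($3 / 1M)
OUTPUT_RATE = 15.00 / 1_000_000  # dollars per output token ($15 / 1M)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Illustrative: a 10k-token prompt with a 2k-token answer.
print(f"${call_cost(10_000, 2_000):.2f}")  # $0.06
```

Note the 5x output premium: long reasoning traces, not long prompts, dominate the bill.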
The technical shift centers on reinforcement learning maturity. OpenAI's o-series proved that test-time scaling works by spending thousands of inference tokens on hard problems. Grok 4 executes that playbook at scale, using a mixture-of-experts architecture internally and running multiple parallel instances whose outputs are compared before the best answer is returned. The result is stacked layers of intelligence built from specialized sub-models. Taken to its conclusion, this becomes distributed querying: users check Grok, Claude, GPT, and Gemini in parallel and pick the best answer, as in the sketch below.
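A minimal asyncio sketch of that distributed-querying pattern; `query_model` is a hypothetical stand-in for each vendor's real SDK, and the length-based `score` is a placeholder for a proper verifier or reward model:

```python
import asyncio

async def query_model(model: str, prompt: str) -> str:
    # Hypothetical stand-in for a provider SDK call; every vendor's
    # actual client, auth, and endpoint names differ.
    await asyncio.sleep(0)  # placeholder for network latency
    return f"{model}: answer to {prompt!r}"

def score(answer: str) -> float:
    # Placeholder heuristic; in practice this would be a verifier
    # model, a reward model, or majority voting across samples.
    return float(len(answer))

async def best_of(models: list[str], prompt: str) -> str:
    # Fan the same prompt out to every model in parallel, then keep
    # the highest-scoring answer: the best-of-n pattern Grok 4
    # reportedly runs across its own parallel instances.
    answers = await asyncio.gather(*(query_model(m, prompt) for m in models))
    return max(answers, key=score)

# asyncio.run(best_of(["grok", "claude", "gpt", "gemini"], "hard question"))
```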
Benchmarks are losing signal. Data contamination threatens the math results: identical AIME 2025 questions appear on Quora, and similar ones on Math Stack Exchange. When benchmark problems circulate widely enough to enter training data, models memorize rather than reason, so Grok 4's perfect 100% on AIME reads as a red flag, not proof of general intelligence. Mike Knoop of ARC Prize notes that OpenAI's o-series progression on ARC matters more than Grok's win here: test-time adaptation is the frontier, not executing existing playbooks.
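A minimal sketch of one standard contamination test, verbatim n-gram overlap between a benchmark question and training documents; the 13-gram window and 0.5 threshold are illustrative assumptions, not any lab's published pipeline:

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Word-level n-grams; 13-gram windows are a common dedup size."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(question: str, corpus_docs: list[str],
                       threshold: float = 0.5) -> bool:
    """Flag a question if a large share of its n-grams appear verbatim
    in any training document; evidence of memorization, not reasoning."""
    q = ngrams(question)
    if not q:
        return False
    return any(len(q & ngrams(doc)) / len(q) >= threshold
               for doc in corpus_docs)
```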
Adoption is the real question. Polymarket traders now put xAI at 48% and Google at 45% in the market for best AI model by end of July, but benchmarks don't drive revenue. Anthropic stays off the leaderboards entirely, focused on codegen and business customers rather than benchmark games. Azure's ability to serve multiple frontier models side by side commoditizes the foundation layer. Without unique leverage (a consumer base like ChatGPT's, vertical strength like Anthropic's coding, or enterprise lock-in), frontier models become cloud infrastructure priced toward zero.
Elon's presentation was deliberately unpolished. Engineers presented slides with light-mode screenshots on dark backgrounds and unstyled demos. The culture signal mattered more than production values. One engineer's "It's a good model, sir" became the meme. Elon predicted Grok will discover new physics within two years, a claim that assumes a single superintelligence with PhD-level expertise across every domain, able to synthesize breakthroughs. He did not detail the mechanism.
Speed is the real story. xAI moved from founding to the frontier in under two years, shipping Grok 3 and now Grok 4; the version numbering itself digs at OpenAI's stalled GPT-5 cycle. Grok ships a new major version roughly every three months. The organization runs lean, with freewheeling technical innovation and little politics or constraint, retaining the kind of talent Meta pays $200 million a year to land. Whether benchmarks matter or adoption drives value, xAI has proven it can move fast.