Grok 4 launches at the AI frontier, posting record benchmark scores with equal pre- and post-training compute spend
Jul 10, 2025
Key Points
- xAI's Grok 4 is the first frontier model to spend equal compute on reinforcement learning and pre-training, scoring 44.4% on Humanity's Last Exam versus 26.9% for second place.
- A perfect 100% score on AIME 2025 signals data contamination rather than reasoning ability, undermining the benchmark's credibility as a measure of general intelligence.
- xAI shipped Grok 4 in under two years from founding, but frontier models without unique leverage like consumer adoption or vertical strength face pricing pressure toward cloud commodity rates.
Summary
xAI launched Grok 4 with equal spending on pre-training and reinforcement learning, a first in the industry. The model scores 44.4% on Humanity's Last Exam, 17.5 percentage points ahead of second place at 26.9%, and reaches 88% on GPQA, a benchmark of hard graduate-level science questions. Pricing is $3 per million input tokens and $15 per million output tokens, with a 256k context window.
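At those rates, per-request cost is simple arithmetic. A minimal sketch with illustrative token counts (the prompt and completion sizes below are assumptions, not measured values):

```python
# Back-of-envelope cost of one Grok 4 call at the published rates.
INPUT_RATE = 3.00 / 1_000_000    # dollars per input token  ($3 / 1M)
OUTPUT_RATE = 15.00 / 1_000_000  # dollars per output token ($15 / 1M)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Illustrative: a 10k-token prompt with a 2k-token answer.
print(f"${call_cost(10_000, 2_000):.2f}")  # $0.06
```

Note the 5x output premium: long reasoning traces, not long prompts, dominate the bill.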
The technical shift centers on reinforcement learning maturity. OpenAI's o-series proved that test-time scaling works by spending thousands of inference tokens on hard problems. Grok 4 executes that playbook at scale, using a mixture-of-experts architecture internally and running multiple parallel instances whose outputs are compared before the best answer is returned. The result is stacked layers of intelligence built from specialized sub-models. Taken to its conclusion, this becomes distributed querying: users check Grok, Claude, GPT, and Gemini in parallel and pick the best answer, as in the sketch below.
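A minimal asyncio sketch of that distributed-querying pattern; `query_model` is a hypothetical stand-in for each vendor's real SDK, and the length-based `score` is a placeholder for a proper verifier or reward model:

```python
import asyncio

async def query_model(model: str, prompt: str) -> str:
    # Hypothetical stand-in for a provider SDK call; every vendor's
    # actual client, auth, and endpoint names differ.
    await asyncio.sleep(0)  # placeholder for network latency
    return f"{model}: answer to {prompt!r}"

def score(answer: str) -> float:
    # Placeholder heuristic; in practice this would be a verifier
    # model, a reward model, or majority voting across samples.
    return float(len(answer))

async def best_of(models: list[str], prompt: str) -> str:
    # Fan the same prompt out to every model in parallel, then keep
    # the highest-scoring answer: the best-of-n pattern Grok 4
    # reportedly runs across its own parallel instances.
    answers = await asyncio.gather(*(query_model(m, prompt) for m in models))
    return max(answers, key=score)

# asyncio.run(best_of(["grok", "claude", "gpt", "gemini"], "hard question"))
```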
Benchmarks are losing signal. Data contamination threatens the math results: identical AIME 2025 questions appear on Quora, and similar ones on Math Stack Exchange. When benchmark problems circulate widely enough to enter training data, models memorize rather than reason, so Grok 4's perfect 100% on AIME reads as a red flag, not proof of general intelligence. Mike Knoop of ARC Prize notes that OpenAI's o-series progression on ARC matters more than Grok's win here: test-time adaptation is the frontier, not executing existing playbooks.
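A minimal sketch of one standard contamination test, verbatim n-gram overlap between a benchmark question and training documents; the 13-gram window and 0.5 threshold are illustrative assumptions, not any lab's published pipeline:

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Word-level n-grams; 13-gram windows are a common dedup size."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(question: str, corpus_docs: list[str],
                       threshold: float = 0.5) -> bool:
    """Flag a question if a large share of its n-grams appear verbatim
    in any training document; evidence of memorization, not reasoning."""
    q = ngrams(question)
    if not q:
        return False
    return any(len(q & ngrams(doc)) / len(q) >= threshold
               for doc in corpus_docs)
```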
Adoption is the real question. Polymarket traders now put xAI at 48% and Google at 45% in the market for best AI model by end of July, but benchmarks don't drive revenue. Anthropic stays off the leaderboards entirely, focused on codegen and business customers rather than benchmark games. Azure's ability to serve multiple frontier models side by side commoditizes the foundation layer. Without unique leverage (a consumer base like ChatGPT's, vertical strength like Anthropic's coding, or enterprise lock-in), frontier models become cloud infrastructure priced toward zero.
Elon's presentation was deliberately unpolished. Engineers presented slides with light-mode screenshots on dark backgrounds and unstyled demos. The culture signal mattered more than production values. One engineer's "It's a good model, sir" became the meme. Elon predicted Grok will discover new physics within two years, a claim that assumes a single superintelligence with PhD-level expertise across every domain, able to synthesize breakthroughs. He did not detail the mechanism.
Speed is the real story. xAI moved from founding to the frontier in under two years, shipping Grok 3 and now Grok 4; the version numbering itself digs at OpenAI's stalled GPT-5 cycle. Grok ships a new major version roughly every three months. The organization runs lean, with freewheeling technical innovation and little politics or constraint, retaining the kind of talent Meta pays $200 million a year to land. Whether benchmarks matter or adoption drives value, xAI has proven it can move fast.