ARC Prize launches V3 interactive benchmark and $10K agent contest to challenge frontier AI systems
Jul 18, 2025 with Mike Knoop
Key Points
- ARC Prize launches V3 as an interactive benchmark with game-based environments instead of static puzzles, shifting focus to test whether AI agents can learn rules from scratch and adapt efficiently.
- The $10,000 contest awards prizes based on scores on three private games rather than the public ones, preventing agents from overfitting and forcing genuine generalization across novel environments.
- ARC's design philosophy treats AI progress as idea-constrained rather than compute-constrained, arguing that frontier models failing at human-solvable tasks signals missing structural insights, not insufficient scale.
Summary
ARC Prize is launching ARC-AGI V3 as an interactive benchmark, replacing the static grid puzzles of V1 and V2 with dynamic games. A public preview includes three games and a $10,000 agent contest running for one month, with the full dataset targeted for early 2026.
The move to interactive games
V1, introduced in 2019, was designed to challenge deep learning as a paradigm. V2 escalated the difficulty by requiring longer reasoning chains. xAI scored 16% on V2 last week, which Mike describes as evidence of how quickly frontier labs can move once a direction is established, though he notes it is still early.
V3 abandons static puzzles entirely. The three public games launching today are intentionally dissimilar to each other, with only one resembling a top-down 2D agent game. The eventual full dataset will include hundreds of games designed to be entirely novel and diverse. The gap between what humans find easy and what AI finds hard is explicitly wider in V3 than in earlier versions.
Why this structure matters
The shift responds to the emergence of AI agent systems operating in open-ended environments. Static reasoning benchmarks remain useful, but V3 is built to stress-test whether agents can learn rules from scratch, adapt as new mechanics are introduced, and do so efficiently.
Efficiency is not secondary. V3 games impose hard limits on actions, preventing agents from brute-forcing solutions by replaying thousands of times, something humans do not need to do. The live demo of Locksmith illustrated this: a human player discovered the rules, adapted to new mechanics across levels, and completed several levels in under five minutes with a limited action budget. Current frontier agents cannot replicate that performance.
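To make the budget constraint concrete, here is a minimal sketch of an agent loop under a hard action limit. The `BudgetedGame` interface and its `step` method are hypothetical stand-ins, not the actual ARC-AGI-3 API; the point is only that exploration draws down the same budget as execution, so brute-force replay is priced out.

```python
import random
from dataclasses import dataclass

@dataclass
class BudgetedGame:
    """Toy stand-in for a V3-style game (hypothetical interface)."""
    action_budget: int
    actions_used: int = 0
    solved: bool = False

    def step(self, action: int) -> dict:
        if self.actions_used >= self.action_budget:
            raise RuntimeError("action budget exhausted")
        self.actions_used += 1
        # Placeholder dynamics: a real game would return a new frame
        # and mark `solved` when the agent clears the level.
        return {"frame": None, "solved": self.solved}

def run_episode(game: BudgetedGame, policy) -> bool:
    """Every action, exploratory or not, consumes the same budget."""
    obs = {"frame": None, "solved": False}
    while not obs["solved"] and game.actions_used < game.action_budget:
        obs = game.step(policy(obs))
    return obs["solved"]

# A random policy exhausts the budget without ever learning the rules,
# which is the failure mode the limit is designed to expose.
print(run_episode(BudgetedGame(action_budget=50), lambda obs: random.randrange(4)))
```

A human player in the Locksmith demo succeeds under exactly this kind of constraint because rule discovery is sample-efficient; an agent that needs thousands of replays simply runs out of actions.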
Contest design
The $10,000 contest is awarded for the top score on three private games, not the three public ones. Agents optimized for the public games will not transfer, so overfitting to the visible set is a dead end. The API for building and running agents against the public games launched alongside the preview.
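As a rough sketch of what that structure implies for entrants, the harness below runs one unmodified agent on each private game and averages the results. `load_private_game`, `StubGame`, and the game identifiers are hypothetical placeholders, not the published contest API; the structural point is that logic keyed to the three public games contributes nothing to the score.

```python
import random
from dataclasses import dataclass

@dataclass
class StubGame:
    """Hypothetical stand-in for a private game; real dynamics are unknown."""
    action_budget: int = 50
    actions_used: int = 0

    def step(self, action: int) -> dict:
        self.actions_used += 1
        return {"solved": False}  # placeholder dynamics

def load_private_game(game_id: str) -> StubGame:
    return StubGame()  # a real harness would load the hidden game here

def evaluate(agent_factory, game_ids):
    """Score one agent across unseen games, with fresh state per game."""
    scores = {}
    for game_id in game_ids:
        agent = agent_factory()            # no state carried between games
        game = load_private_game(game_id)
        obs = {"solved": False}
        while not obs["solved"] and game.actions_used < game.action_budget:
            obs = game.step(agent(obs))
        scores[game_id] = 1.0 if obs["solved"] else 0.0
    return scores

print(evaluate(lambda: (lambda obs: random.randrange(4)),
               ["private-1", "private-2", "private-3"]))
```

The fresh-agent-per-game loop is the operative design choice: generalization has to come from the agent's learning procedure, not from memorized game-specific behavior.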
The million-dollar prize for V1 remains unclaimed. Kaggle contest rules require hitting a high accuracy score within a fixed compute budget and open-sourcing the solution. On a human-efficiency-weighted basis, the best V1 scores are around 60% against a human benchmark of 85%.
The idea-constraint argument
Mike argues that ARC benchmarks expose an idea-constrained regime, not a compute-constrained one. Tasks solvable by average humans remain hard for frontier models, while those same models ace PhD-level benchmarks. This suggests the field is missing something structural, not just scale. That conclusion carries a practical implication: individual researchers and small teams with modest budgets can still move the frontier, which is the underlying rationale for the prize structure and its open-source requirement.
A majority of past ARC contest leaderboard entries, or close to one, came from outside the United States. Mike cites this as evidence that the relevant talent is globally distributed, not concentrated solely in well-funded labs.
On alignment concerns
Regarding AI psychosis and alignment concerns circulating that week, Mike advocates an empirical rather than precautionary approach. Fast-moving, visible harms trigger societal responses more effectively than slow-creeping ones. He opposed California's SB 1047 on similar grounds, arguing it was built on predictions about harms that ARC's own results suggested were premature, specifically the assumption that AGI would emerge from scaling pre-training alone.