Interview

ARC Prize launches ARC-AGI v3: a game-based agentic benchmark where AI scores less than 1% vs. humans at 100%

Mar 25, 2026 with Mike Knoop

Key Points

  • ARC Prize launches ARC-AGI v3, an interactive game-based benchmark where frontier models from OpenAI, Google, Anthropic, and xAI all score below 1% while humans achieve 100%.
  • v3 removes hand-crafted scaffolding that enabled prior AI wins in Dota and Go, testing whether models can independently explore unseen domains and acquire strategy without human researcher guidance.
  • ARC Prize announces $2 million prize pool for 2026 and plans annual releases through v5, positioning the benchmark as a perpetual measure of the largest remaining gap between human and AI capability.

Summary

The ARC Prize Foundation launched ARC-AGI v3, a benchmark fundamentally different from its predecessors. Where v1 and v2 presented static pattern-matching puzzles, v3 is an interactive, game-based environment with over 100 games and nearly 1,000 levels designed to test agentic intelligence: exploration, goal discovery, strategy acquisition, and planning. Frontier models from all four major labs scored under 1% while humans scored 100%: OpenAI's o3 led at 0.34%, Gemini 3.1 Pro scored 0.2–0.3%, Anthropic's latest landed in the same range, and Grok scored 0%.
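
The interview does not describe v3's client interface, but the evaluation shape it implies, an agent dropped into an unseen game with no instructions and scored on whether it wins, is easy to sketch. The following is a minimal, hypothetical Python sketch of that loop; ToyGame, evaluate, and the scoring rule are illustrative assumptions, not ARC Prize code.

```python
import random
from dataclasses import dataclass

@dataclass
class ToyGame:
    """Toy stand-in for one v3 game: the win condition is hidden,
    so an agent must discover it through interaction."""
    target: int = 7      # hidden win condition
    state: int = 0

    def act(self, action: int) -> tuple[int, bool]:
        """Apply an action; return (observation, solved)."""
        self.state = (self.state + action) % 10
        return self.state, self.state == self.target

def evaluate(agent, games, step_budget: int = 100) -> float:
    """Fraction of games solved within the step budget.
    No per-game scaffolding: the same agent faces every game cold."""
    solved = 0
    for game in games:
        obs = game.state
        for _ in range(step_budget):
            obs, done = game.act(agent(obs))
            if done:
                solved += 1
                break
    return solved / len(games)

# A random agent solves this toy game easily; real v3 games are
# not so forgiving, which is why frontier scores sit near zero.
score = evaluate(lambda obs: random.randint(0, 9),
                 [ToyGame() for _ in range(25)])
print(f"solved {score:.0%}")
```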

When labs beat Dota, Go, and Mario in the 2016–2019 era, researchers hand-crafted much of the solution, building custom search harnesses, feedback mechanisms, and domain-specific optimizations. ARC-AGI v3 removes that scaffolding: it tests whether AI can independently explore, discover rules, and adapt strategy without human researchers doing the hard cognitive work upfront.

Integrity measures

The public demonstration set contains only 25 games and is deliberately not labeled a training set. The private verification set, the 100+ games AI labs are tested against, differs by design in mechanics, difficulty, and the intelligence capabilities required. The split blocks a shortcut Knoop explicitly rejected: labs screen-recording the public games, extracting patterns into prompt context, and bootstrapping performance on the private tests.

A second integrity measure is a $10,000 cost cap per verification run, which prevents labs from drowning the benchmark in compute. OpenAI's December 2024 o3 preview reached its highest scores at a cost of roughly $2,000 per task, so the foundation set a practical ceiling: achieving AGI at $50 million per prompt to replicate one hour of human labor isn't economically useful. The $10,000 limit is meant to detect real progress, not brute-force scaling.
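
The interview does not say how the cap is enforced; a minimal sketch, assuming simple per-task cost accounting (all names here, such as MeteredRun and CostCapExceeded, are hypothetical), might look like this:

```python
class CostCapExceeded(RuntimeError):
    pass

class MeteredRun:
    """Hypothetical per-run cost accounting: accumulate inference
    spend across tasks and abort once the run crosses the cap."""
    def __init__(self, cap_usd: float = 10_000.0):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, task_cost_usd: float) -> None:
        self.spent_usd += task_cost_usd
        if self.spent_usd > self.cap_usd:
            raise CostCapExceeded(
                f"spent ${self.spent_usd:,.0f} against a "
                f"${self.cap_usd:,.0f} cap"
            )

run = MeteredRun()
for _ in range(5):
    run.charge(2_000.0)   # ~o3-preview's reported Dec 2024 per-task cost
# a sixth $2,000 task would raise CostCapExceeded: the cap rewards
# efficiency rather than brute-force compute scaling
```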

ARC's predictive power

v1 sat unsolved for five years until AI reasoning emerged, a critical innovation comparable to the transformer breakthrough. A year later, v2 saturated when agentic coding capabilities appeared in the November–December 2025 models. Knoop argues this gives ARC predictive power: it identifies capability gaps before the capabilities that close them show up as economically useful in production. v3 will likely signal when agents can operate effectively in open-ended, unseen domains, a prerequisite for moving beyond hand-crafted domain applications like coding.

Two halves of the agent problem

Knoop splits agentic AI into perception-to-action (can the model perceive state, apply a known strategy, execute a plan?) and exploration-to-planning (can it build a world model, acquire goals, generate strategy on its own?). Recent models have made progress on the former. Exploration and strategy generation remain a major bottleneck, even in domains like Pokémon where vast training text exists. This is where v3 will be most useful for researchers.
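One way to make Knoop's split concrete is as two interfaces in an agent loop. The sketch below is purely illustrative; the names and signatures are assumptions, not anything ARC Prize has published.

```python
from typing import Any, Protocol

class ExplorationToPlanning(Protocol):
    """The hard half: build a world model, acquire goals,
    and generate strategy on its own."""
    def update_world_model(self, observation: Any) -> None: ...
    def propose_strategy(self) -> Any: ...

class PerceptionToAction(Protocol):
    """The half recent models handle better: perceive state, apply
    a known strategy, execute the next step of the plan."""
    def next_action(self, observation: Any, strategy: Any) -> Any: ...

def agent_step(planner: ExplorationToPlanning,
               executor: PerceptionToAction,
               observation: Any) -> Any:
    planner.update_world_model(observation)       # exploration
    strategy = planner.propose_strategy()         # strategy acquisition
    return executor.next_action(observation, strategy)  # execution
```

On v3, it is the planner half that fails: models can execute a strategy once one exists, but producing one in an unseen game is the bottleneck the benchmark isolates.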

What comes next

ARC Prize is announcing a $2 million prize pool for 2026 covering both v2 and v3. The foundation has already started work on v4 and has plans for v5, intending to release annually over the next two years if frontier progress warrants it, though it is not committed to a fixed timeline. The mission is to keep identifying the largest remaining gaps between human and AI capability. When ARC runs out of gaps to test, the field can declare AGI. Until then, v3 is the current measure, and at less than 1% for frontier models, there is a long way to go.