Mike Knoop: no single AI model dominates today — we need new ideas to reach AGI
Jun 19, 2025 with Mike Knoop
Key Points
- No reasoning model dominates: o3 leads on raw performance while Grok and Gemini 2.5 Thinking offer better cost efficiency, forcing labs to optimize along two dimensions instead of chasing single benchmarks.
- Startups including Mechanized Work, Morph, and Habitat are building RL environments to generate synthetic training data, shifting the labeling supply chain away from human annotation and toward synthetic data pipelines.
- AGI remains idea-constrained, not compute-constrained—ARC-AGI v2 is unsaturated and progress has slowed, signaling that current LLMs have exhausted existing knowledge and foundational research breakthroughs are the missing ingredient.
Summary
Mike Knoop, co-creator of the ARC benchmark, delivers a clear-eyed read on where the AI frontier actually stands: no single model dominates, AGI remains idea-constrained, and the data labeling industry is undergoing a structural shift that will disadvantage legacy players.
No Clear Winner at the Frontier
Every major lab has now shipped a reasoning model (OpenAI, Anthropic, Google, and others, with Meta notably absent), and comparing them requires two-dimensional evaluation: cost and efficiency must be weighed alongside raw accuracy. o3 (high) leads on pure performance, but Grok and Gemini 2.5 Thinking offer better cost-to-performance ratios for production deployments. A single benchmark number, Knoop argues, is marketing, not measurement.
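The two-dimensional comparison Knoop describes amounts to a Pareto frontier over accuracy and cost. A minimal sketch, where the model names echo those in the discussion but every number is an illustrative placeholder, not a real benchmark score:

```python
# Illustrative sketch of two-dimensional model evaluation (accuracy vs. cost).
# Model names follow the discussion; all numbers are placeholders, not real scores.

def pareto_frontier(models):
    """Return names of models not dominated on both axes
    (accuracy: higher is better; cost: lower is better)."""
    frontier = []
    for name, acc, cost in models:
        dominated = any(
            o_acc >= acc and o_cost <= cost and (o_acc > acc or o_cost < cost)
            for o_name, o_acc, o_cost in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

models = [
    ("o3-high", 0.90, 60.0),              # leads on raw accuracy, expensive
    ("grok", 0.82, 8.0),                  # cheaper, competitive
    ("gemini-2.5-thinking", 0.84, 10.0),  # middle of the frontier
    ("baseline", 0.70, 12.0),             # dominated: worse accuracy AND higher cost than grok
]

print(pareto_frontier(models))  # ['o3-high', 'grok', 'gemini-2.5-thinking']
```

No single point on this frontier "wins"; which model is best depends on where a deployment sits on the cost axis, which is the point of Knoop's critique of single-number benchmarks.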
Reasoning models also face a product-market fit problem. Despite the enthusiasm around DeepSeek's public launch, Knoop suspects reasoning systems have weaker consumer traction than standard language models. Zapier's AI usage, which is growing on an exponential curve, still runs predominantly on GPT-4-class or cheaper models, suggesting the current agent automation wave is driven by use-case diffusion, not model capability breakthroughs.
Domain Specialization Is Coming
Knoop expects domain-specific benchmark differentiation to emerge over the next 12 to 24 months as labs lean into reinforcement learning environments to generate synthetic chain-of-thought training data. The original o3 results already hint at this: gains in math and coding were far larger than gains in legal reasoning, countering the intuition that gains in symbolic, self-consistent domains would transfer cleanly to others. Labs are now running RL environments across a wide range of domains to close those gaps.
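The RL-environment pipeline described above can be sketched in miniature: an environment poses tasks with verifiable answers, the model's sampled reasoning traces are checked, and only passing traces are kept as synthetic training data. The `ArithmeticEnv` class and its interface here are hypothetical, purely for illustration; real environments cover far richer domains.

```python
# Minimal sketch of synthetic-data generation from a verifiable RL environment.
# ArithmeticEnv and its interface are hypothetical, invented for this example.
import random

class ArithmeticEnv:
    """Toy verifiable environment: addition problems with checkable answers."""
    def sample_task(self, rng):
        a, b = rng.randint(1, 99), rng.randint(1, 99)
        return f"What is {a} + {b}?", a + b

    def verify(self, answer, expected):
        return answer == expected

def generate_synthetic_data(env, model, n_tasks, seed=0):
    rng = random.Random(seed)
    kept = []
    for _ in range(n_tasks):
        prompt, expected = env.sample_task(rng)
        trace, answer = model(prompt)       # model returns (reasoning trace, final answer)
        if env.verify(answer, expected):    # the reward signal doubles as a data filter
            kept.append({"prompt": prompt, "trace": trace})
    return kept

# Stand-in "model" that parses the numbers out of the prompt and answers correctly.
def toy_model(prompt):
    nums = [int(t.strip("?")) for t in prompt.split() if t.strip("?").isdigit()]
    return f"add {nums[0]} and {nums[1]}", sum(nums)

data = generate_synthetic_data(ArithmeticEnv(), toy_model, n_tasks=5)
print(len(data))  # 5 verified traces kept as training examples
```

The key property is that correctness is machine-checkable, so the pipeline scales without human annotation, which is the structural shift in the labeling supply chain discussed below.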
Data Labeling Shifts to RL Environments
The macro move away from pre-training on labeled text toward RL-generated synthetic data is reshaping the training data supply chain. A cluster of startups founded in the last few months — including Mechanized Work, Morph, and Habitat — are building RL environments to sell synthetic data directly to frontier labs. Knoop expects this segment to grow significantly while traditional human-labeling volume shrinks. He views Scale AI's recent exit timing as well-judged, analogous to earlier inflection points when autonomous vehicle labeling demand peaked and then plateaued.
AGI Is Idea-Constrained, Not Compute-Constrained
The sharpest point in the conversation concerns ARC benchmark data. ARC-AGI v2 is completely unsaturated — no current system can solve it at meaningful rates, and progress on the ARC 2025 Kaggle contest has been slower this year than in 2024. Knoop's conclusion is direct: the bottleneck to AGI is not compute or data scale, it is the absence of new foundational ideas. Current LLMs, trained on effectively all digitized human knowledge, have produced marginal novel discoveries. AlphaEvolve is the strongest counterexample he cites, but its outputs remain at the margins of already-known problem spaces such as matrix multiplication optimization.
Knoop funded the ARC Prize in part to correct a damaging narrative that had taken hold among young researchers in early 2024 — that AGI was essentially solved and the rational career move was to build applications rather than pursue foundational research. He frames a diverse, open research environment modeled on the 2010–2020 AI era as the necessary condition for eventual AGI, pointing to the transformer's emergence from that period of open collaboration.
LLM Subscriptions and the xAI Revenue Question
On consumer monetization, Knoop predicts inference costs will fall far enough that LLM subscriptions become embedded subsidies within broader product experiences rather than standalone line items. Power users will pay; the base rate will not. On xAI, he sees the path to meaningful revenue running through Elon Musk's existing ecosystem (Tesla, SpaceX, robotics), where higher-latency, higher-intelligence inference is economically justified, rather than through direct consumer competition with OpenAI.
ChatGPT daily usage is reportedly approaching 30 minutes per day, closing the gap with Instagram, which Knoop attributes primarily to use-case diffusion and the app's role as a digital companion rather than to step-change model improvements.