
Cerebras runs OpenAI's gpt-oss-120B at 3,000 tokens/second, the fastest any OpenAI model has ever run in production

Aug 6, 2025

Key Points

  • Cerebras runs OpenAI's gpt-oss-120B at 3,000 tokens per second, the fastest production speed any OpenAI model has achieved, validating the wafer-scale chip maker's decade-long bet.
  • Reasoning models that take minutes on NVIDIA GPUs execute in seconds on Cerebras systems, making high-quality inference fast enough for tasks that currently settle for lower-quality instant results.
  • Cerebras positions itself as a specialized tier in multi-model inference stacks, enabling hybrid workflows where systems return quick preliminary answers while spawning deeper reasoning in the background.

Summary

Cerebras, the wafer-scale chip manufacturer, just achieved a production milestone that validates its core bet: running OpenAI's gpt-oss-120B reasoning model at 3,000 tokens per second, the fastest speed any OpenAI model has reached in production.

Cerebras's relationship with OpenAI traces back to 2016, when Sam Altman and other OpenAI founders became early investors in what was then a PowerPoint-stage startup. Large language models had not yet been invented. Now that reasoning models exist, that early bet has come full circle.

Cerebras manufactures chips at wafer scale, meaning an entire silicon wafer functions as a single processor rather than being diced into individual chips. The cost of failure is extreme: a defect in the wrong place can mean scrapping an entire $10+ million piece of silicon. For years, skeptics argued yields would never be commercially viable. CEO Andrew Feldman now argues those early bets are paying off.

The performance gap is stark. On Cerebras's third-generation Wafer Scale Engine (WSE-3), gpt-oss-120B runs at 3,000 tokens per second; a reasoning pass that completes in about a second there takes minutes on NVIDIA GPUs.
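
For a rough sense of what those throughput numbers mean in wall-clock terms, here is a back-of-envelope comparison. Only the 3,000 tokens-per-second figure comes from the announcement; the reasoning-chain length and the per-user GPU decode rate are illustrative assumptions.

```python
# Back-of-envelope latency comparison. Only CEREBRAS_TPS comes from the
# announcement; the other two figures are illustrative assumptions.
CHAIN_TOKENS = 3_000   # assumed length of one reasoning chain, in tokens
CEREBRAS_TPS = 3_000   # tokens/second on Cerebras, per the announcement
GPU_TPS = 30           # assumed per-user decode rate on a GPU cluster

print(f"Cerebras: {CHAIN_TOKENS / CEREBRAS_TPS:.1f} s")    # -> 1.0 s
print(f"GPU:      {CHAIN_TOKENS / GPU_TPS / 60:.1f} min")  # -> 1.7 min
```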

Cerebras is positioning itself as a specialized tier in a multi-model inference stack, not a replacement for standard inference clusters. Today's slowest reasoning models, such as OpenAI's o1-pro, take long enough that users typically avoid them for simple knowledge retrieval, opting instead for faster models like GPT-4o. If reasoning-quality answers could return in seconds rather than minutes, that calculation changes: reasoning becomes usable for tasks that currently settle for lower-quality instant results. The speed gain also enables hybrid workflows where a system returns a fast preliminary answer while spawning a deeper reasoning process in the background, matching how users naturally interact with human experts; a sketch of that pattern follows.
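
A minimal sketch of that hybrid pattern, using Python's asyncio. The fast_model and reasoning_model coroutines are hypothetical stand-ins for two inference endpoints, not a real Cerebras or OpenAI API, with sleeps in place of network calls.

```python
import asyncio

# Hypothetical stand-ins for two inference endpoints; the sleeps fake
# network latency. Neither function is a real Cerebras or OpenAI API.
async def fast_model(prompt: str) -> str:
    await asyncio.sleep(0.2)   # quick, lower-quality draft
    return f"[fast draft] {prompt}"

async def reasoning_model(prompt: str) -> str:
    await asyncio.sleep(2.0)   # slower, deeper reasoning pass
    return f"[reasoned answer] {prompt}"

async def answer(prompt: str, on_update) -> None:
    # Start the deep reasoning pass in the background immediately...
    deep = asyncio.create_task(reasoning_model(prompt))
    # ...but surface the fast preliminary answer as soon as it is ready.
    on_update(await fast_model(prompt))
    # When the background pass finishes, upgrade the draft with its result.
    on_update(await deep)

asyncio.run(answer("Why do wafer-scale chips cut inference latency?", print))
```

The user sees a usable draft almost immediately, and the deeper answer replaces it when it lands, which is the interaction pattern the article describes.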