Why Nvidia bought Groq: SRAM architectures, inference disaggregation, and the race for ultra-low latency
Jan 5, 2026
Key Points
- Nvidia closed its $20 billion Groq acquisition in less than two weeks with no competing bidders, signaling conviction that Groq's SRAM architecture fills a gap in its inference chip portfolio.
- Decode latency has become the bottleneck in AI inference, and developers now pay 10x to 100x premiums for meaningfully faster token generation on tasks like code generation and research.
- Groq's fab partnerships with Samsung and GlobalFoundries reduce Nvidia's foundry concentration risk while securing an SRAM-optimized design that complements its GPU-centric strategy across different workload requirements.
Summary
Jensen Huang closed the Groq acquisition in less than two weeks with no competing bidders, wiring $20 billion and pushing to complete the deal before year-end. The speed reflects conviction about what Groq's architecture adds to Nvidia's chip lineup.
Groq's founder Jonathan Ross designed TPUs at Google, giving him deep expertise in task-specific chip design. That pedigree shaped the bull case early on, even when the startup's rollout seemed uncertain and expensive. The real strategic logic sits in inference disaggregation.
Decode bottleneck
Inference splits into two phases: prefill processes the full prompt in one pass, and decode streams output tokens one at a time. Decode is latency-bound and bandwidth-bound because each new token requires rereading the model weights and the growing KV cache, so memory bandwidth rather than raw compute sets the token rate, exactly where SRAM architectures excel. Groq's design trades memory capacity for extremely high memory bandwidth, making it well suited to ultra-low-latency agentic reasoning workloads. Combined with Nvidia's existing Rubin lineup (CPX for large context windows, standard Rubin for training and batched inference), this creates a portfolio where Nvidia can mix and match chips to optimize the performance-versus-cost tradeoff for each workload.
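A minimal sketch of the disaggregation idea, in Python: route the prefill phase to hardware optimized for long prompts and the decode phase to hardware optimized for memory bandwidth and latency. The pool names, routing rule, and token counts are illustrative assumptions, not Nvidia's scheduler or product mapping.

```python
from dataclasses import dataclass

@dataclass
class Pool:
    name: str           # illustrative label, not a real SKU mapping
    optimized_for: str  # which bottleneck this hardware targets

# Assumed pools: prefill is compute-/context-bound (whole prompt in one pass),
# decode is bandwidth-/latency-bound (every new token rereads weights + KV cache).
POOLS = {
    "prefill": Pool("long-context-gpu", "prompt processing: compute and context length"),
    "decode":  Pool("sram-asic",        "token generation: bandwidth and latency"),
}

def run_request(prompt_tokens: int, max_new_tokens: int) -> None:
    """Toy disaggregated pipeline: prefill on one pool, decode on another."""
    prefill = POOLS["prefill"]
    print(f"prefill: {prompt_tokens} prompt tokens on {prefill.name}")

    # After prefill, the KV cache would be handed off to the decode pool.
    decode = POOLS["decode"]
    print(f"decode: streaming up to {max_new_tokens} tokens on {decode.name}")
    for _ in range(max_new_tokens):
        # Each step reads all weights plus the KV cache once -> bandwidth dominates.
        pass

run_request(prompt_tokens=32_000, max_new_tokens=512)
```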
SRAM-based chips have historically delivered higher token-per-second rates than GPUs, TPUs, or other ASICs, but at much higher cost per token: on-chip SRAM is small, so models are spread across many chips and batch sizes stay low. The question was whether users would actually pay the premium for speed. Cerebras's and Groq's recent results show developers and enterprises are willing to pay significantly more, potentially 10x to 100x, for meaningfully faster inference, especially for economically valuable work like code generation and research. That willingness shifts the Pareto frontier, opening space for specialized ASICs optimized along different performance dimensions.
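A back-of-the-envelope illustration of that tradeoff, using made-up placeholder figures (hourly system cost, per-user throughput, and concurrency are assumptions, not measured numbers for any vendor): large batches amortize a GPU system across many users, while an SRAM system serves each user far faster but produces fewer total tokens per dollar.

```python
def cost_per_million_tokens(system_hour_usd: float,
                            tokens_per_sec_per_user: float,
                            concurrent_users: int) -> float:
    """Hourly system cost divided by total tokens produced per hour, scaled to 1M tokens."""
    tokens_per_hour = tokens_per_sec_per_user * concurrent_users * 3600
    return system_hour_usd / tokens_per_hour * 1_000_000

# Placeholder numbers chosen only to show the shape of the tradeoff.
gpu_batched = cost_per_million_tokens(system_hour_usd=4.0,
                                      tokens_per_sec_per_user=30,
                                      concurrent_users=64)
sram_fast = cost_per_million_tokens(system_hour_usd=40.0,  # model sharded across many chips
                                    tokens_per_sec_per_user=300,
                                    concurrent_users=8)

print(f"batched GPU serving:      ~${gpu_batched:.2f} per 1M tokens at ~30 tok/s per user")
print(f"SRAM low-latency serving: ~${sram_fast:.2f} per 1M tokens at ~300 tok/s per user")
```

Under these assumed inputs the SRAM path costs roughly 8x more per token while delivering about 10x the per-user speed, which is exactly the premium the article argues buyers of high-value work are now willing to pay.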
Groq does not fab at TSMC; it uses Samsung and GlobalFoundries. For Nvidia, that diversification reduces foundry concentration risk while securing access to an SRAM-optimized design that complements its existing GPU-centric strategy. The acquisition also returned roughly $4 billion to Chamath Palihapitiya's Social Capital, making it one of venture's largest exits.
Huang has the capital and is deploying it across labs and cloud infrastructure, while keeping Nvidia a pure-play chip company rather than an operator of competing cloud services or data center assets. That focus makes the Groq acquisition feel like a natural extension of existing strength: the bet is that inference fragments into workload-specific architectures, and Groq gives Nvidia the SRAM piece of that puzzle.