LiveKit raises $100M at $1B valuation as voice AI and robotics infrastructure demand explodes
Jan 23, 2026 with Russ D'Sa
Key Points
- LiveKit raises $100 million at a $1 billion valuation, led by Index Ventures with Salesforce, Altimeter, and Redpoint, after an unplanned pivot into voice AI infrastructure sparked by powering OpenAI's ChatGPT voice mode.
- The company positions itself as the Next.js for voice AI and robotics, building a full stack from network infrastructure through deployment and observability to eliminate vendor stitching.
- Robotics emerges as the next major wave with roughly 80% architectural overlap with voice AI, though it requires local inference to handle intermittent field connectivity and represents a larger long-term market.
Summary
LiveKit has closed a $100 million round at a $1 billion valuation, led by Index Ventures with participation from Salesforce, Altimeter, and Redpoint. CEO Russ D'Sa frames the milestone as day zero, not a finish line, crediting the company's growth to an unplanned pivot into AI infrastructure after partnering with OpenAI on ChatGPT voice mode.
LiveKit started as video conferencing and live-streaming infrastructure. The OpenAI collaboration forced a rethink, and voice AI demand pulled the company into a far larger opportunity. Russ describes LiveKit as an accidental AI company that now intends to own the full development lifecycle for voice and multimodal applications.
The platform bet is the central use of the new capital. LiveKit is positioning itself as the Next.js equivalent for voice AI and robotics, arguing that the architectural requirements of voice-first and vision-enabled applications are fundamentally different from web applications. The company is building every layer of that stack, from network infrastructure through testing, deployment, and observability, so developers can go from zero to production without stitching together third-party vendors.
The second capital priority is developer relations: sample applications, workshops, and events designed to accelerate adoption. Despite the large raise, LiveKit reports healthy margins and says there is no near-term pressure to build proprietary data centers, though full vertical integration remains a stated long-term goal.
On latency, Russ identifies three primary leverage points for reducing response delay in voice interfaces. First, architecture choice: a single voice-to-voice model cuts latency versus a cascaded three-model pipeline where audio is converted to text, processed by an LLM, then converted back to speech. Second, turn detection, specifically how quickly the system determines a user has finished speaking. Third, GPU load balancing, ensuring a model instance is already primed and waiting rather than queuing requests sequentially.
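The first leverage point can be made concrete with a back-of-the-envelope latency budget. The sketch below compares the two architectures described above; all per-stage timings are hypothetical round numbers chosen only to illustrate why collapsing the cascade into one voice-to-voice model shortens the response path, not LiveKit measurements.

```python
# Illustrative latency budgets for the two voice-AI architectures.
# Stage timings are invented round numbers, not real benchmarks.

# Cascaded pipeline: audio -> text -> LLM -> audio, three model hops.
CASCADED_MS = {
    "turn_detection": 200,   # deciding the user has finished speaking
    "speech_to_text": 300,   # transcribing the utterance
    "llm_first_token": 400,  # LLM time-to-first-token
    "text_to_speech": 250,   # synthesizing the first audio chunk
}

# Single voice-to-voice model: one hop from input audio to output audio.
VOICE_TO_VOICE_MS = {
    "turn_detection": 200,
    "model_first_audio": 500,  # model emits speech directly
}

def total_latency(stages: dict[str, int]) -> int:
    """Sum per-stage latencies into one end-to-end response delay (ms)."""
    return sum(stages.values())

print("cascaded:", total_latency(CASCADED_MS), "ms")        # 1150 ms
print("voice-to-voice:", total_latency(VOICE_TO_VOICE_MS), "ms")  # 700 ms
```

Even with a generous single-model budget, removing the transcription and synthesis hops (and their serialization boundaries) is what makes the cascaded pipeline slower, which is the tradeoff Russ points to.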
Voice AI use case design splits into two distinct requirement profiles. B2C personal assistant applications demand low latency and high empathy, essentially human-like conversational realism. B2B deployments, such as hospital patient intake, prioritize reliability and task completion accuracy over emotional texture. Russ argues the tradeoffs are meaningfully different and should drive separate design decisions.
Robotics is identified as the next major wave, sharing roughly 80% architectural overlap with voice AI given that humanoid robots will be controlled through speech and vision rather than keyboards. The key divergence is connectivity: robots operating in the field may face intermittent network access, requiring local or local-network inference capabilities. Russ characterizes robotics as one wave behind voice AI in adoption maturity but expects it to ultimately be a larger market.
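The connectivity divergence amounts to a fallback decision at inference time. A minimal sketch of that pattern follows; the names and structure are illustrative assumptions, not a LiveKit API, since the source only describes the requirement, not an implementation.

```python
# Hypothetical sketch of the connectivity fallback described above:
# prefer a hosted cloud model when the robot has a network link, and
# fall back to a smaller on-device (or local-network) model when it
# does not. All names here are illustrative.

from dataclasses import dataclass

@dataclass
class InferenceResult:
    text: str
    backend: str  # "cloud" or "local"

def run_inference(prompt: str, network_up: bool) -> InferenceResult:
    if network_up:
        # Normally this would call a hosted frontier model over the network.
        return InferenceResult(text=f"cloud-answer:{prompt}", backend="cloud")
    # Degraded mode: a local model keeps the robot operable in the field.
    return InferenceResult(text=f"local-answer:{prompt}", backend="local")

print(run_inference("pick up the box", network_up=False).backend)  # local
```

The design point is that the fallback must be decided per request, because a robot in the field can move in and out of coverage mid-task.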
On Apple's Siri, Russ attributes current shortcomings primarily to reliability failures rather than model quality. The reported integration of Google's Gemini Live voice-to-voice model addresses realism, while Apple's licensing deal with Hume, a company specializing in real-time sentiment and emotion analysis, is flagged as a meaningful unlock for empathy. At Apple's global scale, voice assistant design must also solve for multilingual performance, accent fidelity, and culturally appropriate paralinguistic cues, constraints that don't apply to narrower enterprise deployments.