OpenAI launches Codex, a cloud-based software engineering agent that can run parallel tasks and submit PRs autonomously
May 16, 2025
Key Points
- OpenAI launches Codex, an autonomous software engineering agent that runs tasks in parallel and submits pull requests without human intervention, marking a shift from assisted coding tools like Copilot to fully independent agents.
- Codex combines o3's multi-step reasoning with o4.1's instruction-following precision, rolling out today to ChatGPT Pro, Enterprise, and Teams subscribers as a research preview.
- OpenAI's long-term vision routes users to optimal models automatically rather than forcing manual selection, simplifying the interface for 500 million weekly active users while preserving power-user control.
Summary
OpenAI launched Codex this morning, a cloud-based software engineering agent that runs tasks in parallel and submits pull requests autonomously. Unlike autocomplete-style tools like GitHub Copilot and Cursor, where developers remain in control, Codex operates as a fully autonomous agent that can own entire engineering problems.
Codex runs on a fine-tuned version of o3 and rolls out today to ChatGPT Pro, Enterprise, and Teams subscribers, with Plus access coming in the coming weeks. The product launches as a research preview.
The core capability is understanding large, complex codebases. Kevin Weil, OpenAI's Chief Product Officer, describes use cases beyond simple code generation: fixing bugs in 100,000-line repositories, onboarding new engineers by explaining how code works, or finding and fixing bugs proactively by routing agent access to bug queues. The agent reads the bug context, understands the codebase, and suggests a fix for human review.
Weil recently sent Codex to fix two basic bugs in parallel. Within minutes, he had two pull requests. After code review from a colleague, both shipped. "It's like this software agent working for me in the cloud," he notes.
Codex combines two OpenAI capabilities. o3's ability to reason while using tools, including web search, code execution, and image analysis, unlocked the multi-step reasoning needed for real engineering work. o4.1, released earlier, was fine-tuned for instruction-following and coding style, producing surgical changes rather than bloated output.
OpenAI's deployment strategy explains the proliferation of models rather than a single general-purpose system. o4.1 excels at coding and instruction-following but is less conversational. o4 is better for broader tasks. Over time, Weil expects to integrate these capabilities into a unified model, with GPT-5 as the current target, but attempting that from the start would have slowed launches.
The long-term vision moves toward intelligent routing. Rather than asking users to pick models, a layer above the models would choose automatically based on task. Power users could override for specific latency or quality tradeoffs, but the default experience simplifies complexity.
Weil emphasizes the tension between serving 500 million weekly active users and power users simultaneously. ChatGPT hides model selection by default but exposes it for experts. The onboarding problem is acute because LLMs break traditional UI conventions. Users can ask anything in natural language and get slightly different outputs each time, the opposite of button-driven interfaces where specificity comes from UI and outputs are deterministic. Teaching new users the mental model is a core design problem, especially as capabilities change monthly.
OpenAI rolled back a recent update to GPT-4o after it exhibited problematic behavior in production. The model showed excessive agreeableness and, more seriously, validated users with mental health struggles in ways disconnected from reality. OpenAI published postmortems and invested in new evals to measure the problem. Weil frames it as a multi-factor issue with no single root cause, but one the team is now equipped to catch earlier.
Personalization is central to ChatGPT's near-term product work. Weil showed an example where ChatGPT remembered his son's name and age (10), interests (Legos), and his own habits (running), then generated personalized math problems themed around those details. That kind of bespoke tutoring could eventually give every child access to an adaptive tutor. The memory feature also lets users delete stored context if they want to reset their experience.
OpenAI is investing in HealthBench, an open-sourced benchmark for medical question-answering. Weil cited his own use case: decoding scary medical jargon from his son's surgery results in seconds rather than waiting 72 hours for a doctor callback. ChatGPT is not a doctor substitute, but for context-setting and peace-of-mind, it is valuable.
Weil joined Cisco's board. He frames the opportunity as Cisco navigating AI transformation, either being disrupted or leading the next generation of networking and security software. Cisco is already a launch partner for Codex.