Standard Intelligence trains computer-use models on 30fps screen capture, demos self-driving with 50 minutes of data
Feb 24, 2026 with Devansh Pandey
Key Points
- Standard Intelligence trains computer-use models on raw 30fps screen video rather than screenshots, sidestepping the limitation that large language models lack GUI interaction training.
- A proof-of-concept model trained on less than 50 minutes of data successfully drove a car in joystick mode, suggesting the video-based approach transfers to robotics tasks.
- The company plans near-term commercialization through CAD automation and a general "tab" product for automated work spans, positioning the action model as a standalone capability outside LLM tool calls.
Summary
Standard Intelligence trains general-purpose computer-use models entirely on 30fps screen capture video rather than screenshots or text-based reasoning chains. The company captures data two ways: employees and contractors running a screen-recording app that logs keystrokes and mouse movements, plus a much larger unlabeled dataset of publicly available computer-use videos from the internet. Standard Intelligence then trains a labeling model on the contractor data to annotate the larger corpus, creating a general model meant to handle any digital task.
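The two-stage pipeline described above, training a labeler on contractor recordings and then pseudo-labeling the larger scraped corpus, can be sketched roughly as follows. This is an illustrative toy, not Standard Intelligence's actual code: frames are stand-in feature vectors, and a 1-nearest-neighbor classifier stands in for the real labeling model.

```python
# Hypothetical sketch of the two-stage labeling pipeline:
# 1) "train" a labeler on contractor data (frames paired with logged
#    keyboard/mouse actions), 2) use it to pseudo-label a larger
#    unlabeled corpus of screen-capture video.

def train_labeler(labeled):
    """'Train' by memorizing (frame_features, action) pairs."""
    return list(labeled)

def predict_action(memory, frame):
    """Label a frame with the action of its nearest labeled frame."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(memory, key=lambda ex: dist(ex[0], frame))[1]

def pseudo_label(memory, unlabeled_frames):
    """Annotate the unlabeled corpus with predicted actions."""
    return [(f, predict_action(memory, f)) for f in unlabeled_frames]

# Contractor data: frame features paired with logged input events.
labeled = [((0.0, 0.0), "mouse_move"), ((1.0, 1.0), "key_press:tab")]
memory = train_labeler(labeled)

# Internet-scraped video frames with no action logs attached.
corpus = pseudo_label(memory, [(0.1, 0.1), (0.9, 0.8)])
```

The key property is that the expensive labeled set stays small while the cheap unlabeled corpus supplies scale, a standard self-training setup.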
Large language models struggle with GUI automation because they were never trained on screen interaction. Graphical interfaces are designed for human perception, not for text, and many workflows are far more native to video. ML engineering work, for example, which involves analyzing graphs and loss curves, would be cumbersome to describe in text.

Standard Intelligence's self-driving demo suggests the technique transfers beyond software. An engineer named Neil at the company had access to a Comma2 car with joystick mode, where arrow keys control steering. The team tested whether their general computer-use model could navigate it. With less than 50 minutes of training data, the model successfully drove around South Park in San Francisco. Pandey is cautious about the result: it is not a general self-driving system and should not be trusted unsupervised. Still, the fact that an action model trained on diverse computer tasks worked on real vehicle control with minimal data hints at broader transferability to robotics.
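What makes the demo possible is that joystick mode collapses driving into the model's native action space: ordinary key presses. A minimal sketch of that mapping, with an illustrative step size and function names that are assumptions rather than comma's actual API:

```python
# Hypothetical sketch of arrow-key steering in "joystick mode":
# the computer-use model emits keyboard actions, and each press
# nudges the steering angle. Step size is an illustrative guess.

STEP_DEG = 2.0  # assumed steering change per key press

def apply_key(steering_deg, key):
    """Translate one arrow-key press into a new steering angle."""
    if key == "left":
        return steering_deg - STEP_DEG
    if key == "right":
        return steering_deg + STEP_DEG
    return steering_deg  # other keys leave steering unchanged

def drive(key_sequence, steering_deg=0.0):
    """Replay a sequence of model-emitted key presses."""
    for key in key_sequence:
        steering_deg = apply_key(steering_deg, key)
    return steering_deg

# e.g. the model nudges left twice, then corrects right once
final = drive(["left", "left", "right"])
```

Because the interface is just key events, no robotics-specific output head is needed; the same model that edits CAD files can, in principle, steer.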
Pandey positions Standard Intelligence along two near-term paths and one longer-term path. First, CAD automation: mechanical engineers could press a single key, much as developers press tab in Cursor, to have the model handle repetitive tasks such as gear extrusion. The demo showcased this capability. Second, a general tab model released for open use, essentially extending Cursor's single-edit completion to 5, 10, or 60 seconds of automated work. Third, longer-term, agents that take natural language prompts and execute multi-step work autonomously.
Video training data captures error correction in a way text does not. When humans perform tasks, they make mistakes and visibly fix them on screen. Text corpora on the internet lack that process visibility, so a model trained on video acquires a native prior for self-correction that text-trained models miss. It can try something, detect failure, and iteratively fix it until the task is solved.
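The try/detect/fix behavior described above amounts to a simple control loop. A hedged sketch, where `act`, `check_success`, and `propose_fix` are hypothetical stand-ins for the model's learned behavior, not a real API:

```python
# Sketch of the self-correction loop a video prior might teach:
# attempt an action, inspect the resulting screen state, and retry
# with a correction until success or an attempt budget runs out.

def solve(task, act, check_success, propose_fix, max_attempts=5):
    """Iteratively act and self-correct, as a human does on screen."""
    state = act(task)
    for _ in range(max_attempts):
        if check_success(state):
            return state
        task = propose_fix(task, state)  # adjust based on the failure
        state = act(task)
    return state

# Toy demo: the "task" is a number that must reach 3 to succeed.
result = solve(
    task=0,
    act=lambda t: t,
    check_success=lambda s: s >= 3,
    propose_fix=lambda t, s: t + 1,
)
```

The point of the video prior is that this loop is learned implicitly from watching humans recover from mistakes, rather than bolted on as an explicit retry wrapper.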
Standard Intelligence does not yet embed reasoning from an LLM foundation model. The goal is not to make computer use a tool call in Claude or similar systems. It is to scale the action model itself as a standalone capability. The company leaves open the possibility of using text training or LLM initialization to improve reasoning, but that is not the primary product strategy.