Amit Jain on Luma AI's multimodal models serving tens of millions and transforming Hollywood and advertising
Jul 2, 2025 with Amit Jain
Key Points
- Luma AI's unified multimodal architecture, which learns across audio, video, image, and language in a single latent space, outperforms vision-language models bolted onto LLMs and unlocks training data that text-only approaches can't access.
- Video production costs are collapsing from $100,000–$1 million per minute to $10–$100 per minute, putting Hollywood's traditional production model on an extinction path while favoring studios that own IP over those that handle production.
- Advertising agencies lack reliable AI tools for visual iteration at scale; Luma sees the opening as similar to algorithmic text ad optimization, with brands that deploy daily AI-generated variants outcompeting those stuck in slow creative cycles.
Summary
Amit Jain, co-founder and CEO of Luma AI, is building what he describes as the next layer of intelligence after large language models — multimodal models that learn from audio, video, image, and language in a unified latent space rather than bolting modalities onto an LLM backbone. The company has tens of millions of users and works with Hollywood studios, advertising agencies, and individual creators.
The architecture bet
Most AI labs extend LLMs by tokenizing images and video — Jain argues this produces subpar outputs and that even older convolutional networks outperform vision-language models on basic visual tasks. Luma's approach is a joint latent space where all modalities share a single representation, closer to how the brain integrates signals. The practical payoff, he argues, is twofold: it unlocks the vast supply of image, audio, and video data that text-only training can't touch, and it enables genuine visual reasoning rather than prompt-to-pixel translation.
On GPT-4o's image quality, Jain says OpenAI is still running a system rather than a unified model — an autoregressive pass followed by a diffusion pass — which explains both its strengths and its layout failures. He says Luma's forthcoming Photon 2 model is making meaningful progress on text rendering within images.
Hollywood's cost collapse
AI is compressing video production costs from roughly $100,000–$1 million per minute down to $10–$100 per minute. Jain is direct about the implication: Hollywood's traditional production model is on a path to extinction. The limiting factor isn't a capability gap; features the models have lacked until recently, such as 4K output and precise camera control, Luma is already shipping. The company recently launched a "modify video" feature that accepts a live camera feed as a prompt, enabling exact control over pose and camera movement.
His read on who survives the transition favors the financial and IP side of studios over the production side; the production economics, in his view, are too far gone. The analogy he reaches for is Cocomelon, a YouTube channel that grew from nothing into an IP larger than many Disney properties by iterating fast and releasing continuously.
The more optimistic scenario is that lower production costs unlock more content overall. Average video consumption is already 3.5 hours per phone per day, and Jain's internal success metric is generating 6 billion hours of video per day — one hour per person on Earth. No human workforce can produce at that scale, and current compute capacity can't support it either, but that's the directional target.
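The figures quoted above can be sanity-checked with back-of-envelope arithmetic. A minimal sketch using only the numbers from the conversation (the cost ranges and the 6-billion-hours target; the resulting totals are derived, not stated by Jain):

```python
# Cost compression: traditional vs. AI-generated video, per minute (USD).
traditional_low, traditional_high = 100_000, 1_000_000
ai_low, ai_high = 10, 100

# Most conservative and most aggressive reduction factors.
reduction_low = traditional_low / ai_high    # 1,000x
reduction_high = traditional_high / ai_low   # 100,000x
print(f"Cost reduction: {reduction_low:,.0f}x to {reduction_high:,.0f}x")

# Jain's directional target: 6 billion hours of generated video per day.
target_hours_per_day = 6_000_000_000
minutes_per_day = target_hours_per_day * 60

# Even at the cheapest quoted AI rate, that daily volume is enormous,
# which illustrates why current compute capacity can't support it yet.
daily_cost_at_floor = minutes_per_day * ai_low
print(f"Daily generation at $10/min: ${daily_cost_at_floor:,.0f}")
```

Even at the bottom of the quoted AI cost range, the target works out to trillions of dollars of generation per day at today's prices, which is consistent with Jain's caveat that compute, not demand, is the binding constraint.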
Advertising
At Cannes Lions, Jain met with partners including Coca-Cola. The immediate bottleneck for brands isn't creative ambition — it's iteration speed. Brands running 100 ads across different markets need outputs that are on-brand, visually consistent, and market-specific. Current image and video models can't do that reliably; VLMs used to critique the outputs perform poorly. The gap requires a new class of models that understand visual instructions and brand coherence, not just text-to-image translation.
Jain draws a comparison to text advertising, where algorithmic creative iteration has been standard for years and produces strong results. Visual advertising hasn't crossed that threshold yet, but the commercial pressure is obvious — brands that can replace slow creative cycles with daily AI-generated variants will outgrow those that can't.
On FTC rules around AI-generated product imagery, his position is that the regulation should target misrepresentation, not the tool. A burger ad that misleads consumers is the problem regardless of whether it was shot on film or generated by a model. He expects brands to lobby for exactly that framing as performance data from AI creative becomes too compelling to ignore.
Cannes read
Jain describes Cannes as an advertising conference that became an AI conference where almost nobody understood what AI actually is. Agencies were pitching "AI agents" that turned out to be prompt boxes with model selectors — he calls that SaaS, not agents. He's blunt that building real agents for visual and creative work requires new model architectures, and that Gemini, currently the best vision-language model available, scores only about 3% on Luma's internal benchmark for visual iteration tasks.
His overall read: by next year, the fortunes of agencies employing hundreds of thousands of artists will look very different. The question isn't whether disruption is coming — it's whether the people running those agencies understand the magnitude of what's in front of them.