Commentary

Timeline: AI evals in crisis, Grok voice mode, and a founder making $30K in 24 hours from one viral tweet

Mar 3, 2025

Key Points

Foundation model evaluation frameworks like MMLU and Chatbot Arena are breaking down as labs overfit results through prompt mining and private eval bombardment, leaving researchers unable to reliably rank AI capabilities.
A creator built a viral-tweet prediction tool and earned $30,000 in its first 24 hours, an annualized run rate of roughly $11 million, but Emmett Shear flags the risk that algorithm-aware optimization converges on gaming rather than genuine engagement.
General Catalyst is considering an IPO that would make it the first major U.S. VC firm to trade publicly, shifting partner compensation from carry toward stock options and raising new alignment questions between managers and public shareholders.

Summary

The segment covers three tech stories: an evaluation crisis in foundation models, a creator who monetized a tweet-prediction tool for $30,000 in 24 hours, and a venture capital firm preparing for public markets.

Garage mahals in Minnesota

Brett Bailey, a Porsche and Ferrari owner in a Minneapolis suburb, built a nearly 2,700-square-foot "car barn" with a tequila bar, heated floors, a lift, and a taxidermy Kodiak bear. The trend is booming across Minnesota, especially in lakefront communities like Cross Lake, where space constraints make storage buildings valuable. The city council there has twice imposed a moratorium on new construction, and residents have nicknamed the area "Tin City" for its metal buildings. Supporters argue the storage boom expands the local tax base, though aesthetic complaints persist. California's restrictive zoning makes similar projects impossible there.

Foundation model evaluation in crisis

Andrej Karpathy identifies a genuine methodological problem. Several prominent evaluation frameworks (MMLU, SWE-Bench, Chatbot Arena) have either run their course or become targets for gaming. Labs now overfit Chatbot Arena results through prompt mining, private eval bombardment, and using ranking data as training supervision. Karpathy considers an ensemble of private evaluations a possible path forward but admits uncertainty: "my reaction is I don't really know how good these models are right now." The real benchmark is application-specific performance and consumer adoption: Claude wins on coding, and ChatGPT dominates consumer apps. The metrics that matter are App Store rankings and whether users actually switch from OpenAI; clearing a 50% improvement threshold matters more than a 10% gain on a math benchmark.

Viral creator monetization

Eddie Zo built an algorithm that predicts whether a tweet will go viral before it is posted. He launched it and sold access through Google Meet, generating $30,000 in revenue within 24 hours. Annualized, that is a run rate of roughly $11 million. The concept addresses a real creator pain: posting with confidence and getting minimal engagement, or the reverse. Big creators like MrBeast built similar tools for YouTube thumbnails and titles years ago. Emmett Shear from Softmax flagged "wireheading" as a concern: if creators optimize for the predictor rather than authentic engagement or real-world outcomes, the tool converges on algorithmic gaming instead of genuine human value. Algorithm-aware creation is already ubiquitous, and TikTok explicitly surfaces content aligned with platform preferences. The tension between prediction tools and authentic engagement remains unresolved.
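As a sanity check on the run-rate figure, annualizing simply scales the one-day revenue to 365 days. The sketch below assumes launch-day demand holds for a full year, which is a strong assumption for a viral launch; the function name `annualized_run_rate` is illustrative and not from the segment.

```python
def annualized_run_rate(revenue: float, period_days: float = 1.0) -> float:
    """Extrapolate revenue earned over `period_days` to a 365-day year."""
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    return revenue * 365 / period_days


# $30,000 in the first 24 hours, as reported in the segment.
launch_day_revenue = 30_000
print(f"${annualized_run_rate(launch_day_revenue):,.0f}")  # prints $10,950,000
```

The same one-day figure sustained for only a month would be about $900,000, which is why a launch-day run rate is best read as an upper bound.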

Venture capital going public

General Catalyst is considering an IPO, which would make it the first major U.S. VC firm to trade publicly. The firm started with a $73 million fund that returned only 0.7x, but it kept going. Returns pivoted sharply when Hemant Taneja led the Stripe Series B. General Catalyst now manages tens of billions across health, private equity, debt, and wealth divisions. Going public lets the firm access more capital and put its financials on display rather than rely on press coverage. Thrive Capital offers a precedent: the firm sold a 3.3% stake in January 2023 to Bob Iger and others for $175 million. That arrangement gave Iger economic alignment with Thrive's returns without him working as a traditional partner. Once General Catalyst is public, carry (the partnership's upside) shifts to stock options and equity in the management company itself. That creates new alignment questions between public shareholders and fund managers.
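The Thrive stake sale also implies a price for the management company as a whole. A minimal sketch of that arithmetic, assuming the $175 million bought exactly 3.3% with no other deal terms; the helper `implied_valuation` is hypothetical and used only for illustration.

```python
def implied_valuation(price_paid: float, stake_fraction: float) -> float:
    """Valuation implied by paying `price_paid` for `stake_fraction` of a firm."""
    if not 0 < stake_fraction <= 1:
        raise ValueError("stake_fraction must be in (0, 1]")
    return price_paid / stake_fraction


# $175 million for a 3.3% stake, as reported in the segment.
valuation = implied_valuation(175_000_000, 0.033)
print(f"${valuation / 1e9:.1f} billion")  # prints $5.3 billion
```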

General Catalyst's largest single return came from Livongo, a chronic care platform sold to Teladoc for $18.5 billion in 2020. Teladoc later wrote down $13.7 billion tied to that acquisition in 2022, suggesting the deal was severely overpriced. Square's $29 billion acquisition of Afterpay followed a similar pattern, with the acquired business losing most of its value within two years.
