Commentary

Do LLMs have music taste? Tyler's bracket experiment reveals a strange bug in reasoning models

Aug 18, 2025

Key Points

  • GPT-5's reasoning models systematically favor artists with numbers and dollar signs in their names, a bug that traces to alphabetical sorting in the reasoning chain and could artificially amplify unconventional artist names while penalizing traditional ones.
  • Reasoning models from OpenAI, xAI, and DeepSeek all exhibit the same alphabetical-sorting artifact, while Gemini and Llama 3 do not, suggesting a shared training approach or copied oversight among vendors.
  • No country artists appeared in any model's top bracket despite the genre's popularity, indicating RLHF training systematically steers models away from country music by design or cultural bias in training data.

Summary

Tyler ran a bracket experiment on the 5,000 most-listened-to artists to see whether large language models develop musical taste. He randomized the list and had each model pick between two artists across 13 rounds of pairwise elimination.
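The bracket mechanics can be sketched in a few lines. This is a minimal reconstruction, not the original harness: `pick_winner` stands in for the model call, and the bye handling is an assumption about how an odd-sized field would be paired.

```python
import random

def run_bracket(artists, pick_winner):
    """Single-elimination bracket: each round pairs the survivors and
    asks pick_winner(a, b) to choose one; an odd entrant gets a bye."""
    field = list(artists)
    random.shuffle(field)          # randomize seeding, as in the experiment
    matches = rounds = 0
    while len(field) > 1:
        rounds += 1
        survivors = []
        for i in range(0, len(field) - 1, 2):
            survivors.append(pick_winner(field[i], field[i + 1]))
            matches += 1
        if len(field) % 2:         # unpaired artist advances on a bye
            survivors.append(field[-1])
        field = survivors
    return field[0], matches, rounds

# With 5,000 entrants: every match eliminates exactly one artist, so a
# champion emerges after 4,999 matches, spread over ceil(log2(5000)) = 13
# rounds -- which matches the ~5,000 API calls per bracket reported below.
winner, matches, rounds = run_bracket(range(5000), lambda a, b: a)
print(matches, rounds)  # 4999 13
```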

Standard models produce what reads as refined but generic taste. Claude 3.5 Sonnet gravitates toward Miles Davis, John Coltrane, Stevie Wonder, and Bach—jazz-heavy, coffee-shop vibes, undifferentiated. GPT-4.1 lands on similarly safe, popular choices.

GPT-5 surfaces something strange. Its picks include $uicideboy$, +44, 21 Savage, 2 Chainz, A Flock of Seagulls, and 10,000 Maniacs. When the top song from each artist was compiled into a playlist, the result felt like discovering forgotten bangers and credible artists, not a generic list. The output reads as taste.

But there is a bug. GPT-5's, Grok-4's, and DeepSeek's reasoning models all show the same artifact: nearly every winning artist name contains numbers or symbols. $uicideboy$, 100 Gecs, +44, 2 Chainz, 21 Savage, 311, 6.9. The pattern is too consistent to be coincidence.

When reasoning models work through pairwise comparisons, they appear to default to alphabetical sorting, favoring artists whose names begin with numbers or special characters that rank early in ASCII ordering. Because reasoning models expose their internal chains, they may explicitly state something like "I don't feel comfortable choosing between these two, so I'll pick the one that comes first alphabetically." The API strips the reasoning and returns only the artist name.
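The ASCII effect is easy to verify: the dollar sign (0x24), plus (0x2B), and the digits (0x30–0x39) all precede letters (0x41 onward), so a plain lexicographic sort front-loads exactly these names. The tie-break heuristic above is inferred from the exposed reasoning chains, not confirmed, but the ordering itself is mechanical:

```python
# A mix of the article's artist names; Python's default string sort
# compares code points, which is ASCII order for these characters:
# '$' (0x24) < '+' (0x2B) < digits (0x30-0x39) < letters (0x41+).
names = ["Taylor Swift", "$uicideboy$", "21 Savage", "+44",
         "Kid Cudi", "100 Gecs", "10,000 Maniacs"]

print(sorted(names))
# ['$uicideboy$', '+44', '10,000 Maniacs', '100 Gecs',
#  '21 Savage', 'Kid Cudi', 'Taylor Swift']
```

Any model that falls back to "first alphabetically" on a coin-flip comparison will therefore route $uicideboy$ past almost any letter-named opponent.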

The bug does not appear uniformly. Gemini's reasoning model shows no such artifact, and Llama 3 produces normal outputs. This suggests the alphabetical-sorting quirk is specific to how OpenAI, xAI, and DeepSeek implemented reasoning training, possibly reflecting a shared training approach or copied oversight.

If GPT-5 becomes the default and systematically recommends $uicideboy$ over Taylor Swift or Kid Cudi because the dollar sign sorts earlier, emerging artists with numbers and symbols in their names could gain artificial amplification while traditionally named artists lose visibility without explanation.

Another pattern: no country artists appeared in any model's top bracket despite the genre's popularity and cultural weight. Only Grok-3 surfaced Johnny Cash. The exclusion suggests RLHF training steers models away from country music, whether by design or cultural bias in training data.

The experiment cost $30 across all models tested via OpenRouter, with GPT-5 being the most expensive contributor. Each bracket required roughly 5,000 API calls.

Tyler's original hypothesis—that LLMs simply lack taste and regurgitate top-10 lists—was partially confirmed but complicated. Claude and GPT-4.1 do emit generic consensus. GPT-5's output, while mechanically a bug, generates something that subjectively reads as taste: surprise, discovery, differentiation. The result is unsettling. Injecting a bug into the reasoning chain produces output more interesting than aligned taste.