Ask any major LLM to pick a random number between 1 and 10, and you'll almost certainly get 7. Ask for a car brand, and expect Toyota or Honda. Request a tagline for a shoe campaign, and both Claude and ChatGPT will independently land on "Run your way." This isn't coincidence — it's a structural flaw baked into how modern AI models are built.

The Homogeneity Problem

An October 2025 paper titled "Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)" put hard numbers to what many power users had long suspected. Researchers tested more than 70 LLMs, prompting each 50 times to write a metaphor about time. The result: over half of the 3,500 responses were a variation of "Time is a river," with most of the remainder landing on "Time is a weaver."

The paper won the best paper award at NeurIPS, one of AI's most prestigious conferences. Researchers attribute the convergence to the fact that most models are trained on similar datasets, using similar methods, for similar tasks.

"The way that most chat interfaces are designed, it makes it feel like you're having a personal conversation. I think most people don't really realize the extent to which they are getting the same stuff as everybody else." — Kieran Browne, cofounder and CTO, Springboards

The effect shows up in surprisingly mundane places. Ask any major LLM for band name ideas and you'll get a list heavy with words like "glass," "neon," "velvet," and "static." When one journalist tested this, ChatGPT suggested "Glass Harbor," "Static Empire," "Neon Hearts," and "Velvet Echo" — all in a single output.

Meet Flint: Training for Variety

Australian startup Springboards is trying to engineer its way out of the rut. The company built a model called Flint, trained specifically to produce a wider range of responses to open-ended prompts.

"Most language models are fighting hallucinations. We welcome them." — Pip Bingemann, cofounder and CEO, Springboards

Flint is built on Qwen 3, an open-source model from Chinese tech giant Alibaba. The Springboards team made a deliberate call to avoid training a foundation model from scratch.

"Training a foundation model is not on the table for us. It's just too expensive." — Kieran Browne, CTO, Springboards

The obvious fix would be to crank up the model's temperature setting, the standard parameter for controlling output randomness. Springboards found that approach too blunt: maxing out temperature on some models produced responses that broke mid-sentence into incoherent code.
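
To see why temperature is a blunt instrument, it helps to look at what the parameter does mathematically. The sketch below is a generic illustration of temperature-scaled sampling, not Springboards' code: the candidate words and their scores are made up for the example, and the function names are hypothetical. Temperature divides every score before the scores are turned into probabilities, so raising it flattens the whole distribution at once, lifting junk candidates along with interesting ones.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature, rng=random):
    """Draw one token according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical scores for the word that follows "Time is a ..."
# The last entry stands in for a junk token the model should almost never pick.
tokens = ["river", "weaver", "thief", "kaleidoscope", "};"]
logits = [5.0, 3.5, 1.0, 0.5, -2.0]

for t in (0.5, 1.0, 5.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

At low temperature nearly all the probability sits on the top candidate. At high temperature the gap between "river" and the junk token shrinks, and because the same scaling applies at every position in the response, the odds of at least one bad pick grow with the length of the output.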

A More Surgical Approach

Instead, the team trained Flint to:

  • Identify specific decision points in its output where variety is possible
  • Inject controlled randomness only at those moments — not across every token
  • Leave the rest of the response coherent and structured

For example, when asked "Where should I go in Europe?" the model only needs to diversify at the moment it selects a destination — not throughout the entire surrounding sentence.
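
The idea of randomizing only at decision points can be sketched with a toy generator. This is an illustration under stated assumptions, not a description of how Flint works internally: Springboards has not published its method, and a trained model would have to learn where the decision points are, whereas here they are hand-labeled. The candidate phrases, scores, and function names are all hypothetical.

```python
import math
import random

def sample(options, temperature, rng):
    """Pick one option from {text: score}; temperature 0 means greedy."""
    if temperature == 0:
        return max(options, key=options.get)
    texts = list(options)
    peak = max(options.values())
    weights = [math.exp((options[t] - peak) / temperature) for t in texts]
    return rng.choices(texts, weights=weights, k=1)[0]

# Each step: (is_decision_point, {candidate text: made-up score}).
# Only the destination is flagged as a place where variety is welcome.
STEPS = [
    (False, {"You should visit": 4.0, "Maybe try": 1.0}),
    (True, {"Paris": 5.0, "Ljubljana": 2.0, "Porto": 2.0, "Tallinn": 1.5}),
    (False, {"for its food and walkable streets.": 3.0, "because reasons.": 0.5}),
]

def generate(steps, decision_temperature, rng):
    """Sample randomly at decision points and greedily everywhere else."""
    parts = []
    for is_decision_point, options in steps:
        temperature = decision_temperature if is_decision_point else 0
        parts.append(sample(options, temperature, rng))
    return " ".join(parts)

rng = random.Random(0)
for _ in range(5):
    print(generate(STEPS, decision_temperature=3.0, rng=rng))
```

Every run keeps the same sentence frame, and only the destination varies. Setting decision_temperature to 0 collapses the output back to the single most likely answer, which is the behavior the mainstream models tend toward.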

Real-World Testing

Zoe Scaman, founder of strategy startup Bodacious and chief strategy officer at 77X — a direct-to-fan marketing platform set up by Luka Dončić of the LA Lakers — has been running Flint against the mainstream models.

In one MBA-style test ("How would you reinvent a finance company for today's youth?"), Claude, Gemini, and ChatGPT all converged on financial literacy gamification. Flint went a different direction, suggesting the entire concept of wealth accumulation needed a rebrand.

"That was really interesting." — Zoe Scaman

She's candid that Flint is still a prototype with rough edges: "It sometimes falls over when you start pushing it too far." But she backs the underlying premise.

Maximilian Weigl, cofounder at marketing firm Uncommon, echoes that view. His team runs Flint alongside the major models as a creative provocation tool.

"You can't really create something boundary-breaking with tools that pull you back to the average." — Maximilian Weigl, Uncommon

What This Means for AI Creativity

Springboards is positioning Flint as a component inside its broader brainstorming platform — a tool for advertising and marketing professionals that lets users drag, mix, and recombine outputs from multiple models. Flint is the "oddball" option users can switch to when they want to break out of the predictable.

OpenAI has acknowledged the tradeoff, noting that training for reliability and coherence naturally pushes models toward high-probability responses — and that forcing novelty can degrade output quality. The company also points out that the Hivemind paper tested 2024-era models that have since been updated.

Still, the core tension remains unresolved across the industry: the same training dynamics that make LLMs trustworthy also make them repetitive. Springboards is betting there's a market — and a technical path — for models that can be both.