Uncertain Metrics: The Hidden Flaws in Crowdsourced AI Benchmarks

As artificial intelligence advances, the race to measure its progress has intensified. Crowdsourced AI benchmarks, like Chatbot Arena, have gained traction among tech giants such as OpenAI, Google, and Meta, offering a seemingly democratic way to evaluate model performance. These platforms rely on users to compare AI outputs, generating leaderboards that labs tout as proof of superiority. However, a growing chorus of experts warns that these benchmarks are riddled with flaws, raising questions about their validity, ethics, and impact on AI development. Here’s why crowdsourced benchmarks are under scrutiny in 2025.

The Appeal and Mechanics of Crowdsourced Benchmarks

Crowdsourced benchmarks, such as LMSYS’s Chatbot Arena, involve users prompting two anonymous AI models and selecting the preferred response. The resulting votes, often aggregated via an Elo rating system, create public rankings. This approach is appealing for its scale and accessibility, allowing labs to test models with diverse inputs at low cost. For companies, high scores become marketing gold, signaling breakthroughs to investors and users. Yet, beneath the surface, experts argue these systems fall short of scientific rigor.
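To make the mechanics concrete, the snippet below is a minimal sketch of how pairwise preference votes can be folded into Elo-style ratings. The K-factor, 400-point scale, starting rating, and model names are illustrative assumptions, not Chatbot Arena’s actual parameters.

```python
# Minimal sketch: aggregating pairwise preference votes into Elo-style ratings.
# All constants and model names here are illustrative assumptions, not any
# platform's actual configuration.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Elo-predicted probability that model A beats model B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def record_vote(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Update both models' ratings in place after one user picks a winner."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected_win)   # bigger jump when the win was an upset
    ratings[winner] += delta
    ratings[loser] -= delta

# Three hypothetical models and a handful of simulated votes (winner, loser)
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
for winner, loser in [("model_a", "model_b"), ("model_a", "model_c"),
                      ("model_b", "model_c"), ("model_a", "model_b")]:
    record_vote(ratings, winner, loser)

# Leaderboard, highest rating first
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

The appeal is clear from the sketch: each vote is cheap to collect and the ranking updates automatically. The criticisms below are about what those votes actually measure.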

Key Flaws in Crowdsourced Benchmarks

1. Lack of Construct Validity

Emily Bender, a linguistics professor at the University of Washington, emphasizes that valid benchmarks must measure a specific, well-defined construct with evidence tying measurements to that construct. Chatbot Arena’s user-driven voting often fails this test, as it reflects subjective preferences rather than objective capabilities like reasoning or factual accuracy. Without clear criteria, rankings may reward flashy outputs over substance.

2. Ethical Concerns and Unpaid Labor

Crowdsourcing relies on unpaid user contributions, drawing parallels to exploitative data labeling practices. Kristine Gloria, formerly of the Aspen Institute, argues that evaluators should be compensated, especially given the commercial stakes. The lack of payment raises ethical questions about labor fairness, particularly when labs profit from user inputs.

3. Misaligned Incentives

Asmelash Teka Hadgu of Lesan notes that labs may “game” benchmarks by tuning models for a specific test, as when Meta submitted an experimental, conversation-optimized variant of Llama 4 Maverick to Chatbot Arena that outscored the version it later released publicly. Such practices inflate scores without reflecting real-world utility, misleading stakeholders about a model’s true capabilities.

4. Inconsistent and Noisy Data

Crowdsourced platforms often lack expert oversight, which leads to noisy data. For instance, benchmarks like HellaSwag and MMLU, which draw on amateur sources such as WikiHow and Mechanical Turk, contain typos and nonsensical questions that undermine their reliability. On arena-style platforms, voter bias and uneven expertise further muddy the results, making it hard to draw meaningful conclusions from the rankings; the sketch below illustrates how even a modest share of random voters can compress the rating gaps a leaderboard is meant to reveal.
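The following rough simulation builds on the Elo sketch above and rests entirely on stated assumptions: one model is “truly” preferred in 65% of matchups, but some fraction of voters click at random. As that fraction grows, the rating gap between the two models tends to shrink, blurring exactly the distinction the leaderboard exists to surface. All numbers are hypothetical.

```python
import random

# Rough simulation (illustrative assumptions throughout): model_a is "truly"
# preferred in 65% of matchups, but a fraction of voters answer at random.
# Random votes dilute the signal and compress the Elo gap between the models.

def expected_score(ra: float, rb: float) -> float:
    """Elo-predicted probability that the first model wins."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))

def simulate_gap(n_votes: int = 20000, true_pref: float = 0.65,
                 noise_rate: float = 0.0, k: float = 16.0, seed: int = 0) -> float:
    """Run sequential Elo updates over simulated votes and return the average
    rating gap over the second half of the run (to smooth random-walk noise)."""
    rng = random.Random(seed)
    ra, rb, gaps = 1000.0, 1000.0, []
    for i in range(n_votes):
        attentive = rng.random() >= noise_rate
        a_wins = rng.random() < (true_pref if attentive else 0.5)
        ea = expected_score(ra, rb)
        delta = k * ((1.0 if a_wins else 0.0) - ea)
        ra += delta
        rb -= delta
        if i >= n_votes // 2:
            gaps.append(ra - rb)
    return sum(gaps) / len(gaps)

for noise in (0.0, 0.3, 0.6):
    print(f"noise rate {noise:.1f}: average Elo gap ~ {simulate_gap(noise_rate=noise):.0f}")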

5. Limited Real-World Relevance

Experts like Matt Fredrikson of Gray Swan AI argue that crowdsourced benchmarks are no substitute for internal testing or domain-specific evaluations. They often fail to assess practical skills, such as a model’s ability to handle complex, context-dependent tasks in fields like healthcare or law, where precision is critical.

A Path Forward

To address these flaws, experts advocate for dynamic, independently managed benchmarks tailored to specific domains, developed with paid professionals. Asmelash Teka Hadgu suggests distributing benchmarks across universities and organizations to ensure neutrality. Initiatives like Epoch AI’s FrontierMath, crafted with input from Fields Medalists, show promise by prioritizing expert-driven, challenging datasets. Additionally, compensating evaluators and enforcing transparent reporting can enhance fairness and trust.

Why It Matters

Crowdsourced benchmarks shape AI development and regulatory frameworks, yet their flaws risk misdirecting progress. Overreliance on flawed metrics could prioritize hype over substance, delaying meaningful advancements. As Wei-Lin Chiang of LMSYS defends Chatbot Arena’s role as a community feedback tool, the debate underscores a broader need: robust, ethical evaluation systems that reflect AI’s real-world impact.