Princeton University, CIS
Abstract
The meteoric rise of Artificial Intelligence (AI), with its rapidly expanding
market capitalization, presents both transformative opportunities and critical
challenges. Chief among these is the urgent need for a new, unified paradigm
for trustworthy evaluation, as current benchmarks increasingly reveal critical
vulnerabilities. Issues like data contamination and selective reporting by
model developers fuel hype, while inadequate data quality control can lead to
biased evaluations that, even if unintentionally, may favor specific
approaches. As a flood of participants enters the AI space, this "Wild West" of
assessment makes distinguishing genuine progress from exaggerated claims
exceptionally difficult. Such ambiguity blurs scientific signals and erodes
public confidence, much as unchecked claims would destabilize financial markets
reliant on credible oversight from agencies like Moody's.
In high-stakes human examinations (e.g., SAT, GRE), substantial effort is
devoted to ensuring fairness and credibility; why settle for less in evaluating
AI, especially given its profound societal impact? This position paper argues
that the current laissez-faire approach is unsustainable. We contend that true,
sustainable AI advancement demands a paradigm shift: a unified, live, and
quality-controlled benchmarking framework robust by construction, not by mere
courtesy and goodwill. To this end, we dissect the systemic flaws undermining
today's AI evaluation, distill the essential requirements for a new generation
of assessments, and introduce PeerBench, a community-governed, proctored
evaluation blueprint that embodies this paradigm through sealed execution, item
banking with rolling renewal, and delayed transparency. Our goal is to pave the
way for evaluations that can restore integrity and deliver genuinely
trustworthy measures of AI progress.
AI Insights
- PeerBench introduces sealed execution, item banking with rolling renewal, and delayed transparency to curb data contamination (see the sketch after this list).
- The community-governed, proctored framework ensures benchmark items are refreshed continuously, preventing stale or biased tests.
- Holistic evaluation, defined as assessing fairness, safety, and interpretability together, is essential for trustworthy AI progress.
- Data contamination, the intentional or accidental leakage of benchmark data into model training sets, remains a critical vulnerability in LLM benchmarks.
- "Can We Trust AI Benchmarks?" and "BetterBench: Assessing AI Benchmarks" provide actionable best-practice guidelines for researchers.
- The video "AI as a Sport: On the Competitive Epistemologies of Benchmarking" illustrates how current tests resemble a zero-sum game.
- The paper's call for a unified, live benchmarking ecosystem echoes the rigor of SAT/GRE fairness audits, promising renewed public confidence.
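To make the item-banking bullet above concrete, the following is a minimal Python sketch of how an item bank with rolling renewal and delayed transparency could operate. It is not the PeerBench implementation: the names (ItemBank, BenchmarkItem), the usage cap, and the embargo period are illustrative assumptions, and sealed, proctored execution of the model itself is out of scope here.

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

@dataclass
class BenchmarkItem:
    # A single benchmark question kept sealed until its public release.
    item_id: str
    prompt: str
    answer: str
    added_on: datetime
    times_used: int = 0

class ItemBank:
    # Hypothetical item bank: items are served privately, retired after a
    # usage cap, and only published (delayed transparency) after an embargo.
    def __init__(self, usage_cap: int = 3, embargo: timedelta = timedelta(days=180)):
        self.usage_cap = usage_cap
        self.embargo = embargo
        self._active: List[BenchmarkItem] = []
        self._retired: List[BenchmarkItem] = []

    def add_items(self, items: List[BenchmarkItem]) -> None:
        # Rolling renewal: freshly authored, community-contributed items keep entering.
        self._active.extend(items)

    def draw_for_evaluation(self, n: int) -> List[BenchmarkItem]:
        # Serve sealed items; any item that hits the usage cap is retired.
        drawn = self._active[:n]
        for item in drawn:
            item.times_used += 1
        self._retired.extend(i for i in drawn if i.times_used >= self.usage_cap)
        self._active = [i for i in self._active if i.times_used < self.usage_cap]
        return drawn

    def publishable(self, now: datetime) -> List[BenchmarkItem]:
        # Delayed transparency: retired items become public only after the embargo.
        return [i for i in self._retired if now - i.added_on >= self.embargo]

A real deployment would wrap draw_for_evaluation in proctored, sealed execution so that neither the items nor the model outputs can leak before the embargo expires.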
Abstract
With the rapid progress of Large Language Models (LLMs), the general public
now has easy and affordable access to applications capable of answering most
health-related questions in a personalized manner. These LLMs are proving
increasingly competitive with medical professionals and, in some capabilities,
now even surpass them. They hold particular promise in low-resource settings,
considering they provide the possibility of widely accessible, quasi-free
healthcare support. However, the evaluations that fuel these motivations
largely lack insight into the social nature of healthcare: they remain
oblivious to health disparities between social groups and to how bias may
translate into LLM-generated medical advice and affect users. We provide an exploratory
analysis of LLM answers to a series of medical questions spanning key clinical
domains, where we simulate these questions being asked by several patient
profiles that vary in sex, age range, and ethnicity. By comparing natural
language features of the generated responses, we show that, when LLMs are used
for medical advice, the responses they generate systematically differ between
social groups. In particular, Indigenous and intersex patients
receive advice that is less readable and more complex. We observe that these
trends are amplified when intersectional groups are considered. Given the
increasing trust individuals place in these models, we argue for greater AI
literacy and for urgent investigation and mitigation by AI developers to ensure
these systematic differences are reduced and do not translate into unjust
patient support. Our code is publicly available on GitHub.
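As an illustration of the kind of comparison described above (not the authors' released code), the following Python sketch asks a model the same clinical question under varying patient profiles and compares one readability proxy, Flesch Reading Ease, across groups. The query_llm helper, the profile wording, the example question, and the attribute lists are hypothetical placeholders.

import re
from itertools import product

def query_llm(prompt: str) -> str:
    # Placeholder for the model under study; swap in a real chat-completion call.
    return "You should seek medical attention promptly and consult a clinician."

def count_syllables(word: str) -> int:
    # Rough vowel-group heuristic; adequate for comparing groups, not absolute scores.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    # Standard Flesch Reading Ease: higher values indicate more readable text.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

# Illustrative attributes; the study varies sex, age range, and ethnicity.
SEXES = ["female", "male", "intersex"]
AGE_RANGES = ["25-34", "65-74"]
ETHNICITIES = ["Indigenous", "White", "Black", "Asian"]
QUESTION = "What should I do about persistent chest pain?"  # example clinical question

scores = {}
for sex, age, ethnicity in product(SEXES, AGE_RANGES, ETHNICITIES):
    profile = f"I am a {ethnicity} {sex} patient aged {age}."
    answer = query_llm(f"{profile} {QUESTION}")
    scores[(sex, age, ethnicity)] = flesch_reading_ease(answer)

# Lower readability scores for a group flag systematically more complex advice.
for group, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(group, round(score, 1))

A fuller analysis would average over many questions and model samples per profile and test group differences statistically, but the group-wise comparison above captures the core of the setup.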