Mercor
Abstract
We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform high-value consumer tasks. ACE contains a hidden held-out set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open-sourcing 80 cases as a dev set under a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with web search enabled) using a novel grading methodology that dynamically checks whether the relevant parts of a response are grounded in the retrieved web sources. GPT-5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) at 55.2% and GPT-5.1 (Thinking = High) at 55.1%. Performance varies across domains, and in Shopping the top model scores under 50%. On some requests (such as quoting the correct price or providing working links), models are highly prone to hallucination. Overall, ACE reveals a substantial gap between the performance of even the best models and consumers' AI needs.
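The abstract describes the grading methodology only at a high level. As a minimal sketch of what a per-claim grounding check might look like (all names here, including extract-style helpers and the substring support test, are hypothetical illustrations, not the ACE implementation):

```python
# Sketch of a grounding-style grader: the response is split into factual
# claims, and each claim is checked against the web sources the model
# actually retrieved. Function and field names are illustrative only.

from dataclasses import dataclass


@dataclass
class GradedClaim:
    claim: str
    grounded: bool  # True if some retrieved source supports the claim


def claim_supported_by(claim: str, source_text: str) -> bool:
    # Placeholder support test: a real grader would likely use an LLM judge
    # or an entailment model; naive substring overlap is used here only to
    # keep the sketch self-contained.
    return claim.lower() in source_text.lower()


def grade_response(claims: list[str], retrieved_sources: list[str]) -> float:
    graded = [
        GradedClaim(c, any(claim_supported_by(c, s) for s in retrieved_sources))
        for c in claims
    ]
    # Score = fraction of claims grounded in at least one retrieved source.
    return sum(g.grounded for g in graded) / len(graded) if graded else 0.0
```

Only the per-claim structure is the point of the sketch; the actual support test in a dynamic grader would be far stronger than substring matching.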
Bonn University
Abstract
As computational demands continue to rise, assessing the environmental footprint of AI requires moving beyond energy and water consumption to include the material demands of specialized hardware. This study quantifies the material footprint of AI training by linking computational workloads to physical hardware needs. The elemental composition of the Nvidia A100 SXM 40 GB graphics processing unit (GPU) was analyzed using inductively coupled plasma optical emission spectroscopy, which identified 32 elements. The results show that AI hardware consists of about 90% heavy metals and only trace amounts of precious metals. The elements copper, iron, tin, silicon, and nickel dominate the GPU composition by mass. In a multi-step methodology, we integrate these measurements with computational throughput per GPU across varying lifespans, accounting for the computational requirements of training specific AI models under different training-efficiency regimes. Scenario-based analyses reveal that, depending on Model FLOPs Utilization (MFU) and hardware lifespan, training GPT-4 requires between 1,174 and 8,800 A100 GPUs, corresponding to the extraction and eventual disposal of up to 7 tons of toxic elements. Combined software and hardware optimization strategies can reduce material demands: increasing MFU from 20% to 60% lowers GPU requirements by 67%, while extending lifespan from 1 to 3 years yields comparable savings; implementing both measures together reduces GPU needs by up to 93%. Our findings highlight that incremental performance gains, such as those observed between GPT-3.5 and GPT-4, come at disproportionately high material costs. The study underscores the necessity of incorporating material resource considerations into discussions of AI scalability, emphasizing that future progress in AI must align with principles of resource efficiency and environmental responsibility.
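The scenario arithmetic follows from a simple inverse-proportional relationship: the number of GPUs needed to complete a fixed training budget within the hardware lifespan scales as 1/MFU and 1/lifespan. A minimal sketch of that relationship, where the peak throughput, training budget, and helper name are illustrative assumptions rather than the paper's exact inputs:

```python
# Sketch of the GPU-requirement arithmetic: GPUs needed = training FLOPs
# divided by the effective FLOPs one GPU delivers over its lifespan.
# All numeric inputs below are illustrative assumptions.

SECONDS_PER_YEAR = 365 * 24 * 3600
A100_PEAK_FLOPS = 312e12   # A100 dense BF16 peak, FLOP/s (assumed)
TRAINING_FLOPS = 2.2e25    # assumed budget for a GPT-4-class training run


def gpus_needed(mfu: float, lifespan_years: float) -> float:
    effective_flops_per_gpu = (
        A100_PEAK_FLOPS * mfu * lifespan_years * SECONDS_PER_YEAR
    )
    return TRAINING_FLOPS / effective_flops_per_gpu


baseline = gpus_needed(mfu=0.20, lifespan_years=1)
better_mfu = gpus_needed(mfu=0.60, lifespan_years=1)    # 3x MFU
longer_life = gpus_needed(mfu=0.20, lifespan_years=3)   # 3x lifespan
both = gpus_needed(mfu=0.60, lifespan_years=3)          # both together

print(f"baseline: {baseline:,.0f} GPUs")
print(f"MFU 20% -> 60%: {1 - better_mfu / baseline:.0%} reduction")
print(f"lifespan 1 -> 3 years: {1 - longer_life / baseline:.0%} reduction")
print(f"both combined: {1 - both / baseline:.0%} reduction")
```

Under these simplified inputs, each threefold improvement cuts GPU needs by 67%, and the two compose to roughly an 89% reduction; the paper's reported figure of up to 93% reflects its own scenario inputs rather than this idealized model.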