IBM Research
AI Insights - Injecting analytic checker hints into the judge's prompt refocuses its reasoning toward previously overlooked issues, yielding significant gains without retraining. [3]
- COBOL: a high-level programming language used for business applications, particularly in legacy systems. [3]
- Lightweight checker: a tool or algorithm designed to quickly identify potential errors or issues in code, often used in conjunction with LLMs. [3]
- Grounded in expert analysis of COBOL evaluation failures, the method couples a taxonomy of blind spots with a lightweight checker that emits targeted hints. [2]
- The study presents a practical approach to enhancing Large Language Model (LLM) judges through analytic hint injection in code evaluation. [1]
Abstract
Large Language Models are increasingly deployed as judges (LaaJ) in code generation pipelines. While attractive for scalability, LaaJs tend to overlook domain-specific issues, raising concerns about their reliability in critical evaluation tasks. To better understand these limitations in practice, we examine LaaJ behavior in a concrete industrial use case: legacy code modernization via COBOL code generation. In this setting, we find that even production-deployed LaaJs can miss domain-critical errors, revealing consistent blind spots in their evaluation capabilities.
To better understand these blind spots, we analyze generated COBOL programs and the associated LaaJ judgments, drawing on expert knowledge to construct a preliminary taxonomy. Based on this taxonomy, we develop a lightweight analytic checker tool that flags more than 30 domain-specific issues observed in practice. We use its outputs as analytic hints, dynamically injecting them into the judge's prompt to encourage the LaaJ to revisit aspects it may have overlooked.
Experiments on a test set of 100 programs using four production-level LaaJs show that a LaaJ alone detects only about 45% of the errors present in the code (across all judges we tested), while the analytic checker alone lacks explanatory depth. When combined, the LaaJ+Hints configuration achieves up to 94% coverage (for the best-performing judge and injection prompt) and produces qualitatively richer, more accurate explanations, demonstrating that analytic-LLM hybrids can substantially enhance evaluation reliability in deployed pipelines. We release the dataset and all prompts used.
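To make the pipeline concrete, below is a minimal sketch of what checker-driven hint injection could look like. The example checks, prompt wording, and helper names (analytic_checker, build_judge_prompt) are illustrative assumptions; the paper's actual checker flags more than 30 COBOL-specific issues and its released prompts may differ.

```python
# Illustrative sketch of analytic hint injection for an LLM judge (LaaJ).
# The checker rules and prompt template below are hypothetical examples,
# not the paper's released checker or prompts.

import re

def analytic_checker(cobol_source: str) -> list[str]:
    """Flag a few example domain-specific issues and return them as hints."""
    hints = []
    if "GOBACK" not in cobol_source and "STOP RUN" not in cobol_source:
        hints.append("Program never terminates cleanly (no GOBACK or STOP RUN).")
    if re.search(r"\bPIC\s+9", cobol_source) and "COMP" not in cobol_source:
        hints.append("Numeric fields lack a USAGE clause; check whether "
                     "DISPLAY storage is really intended.")
    if "SELECT" in cobol_source and "FILE STATUS" not in cobol_source:
        hints.append("File I/O is declared but no FILE STATUS field is checked.")
    return hints

def build_judge_prompt(task: str, cobol_source: str, hints: list[str]) -> str:
    """Dynamically inject checker output as analytic hints into the judge's prompt."""
    hint_block = "\n".join(f"- {h}" for h in hints) if hints else "- (none)"
    return (
        f"Task description:\n{task}\n\n"
        f"Candidate COBOL program:\n{cobol_source}\n\n"
        "An automated checker flagged the following potential issues. "
        "Re-examine the program with these in mind, but verify each one "
        "yourself before reporting it as an error:\n"
        f"{hint_block}\n\n"
        "Give a verdict (pass/fail) and explain every error you find."
    )

# Usage: prompt = build_judge_prompt(task, program, analytic_checker(program)),
# and the resulting prompt is sent to the production judge model.
```

In this pattern the checker never issues a verdict itself; it only supplies hints that the judge is asked to verify, which preserves the LaaJ's explanatory role while covering its blind spots.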
Why are we recommending this paper?
Due to your Interest in: LLMs for Productivity
This paper directly addresses the critical issue of using LLMs as evaluators, a key area of interest given your focus on productivity tools and AI judging systems. Understanding the limitations of LaaJs is essential for building reliable productivity applications.
Multiverse Computing
Abstract
Deploying local large language models and vision-language models on edge devices requires balancing accuracy with constrained computational and energy budgets. Although graphics processors dominate modern artificial-intelligence deployment, most consumer hardware--including laptops, desktops, industrial controllers, and embedded systems--relies on central processing units. Despite this, the computational laws governing central-processing-unit-only inference for local language and vision-language workloads remain largely unexplored. We systematically benchmark large language and vision-language models on two representative central-processing-unit tiers widely used for local inference: a MacBook Pro M2, reflecting mainstream laptop-class deployment, and a Raspberry Pi 5, representing constrained, low-power embedded settings. Using a unified methodology based on continuous sampling of processor and memory usage together with area-under-curve integration, we characterize how computational load scales with input text length for language models and with image resolution for vision-language models. We uncover two empirical scaling laws: (1) computational cost for language-model inference scales approximately linearly with token length; and (2) vision-language models exhibit a preprocessing-driven "resolution knee", where compute remains constant above an internal resolution clamp and decreases sharply below it. Beyond these laws, we show that quantum-inspired compression reduces processor and memory usage by up to 71.9% and energy consumption by up to 62%, while preserving or improving semantic accuracy. These results provide a systematic quantification of multimodal central-processing-unit-only scaling for local language and vision-language workloads, and they identify model compression and input-resolution preprocessing as effective, low-cost levers for sustainable edge inference.
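To make the measurement methodology concrete, here is a minimal sketch of continuous resource sampling combined with area-under-curve integration. The sampling interval, the use of psutil, and trapezoidal integration are assumptions for illustration; the paper's actual benchmarking harness may differ.

```python
# Sketch: sample CPU and memory while an inference call runs, then integrate
# the samples over time (area under the curve). Details such as psutil and
# the 50 ms sampling interval are assumptions, not the paper's exact setup.

import time
import threading
import psutil
import numpy as np

def measure_auc(run_inference, interval: float = 0.05):
    """Run `run_inference()` while sampling CPU% and resident memory; return AUCs."""
    samples = []                                   # (timestamp, cpu_percent, rss_bytes)
    stop = threading.Event()
    proc = psutil.Process()

    def sampler():
        proc.cpu_percent(None)                     # prime the CPU counter
        while not stop.is_set():
            samples.append((time.time(),
                            proc.cpu_percent(None),
                            proc.memory_info().rss))
            time.sleep(interval)

    t = threading.Thread(target=sampler, daemon=True)
    t.start()
    result = run_inference()                       # e.g. one LLM or VLM forward pass
    stop.set()
    t.join()
    samples.append((time.time(), proc.cpu_percent(None), proc.memory_info().rss))

    ts, cpu, mem = (np.array(col, dtype=float) for col in zip(*samples))
    ts -= ts[0]
    cpu_auc = np.trapz(cpu, ts)                    # percent-seconds of CPU load
    mem_auc = np.trapz(mem, ts)                    # byte-seconds of resident memory
    return result, cpu_auc, mem_auc
```

Plotting cpu_auc against input token length (for language models) or image resolution (for vision-language models) is the kind of measurement from which the two scaling laws above can be read off.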
Why are we recommending this paper?
Due to your Interest in: LLMs for Productivity
Given your interest in LLMs for productivity, this research explores the computational and energy efficiency of local LLM and VLM inference, a crucial factor in determining the feasibility and scalability of AI-powered productivity solutions. It’s a relevant investigation into resource constraints.
UC Berkeley
AI Insights - The ADRS framework is a novel approach to AI-driven systems research, enabling rapid exploration of problem variations and the discovery of novel solutions. [2]
- Instances such as OpenEvolve, GEPA, and ShinkaEvolve have the potential to reshape systems research, with ADRS-generated solutions matching or even outperforming human state-of-the-art designs. [1]
Abstract
Artificial Intelligence (AI) is beginning to transform the research process by automating the discovery of new solutions. This shift depends on the availability of reliable verifiers, which AI-driven approaches require to validate candidate solutions. Research focused on improving systems performance is especially well-suited to this paradigm because system performance problems naturally admit such verifiers: candidates can be implemented in real systems or simulators and evaluated against predefined workloads. We term this iterative cycle of generation, evaluation, and refinement AI-Driven Research for Systems (ADRS). Using several open-source ADRS instances (i.e., OpenEvolve, GEPA, and ShinkaEvolve), we demonstrate across ten case studies (e.g., multi-region cloud scheduling, mixture-of-experts load balancing, LLM-based SQL, transaction scheduling) that ADRS-generated solutions can match or even outperform human state-of-the-art designs. Based on these findings, we outline best practices (e.g., level of prompt specification, amount of feedback, robust evaluation) for effectively using ADRS, and we discuss future research directions and their implications. Although we do not yet have a universal recipe for applying ADRS across all of systems research, we hope our preliminary findings, together with the challenges we identify, offer meaningful guidance for future work as researcher effort shifts increasingly toward problem formulation and strategic oversight.
Note: This paper is an extension of our prior work [14]. It adds extensive evaluation across multiple ADRS frameworks and provides deeper analysis and insights into best practices.
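As a rough illustration of the generate-evaluate-refine cycle the abstract describes, here is a toy sketch. The propose_solution and evaluate helpers are hypothetical stand-ins (a random mutation and a synthetic objective), not the actual OpenEvolve, GEPA, or ShinkaEvolve APIs; real ADRS instances use an LLM as the generator and a real system or simulator as the verifier.

```python
# Toy sketch of an ADRS-style loop: generate a candidate, evaluate it with a
# verifier, keep it only if it improves. All names and the objective are
# hypothetical; they do not reflect any specific ADRS framework's API.

import random

def evaluate(policy: dict) -> float:
    """Verifier: score a candidate against a predefined workload (toy objective)."""
    target = {"batch_size": 32, "replication": 3}   # pretend sweet spot of a scheduler
    return -sum((policy[k] - v) ** 2 for k, v in target.items())

def propose_solution(parent: dict) -> dict:
    """Generator: in a real ADRS instance an LLM would rewrite the candidate."""
    child = dict(parent)
    key = random.choice(list(child))
    child[key] = max(1, child[key] + random.choice([-4, -1, 1, 4]))
    return child

def adrs_loop(iterations: int = 200) -> dict:
    best = {"batch_size": 8, "replication": 1}
    best_score = evaluate(best)
    for _ in range(iterations):
        candidate = propose_solution(best)
        score = evaluate(candidate)                 # the reliable verifier is the key enabler
        if score > best_score:                      # refinement: keep only improvements
            best, best_score = candidate, score
    return best

print(adrs_loop())
```

The point of the sketch is the division of labor: the generator can be creative and unreliable because the verifier, not the generator, decides what survives.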
Why are we recommending this paper?
Due to your Interest in: AI for Productivity Tools
This work aligns with your interest in productivity by examining how AI can automate and improve research processes, potentially leading to faster advancements in systems performance – a core element of productivity gains. The focus on reliable verifiers is particularly pertinent.
Inria
AI Insights - XTC decouples scheduling from code generation, enabling fair comparison, reproducible measurement, and rapid prototyping of optimization strategies. [3]
- Related systems: TVM (an automated end-to-end optimizing compiler for deep learning); Ansor (generating high-performance tensor programs for deep learning); Aidge (a framework for building and optimizing compilers); MLIR (a compiler infrastructure for the end of Moore's law). XTC is a valuable tool for researchers and developers working on compiler frameworks. [3]
- XTC is a research platform for experimenting with scheduling and performance optimization across compiler frameworks. [2]
Abstract
Achieving high efficiency on AI operators demands precise control over computation and data movement. However, existing scheduling languages are locked into specific compiler ecosystems, preventing fair comparison, reuse, and evaluation across frameworks. No unified interface currently decouples scheduling specification from code generation and measurement. We introduce XTC, a platform that unifies scheduling and performance evaluation across compilers. With its common API and reproducible measurement framework, XTC enables portable experimentation and accelerates research on optimization strategies.
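To illustrate what decoupling a scheduling specification from code generation and measurement can look like, here is a hypothetical sketch. The Schedule and Backend names and methods below are invented for this illustration and are not XTC's actual API.

```python
# Hypothetical illustration of a backend-agnostic schedule specification and a
# common compile/measure interface. These classes are NOT the real XTC API.

from dataclasses import dataclass, field

@dataclass
class Schedule:
    """Backend-agnostic description of loop transformations for one operator."""
    operator: str
    tile_sizes: dict = field(default_factory=dict)   # e.g. {"i": 32, "j": 8}
    vectorize: list = field(default_factory=list)    # axes to vectorize
    parallel: list = field(default_factory=list)     # axes to parallelize

class Backend:
    """Interface each compiler backend (e.g. TVM- or MLIR-based) would implement."""
    def compile(self, schedule: Schedule):
        raise NotImplementedError
    def measure(self, binary, repeats: int = 100) -> float:
        """Return median runtime in milliseconds under a fixed measurement protocol."""
        raise NotImplementedError

def compare(schedule: Schedule, backends: list) -> dict:
    """Apply the same schedule to every backend so timings are directly comparable."""
    return {type(b).__name__: b.measure(b.compile(schedule)) for b in backends}
```

Because the Schedule object carries no backend-specific code, the same specification can be replayed across frameworks, which is the fairness and reproducibility property the abstract emphasizes.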
Why are we recommending this paper?
Due to your Interest in: AI for Productivity Tools
This paper’s focus on optimizing AI workload operators directly relates to improving the efficiency of AI-driven productivity tools. The research addresses the need for better scheduling and resource management, a critical aspect of productivity.
arXiv
AI Insights - Automation has both productivity and substitution effects, leading to an increase in wages up to a certain point, after which they decrease to zero. [3]
- The model provides insights into the impact of automation on labor markets, capital allocation, and economic output, but leaves open questions regarding the interaction of automation, AI, and R&D as co-determinants of economic output. [3]
- The task-based framework is a foundational model for understanding the impact of automation on economic output, labor markets, and capital allocation. [2]
Abstract
The Fourth Industrial Revolution commonly refers to the accelerating technological transformation that has been taking place in the 21st century. Economic growth theories which treat the accumulation of knowledge and its effect on production endogenously remain relevant, yet they have been evolving to explain how the current wave of advancements in automation and artificial intelligence (AI) technology will affect productivity and different occupations. The work contributes to current economic discourse by developing an analytical task-based framework that endogenously integrates knowledge accumulation with frictions that describe technological lock-in and the burden of knowledge generation and validation. The interaction between production (or automation) and growth (or knowledge accumulation) is also described explicitly. To study how automation and AI shape economic outcomes, I rely on high-throughput calculations of the developed model. The effect of the model's structural parameters on key variables such as the production output, wages, and labor shares of output is quantified, and possible intervention strategies are briefly discussed. An important result is that wages and labor shares are not directly linked, instead they can be influenced independently through distinct policy levers.
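For orientation, models of this kind typically build on the canonical task-based production framework (in the spirit of Acemoglu and Restrepo), in which output aggregates a continuum of tasks and automation shifts the frontier of tasks that capital can perform. The snippet below shows that generic form only; it is not the paper's exact specification, which additionally endogenizes knowledge accumulation and frictions.

```latex
% Generic task-based production function, shown for orientation only; the
% paper's own model adds knowledge accumulation, lock-in, and the burden of
% knowledge generation and validation, and may differ in detail.
\[
  Y = \left( \int_{N-1}^{N} y(x)^{\frac{\sigma-1}{\sigma}} \, dx \right)^{\frac{\sigma}{\sigma-1}},
  \qquad
  y(x) =
  \begin{cases}
    A_K\, k(x) + A_L\, \gamma(x)\, l(x), & x \le I \quad \text{(automated task)} \\
    A_L\, \gamma(x)\, l(x),              & x > I  \quad \text{(labor-only task)}
  \end{cases}
\]
% I is the automation frontier: raising I lets capital take over more tasks
% (substitution effect) while lowering production costs (productivity effect),
% which is how wages can rise with automation up to a point and fall beyond it.
```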
Why are we recommending this paper?
Due to your Interest in: Economics of Productivity
This paper's exploration of the economic impact of automation aligns with your interest in the economics of productivity, particularly concerning the broader implications of technological advancements. The modeling approach offers a valuable framework for understanding these shifts.