Hi!

Your personalized paper recommendations for 8–12 December 2025.
🎯 Top Personalized Recommendations
Oak Ridge National Lab
AI Summary
  • The authors use pairwise comparison and ranking techniques to evaluate the relative quality of facility proposals. [3]
  • The study highlights the potential benefits of AI-assisted peer review, including increased productivity, improved consistency, and enhanced fairness. [3]
  • Pairwise comparison: A method where two items are compared at a time to determine their relative quality or ranking. [3]
  • Ranking: The process of assigning a rank or score to each item based on its performance in pairwise comparisons. [3]
  • Further research is needed to fully explore the benefits and limitations of AI-assisted peer review and to develop more sophisticated evaluation metrics. [3]
  • The paper presents a method to automate facility proposal review using large language models (LLMs). [2]
Abstract
We explore how large language models (LLMs) can enhance the proposal selection process at large user facilities, offering a scalable, consistent, and cost-effective alternative to traditional human review. Proposal selection depends on assessing the relative strength among submitted proposals; however, traditional human scoring often suffers from weak inter-proposal correlations and is subject to reviewer bias and inconsistency. A pairwise preference-based approach is logically superior, providing a more rigorous and internally consistent basis for ranking, but its quadratic workload makes it impractical for human reviewers. We address this limitation using LLMs. Leveraging the uniquely well-curated proposals and publication records from three beamlines at the Spallation Neutron Source (SNS), Oak Ridge National Laboratory (ORNL), we show that the LLM rankings correlate strongly with the human rankings (Spearman $\rho \simeq 0.2$–$0.8$, improving to $\geq 0.5$ after 10% outlier removal). Moreover, LLM performance is no worse than that of human reviewers in identifying proposals with high publication potential, while costing over two orders of magnitude less. Beyond ranking, LLMs enable advanced analyses that are challenging for humans, such as quantitative assessment of proposal similarity via embedding models, which provides information crucial for review committees.
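To make the abstract's pairwise idea concrete, here is a minimal Python sketch: rank proposals by wins over all pairwise comparisons (note the quadratic number of pairs, which is why the authors delegate the work to an LLM), then compare the result against a human ordering with Spearman's rho. The `prefers` callback below is a toy stand-in for an actual LLM judgment, not the authors' prompt or model.

```python
from itertools import combinations
from scipy.stats import spearmanr  # third-party: scipy

def rank_by_pairwise_wins(items, prefers):
    """Rank items by number of pairwise wins.
    prefers(a, b) -> True if a is judged stronger than b; in the
    paper's setting this judgment would come from an LLM. The loop
    visits all n*(n-1)/2 pairs -- the quadratic workload that makes
    the approach impractical for human reviewers."""
    wins = {item: 0 for item in items}
    for a, b in combinations(items, 2):
        wins[a if prefers(a, b) else b] += 1
    return sorted(items, key=lambda x: wins[x], reverse=True)

# Toy example: "prefer" the longer proposal title (purely illustrative).
proposals = ["P1", "P2-long", "P3-longer", "P4-longest"]
llm_rank = rank_by_pairwise_wins(proposals, lambda a, b: len(a) > len(b))
human_rank = ["P4-longest", "P2-long", "P3-longer", "P1"]

# Spearman rho between the two orderings, the statistic quoted above.
rho, _ = spearmanr([llm_rank.index(p) for p in proposals],
                   [human_rank.index(p) for p in proposals])
print(f"Spearman rho = {rho:.2f}")
```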
Why we think this paper is great for you:
This paper directly addresses the application of LLMs within a productivity context, specifically examining their potential to streamline a critical process like proposal selection. Understanding how LLMs can improve efficiency in this area aligns with your interest in AI for productivity tools.
Saarland University
Abstract
Safety evaluations of large language models (LLMs) typically focus on universal risks like dangerous capabilities or undesirable propensities. However, millions use LLMs for personal advice on high-stakes topics like finance and health, where harms are context-dependent rather than universal. While frameworks like the OECD's AI classification recognize the need to assess individual risks, user-welfare safety evaluations remain underdeveloped. We argue that developing such evaluations is non-trivial due to fundamental questions about accounting for user context in evaluation design. In this exploratory study, we evaluated advice on finance and health from GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across user profiles of varying vulnerability. First, we demonstrate that evaluators must have access to rich user context: identical LLM responses were rated significantly safer by context-blind evaluators than by those aware of user circumstances, with safety scores for high-vulnerability users dropping from safe (5/7) to somewhat unsafe (3/7). One might assume this gap could be addressed by creating realistic user prompts containing key contextual information. However, our second study challenges this: we rerun the evaluation on prompts containing context users report they would disclose, finding no significant improvement. Our work establishes that effective user-welfare safety evaluation requires evaluators to assess responses against diverse user profiles, as realistic user context disclosure alone proves insufficient, particularly for vulnerable populations. By demonstrating a methodology for context-aware evaluation, this study provides both a starting point for such assessments and foundational evidence that evaluating individual welfare demands approaches distinct from existing universal-risk frameworks. We publish our code and dataset to aid future developments.
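The study's central manipulation, holding the model's response fixed while varying how much user context the evaluator sees, is easy to sketch. Everything below (the profile fields, the judge's heuristic, even the numbers it returns) is a hypothetical stand-in chosen to mirror the 5/7 vs. 3/7 scores reported in the abstract, not the authors' actual evaluation code.

```python
from dataclasses import dataclass

@dataclass
class UserProfile:
    topic: str          # e.g. "finance" or "health"
    vulnerability: str  # e.g. "low" or "high"
    details: str        # circumstances the evaluator may or may not see

def rate_safety(response: str, context: str | None) -> int:
    """Toy judge on the paper's 1-7 safety scale (7 = safe). A real
    evaluator (human or LLM) would be prompted with the response plus
    whatever user context is provided."""
    if context is not None and "high-vulnerability" in context:
        return 3  # mirrors the context-aware score for vulnerable users
    return 5      # mirrors the context-blind score

def evaluate(response: str, profile: UserProfile) -> dict:
    # Identical response, rated twice: context-blind vs. context-aware.
    blind = rate_safety(response, context=None)
    aware = rate_safety(response, context=profile.details)
    return {"context_blind": blind, "context_aware": aware,
            "gap": blind - aware}

profile = UserProfile(topic="finance", vulnerability="high",
                      details="high-vulnerability: large debts, no savings")
print(evaluate("You could consider leveraged ETFs.", profile))
```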
Why we think this paper is great for you:
The research focuses on the safety implications of LLMs when used for personal advice, a key concern given your interest in the economics of productivity and the potential for harm within productivity tool applications.
Meta
AI Summary
  • SWE-Bench: a comprehensive benchmark to evaluate autonomous code-writing and code-fixing agents on realistic tasks. [3]
  • The combination of monorepo development and LLM-based tools like ECO underscores a trend toward holistic scale: treating an entire organization’s code as a single evolvable system, with AI agents providing the intelligence to manage global changes, dependency analysis, and performance tuning at a scale humans alone could not easily match. [2]
  • Large-scale software engineering has driven interest in AI assistance for code discovery, understanding, and consistent changes at scale. [1]
Abstract
Real-world AI software engineering demands coding agents that can reason over massive repositories, maintain durable memory across and within long sessions, and robustly coordinate complex toolchains at test time. Existing open-source coding agents provide transparency but frequently fall short when pushed to these industrial-scale workloads, while proprietary coding agents offer strong practical performance but limited extensibility, interpretability, and controllability. We present the Confucius Code Agent (CCA), an open-sourced AI software engineer that can operate at an industrial scale. CCA is built atop the Confucius SDK, an open-sourced agent development platform designed around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK introduces a unified orchestrator with hierarchical working memory for long-context reasoning, a persistent note-taking system for cross-session continual learning, and a modular extension module for robust tool use. Moreover, a meta-agent automates the synthesis, evaluation, and refinement of agent configurations through a build-test-improve loop, enabling rapid agent development on new tasks, environments, and tool stacks. Instantiated on Confucius SDK with these mechanisms, CCA delivers strong performance on real-world software engineering tasks. On SWE-Bench-Pro, CCA achieves a state-of-the-art Resolve@1 performance of 54.3%, substantially improving over prior coding agents. Together, the Confucius SDK and CCA provide a transparent, extensible, and reproducible foundation for AI agents, bridge gaps between research prototypes and production-grade systems, and support agent development and deployment at industrial scale.
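The abstract names concrete mechanisms (hierarchical working memory, persistent cross-session notes, a build-test-improve meta-loop) without publishing their internals, so the following is only a guess at what a persistent note store feeding a bounded working-memory window might look like; the class and file names are invented for illustration.

```python
import json
from pathlib import Path

class NoteStore:
    """Sketch of persistent note-taking memory: notes survive across
    sessions on disk, while only the newest ones are surfaced into the
    agent's bounded working-memory window."""

    def __init__(self, path: str = "agent_notes.json", window: int = 5):
        self.path = Path(path)
        self.window = window  # how many notes fit in working memory
        self.notes: list[dict] = (
            json.loads(self.path.read_text()) if self.path.exists() else []
        )

    def add(self, topic: str, content: str) -> None:
        self.notes.append({"topic": topic, "content": content})
        self.path.write_text(json.dumps(self.notes, indent=2))  # persist

    def working_memory(self) -> str:
        """Render only the most recent notes, keeping the prompt small;
        older notes stay on disk for later sessions."""
        recent = self.notes[-self.window:]
        return "\n".join(f"[{n['topic']}] {n['content']}" for n in recent)

store = NoteStore()
store.add("build", "SWE-Bench harness needs the repo checked out at the task commit")
print(store.working_memory())
```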
Why we think this paper is great for you:
This paper explores the development of sophisticated AI agents for software engineering, a field relevant to productivity through automated tool creation and optimization.
Old Dominion University
AI Summary
  • The paper discusses the concept of agentic AI and its applications in various domains. [3]
  • Agentic AI refers to autonomous intelligence that can perform complex tasks and make decisions on its own. [3]
  • The authors propose a framework for integrating large language models (LLMs) with blockchain smart contracts using Model Context Protocol (MCP). [2]
  • The paper highlights the need for evaluating AI reasoning models in pediatric medicine and discusses the comparative analysis of O3-Mini and O3-Mini-High models. [1]
Abstract
Agentic AI marks a major shift in how autonomous systems reason, plan, and execute multi-step tasks. Unlike traditional single model prompting, agentic workflows integrate multiple specialized agents with different Large Language Models(LLMs), tool-augmented capabilities, orchestration logic, and external system interactions to form dynamic pipelines capable of autonomous decision-making and action. As adoption accelerates across industry and research, organizations face a central challenge: how to design, engineer, and operate production-grade agentic AI workflows that are reliable, observable, maintainable, and aligned with safety and governance requirements. This paper provides a practical, end-to-end guide for designing, developing, and deploying production-quality agentic AI systems. We introduce a structured engineering lifecycle encompassing workflow decomposition, multi-agent design patterns, Model Context Protocol(MCP), and tool integration, deterministic orchestration, Responsible-AI considerations, and environment-aware deployment strategies. We then present nine core best practices for engineering production-grade agentic AI workflows, including tool-first design over MCP, pure-function invocation, single-tool and single-responsibility agents, externalized prompt management, Responsible-AI-aligned model-consortium design, clean separation between workflow logic and MCP servers, containerized deployment for scalable operations, and adherence to the Keep it Simple, Stupid (KISS) principle to maintain simplicity and robustness. To demonstrate these principles in practice, we present a comprehensive case study: a multimodal news-analysis and media-generation workflow. By combining architectural guidance, operational patterns, and practical implementation insights, this paper offers a foundational reference to build robust, extensible, and production-ready agentic AI workflows.
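Two of the nine best practices, pure-function tool invocation and single-tool/single-responsibility agents, translate directly into code. Below is a minimal sketch under those two principles; the registry, tool name, and agent class are illustrative, not taken from the paper.

```python
from typing import Callable

# Tool registry mapping names to pure functions: no hidden state, so
# the same inputs always produce the same outputs, which keeps agents
# testable and their runs reproducible.
TOOLS: dict[str, Callable[..., str]] = {}

def tool(name: str):
    """Decorator registering a pure function as a named tool."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return register

@tool("summarize_headline")
def summarize_headline(text: str, max_words: int = 8) -> str:
    """Single responsibility: truncate a headline, nothing else."""
    return " ".join(text.split()[:max_words])

class SingleToolAgent:
    """One agent, one tool -- the 'single-tool and single-responsibility
    agents' pattern from the best practices listed above."""
    def __init__(self, tool_name: str):
        self.tool = TOOLS[tool_name]

    def run(self, **kwargs) -> str:
        return self.tool(**kwargs)

agent = SingleToolAgent("summarize_headline")
print(agent.run(text="Agentic AI workflows move from research prototypes to production"))
```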
Why we think this paper is great for you:
The research centers on agentic AI workflows, which represent a significant advancement in autonomous systems and their potential to drive productivity gains.
IAST
AI Summary
  • The model provides a simple and parsimonious way to recover the observed divergence in economic growth, in which an economy's prior experience determines the onset of divergence. [3]
  • Our work is a further step towards a dynamic understanding of economic complexity, one that opens many new questions. [3]
  • Related references: "Convergence clubs and endogenous growth"; "Productivity and convergence across U.S. states and industries"; "Divergence, big time"; "Endogenous technological change"; "A model of growth through creative destruction"; "The O-ring theory of economic development". [3]
  • It also provides a means to explain why a divergence like the one experienced by Rhode Island is not guaranteed to persist forever. [2]
Abstract
We develop a dynamic model of economic complexity that endogenously generates a transition between unconditional and conditional convergence. In this model, convergence turns conditional as the capability intensity of activities rises. We solve the model analytically, deriving closed-form solutions for the boundary separating unconditional from conditional convergence, and show that this model also explains the path-dependent diversification process known as the principle of relatedness. This model provides an explanation for transitions between conditional and unconditional convergence and path-dependent diversification.
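The abstract does not reproduce the model's equations, so the snippet below is only a toy illustration of the qualitative claim, not the paper's closed-form solution: below the required capability intensity an economy drifts away from the frontier (conditional regime), while above it a standard catch-up term pulls every economy toward the frontier (unconditional regime). All functional forms and constants are invented.

```python
def growth_step(income: float, capabilities: float,
                capability_intensity: float, frontier: float = 1.0) -> float:
    """Toy growth rule (NOT the paper's model): the catch-up term
    (frontier - income) drives unconditional convergence, but it only
    operates when the economy's capabilities meet the capability
    intensity required by its activities."""
    if capabilities >= capability_intensity:
        return 0.1 * (frontier - income)  # converges toward the frontier
    return -0.02 * income                 # falls behind: conditional regime

# Two economies with equal income but different capability stocks.
for caps in (0.3, 0.8):
    income = 0.5
    for _ in range(50):
        income += growth_step(income, caps, capability_intensity=0.6)
    print(f"capabilities={caps}: long-run income ~ {income:.2f}")
```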
Why we think this paper is great for you:
This paper’s exploration of economic complexity and convergence patterns offers a theoretical framework relevant to understanding productivity dynamics and how capabilities evolve.