🎯 Top Personalized Recommendations
Carnegie Mellon University
Why we think this paper is great for you:
This paper directly addresses how LLM agents can be optimized for productivity and personalization, exploring the key dimensions of effective real-world agent performance. It offers valuable insights into designing LLM agents that improve task completion.
Abstract
While existing work focuses primarily on task success, we argue that
effective real-world agents require optimizing three dimensions: productivity
(task completion), proactivity (asking essential questions), and
personalization (adapting to diverse user preferences). We introduce UserVille,
an interactive environment with LLM-based user simulators enabling diverse,
configurable user preferences. Leveraging UserVille, we introduce PPP, a
multi-objective reinforcement learning approach that jointly optimizes all
three dimensions: Productivity, Proactivity, and Personalization. Experiments
on software engineering and deep research tasks show that agents trained with
PPP achieve substantial improvements over strong baselines such as GPT-5 (+21.6
on average), demonstrating the ability to ask strategic clarifying questions,
adapt to unseen user preferences, and improve task success through better
interaction. This work demonstrates that explicitly optimizing for
user-centered interaction is critical for building practical and effective AI
agents.
AI Summary
- The PPP multi-objective reinforcement learning framework significantly improves agent performance across productivity, proactivity, and personalization dimensions, achieving an average +21.6 improvement over strong baselines like GPT-5. [3]
- The "increase-then-decrease" learning dynamic for interaction number, where agents initially ask more questions and then refine them to be more targeted and low-effort, is crucial for efficient agent-user collaboration. [3]
- Productivity: The agent's ability to successfully complete the underlying task. [3]
- Personalization: The agent's ability to adapt its communication style to individual user preferences (e.g., brevity, question format, tone). [3]
- LLM agents require explicit optimization for Productivity, Proactivity, and Personalization, not solely task success, to achieve effective real-world human-agent interaction. [2]
- Agents trained with PPP learn strategic interaction, distinguishing between precise and vague prompts to ask clarifying questions only when necessary, and improving question quality over time. [2]
- The USERVILLE environment provides a scalable solution for training and evaluating user-centered agent interaction by simulating diverse user preferences and transforming precise tasks into vague prompts. [2]
- Optimizing for user-centered interaction (proactivity and personalization) can lead to better task success, as agents learn to ask targeted, low-effort questions that address true blockers. [2]
- PPP-trained agents demonstrate strong generalization capabilities, successfully adapting to unseen user preferences, different LLM-based user simulators, and more complex downstream tasks. [2]
- Proactivity: The agent's skill in asking essential clarifying questions when a user's request is underspecified, while avoiding unnecessary queries. [1]
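The abstract does not specify how the three objectives are combined, so the following is a minimal, hypothetical Python sketch of a PPP-style scalarized reward. The signal names, weights, and linear combination are illustrative assumptions, not the paper's actual formulation.

from dataclasses import dataclass

@dataclass
class EpisodeSignals:
    # Hypothetical per-episode signals; all names are assumptions.
    task_success: float      # productivity: 1.0 if the underlying task was completed
    question_utility: float  # proactivity: share of questions that resolved real blockers
    question_cost: float     # proactivity: penalty for unnecessary or high-effort questions
    preference_match: float  # personalization: adherence to the simulated user's preferences

def ppp_reward(s: EpisodeSignals, w_prod: float = 1.0,
               w_pro: float = 0.5, w_pers: float = 0.5) -> float:
    # Scalarize the three objectives into one signal for a multi-objective RL trainer.
    proactivity = s.question_utility - s.question_cost
    return w_prod * s.task_success + w_pro * proactivity + w_pers * s.preference_match

# Example: a successful episode with mostly useful questions and a good style match.
print(ppp_reward(EpisodeSignals(1.0, 0.8, 0.1, 0.9)))  # 1.0 + 0.35 + 0.45 = 1.8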
Shanghai Jiao Tong University
Why we think this paper is great for you:
This paper evaluates the ability of LLM agents to use diverse tools for complex, real-world problems, highlighting their problem-solving competence in practical scenarios. It provides a strong focus on AI agents as productivity tools.
Abstract
Large language model (LLM) agents have exhibited strong problem-solving
competence across domains like research and coding. Yet, it remains
underexplored whether LLM agents can tackle compounding real-world problems
that require a diverse set of tools to complete. Given a broad, heterogeneous
tool repository, LLM agents must not only select appropriate tools based on
task planning analysis but also strategically schedule the execution order to
ensure efficiency. This paper introduces TPS-Bench to benchmark the ability of
LLM agents in solving such problems that demand Tool Planning and Scheduling.
TPS-Bench collects 200 compounding tasks of two difficulty levels, based on a
tool repository containing hundreds of model context protocol (MCP) tools. In
particular, each task is composed of multiple subtasks, such as web search, map
navigation, calendar checking, etc., and each subtask can be completed by a
basic tool. Our evaluation emphasizes both task completion rate and efficiency.
The empirical studies on popular closed-source and open-source LLMs indicate
that most models can perform reasonable tool planning, but differ in
scheduling. For example, GLM-4.5 achieves a leading task completion rate of 64.72% through extensive sequential tool calls, and hence suffers from significantly long execution times. By contrast, GPT-4o prioritizes parallel tool calls but achieves only a 45.08% completion rate. Since reinforcement learning (RL) can be a viable way to improve scheduling efficiency without compromising performance, we perform an initial study on Qwen3-1.7B and observe a 14% reduction in execution time alongside a 6% gain in task completion rate using merely 100 RL training samples. Our code is available at https://github.com/hanwenxu1/mcp-agent.
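The sequential-versus-parallel scheduling gap the abstract describes is easy to see in a toy example. The sketch below is illustrative only: the tool names and latencies are invented, and asyncio.sleep stands in for real MCP tool calls.

import asyncio
import time

async def call_tool(name: str, latency: float) -> str:
    # Stand-in for a real MCP tool invocation (e.g., web search, map navigation).
    await asyncio.sleep(latency)
    return f"{name}: done"

async def sequential(tools):
    # One tool call at a time, as a purely sequential scheduler would do.
    return [await call_tool(name, latency) for name, latency in tools]

async def parallel(tools):
    # Independent subtasks dispatched concurrently.
    return await asyncio.gather(*(call_tool(name, latency) for name, latency in tools))

tools = [("web_search", 1.0), ("map_navigation", 1.0), ("calendar_check", 1.0)]
for schedule in (sequential, parallel):
    start = time.perf_counter()
    asyncio.run(schedule(tools))
    print(f"{schedule.__name__}: {time.perf_counter() - start:.1f}s")
# sequential takes ~3.0s; parallel takes ~1.0s for the same three subtasks.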
UT Dallas, CKGSB
Why we think this paper is great for you:
This paper offers valuable insights into how AI knowledge disseminates and contributes to productivity gains within firms. It explores the organizational conditions that facilitate these productive spillovers, providing an economic perspective on AI's impact.
Abstract
Labor mobility is a critical source of technology acquisition for firms. This
paper examines how artificial intelligence (AI) knowledge is disseminated
across firms through labor mobility and identifies the organizational
conditions that facilitate productive spillovers. Using a comprehensive dataset
of over 460 million job records from Revelio Labs (2010 to 2023), we construct
an inter-firm mobility network of AI workers among over 16,000 U.S. companies.
Estimating a Cobb-Douglas production function, we find that firms benefit
substantially from the AI investments of other firms from which they hire AI
talents, with productivity spillovers two to three times larger than those
associated with traditional IT after accounting for labor scale. Importantly,
these spillovers are contingent on organizational context: hiring from flatter firms that use lean-startup methods more intensively generates significant productivity gains, whereas hiring from firms lacking these traits yields little benefit.
Mechanism tests indicate that "flat and lean" organizations cultivate more
versatile AI generalists who transfer richer knowledge across firms. These
findings reveal that AI spillovers differ fundamentally from traditional IT
spillovers: while IT spillovers primarily arise from scale and process
standardization, AI spillovers critically depend on the experimental and
integrative environments in which AI knowledge is produced. Together, these
results underscore the importance of considering both labor mobility and
organizational context in understanding the full impact of AI-driven
productivity spillovers.
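As a rough guide to the kind of specification the abstract describes, a log-linearized Cobb-Douglas production function with a mobility-weighted AI spillover term might look as follows; the variable names and the exact construction of the spillover pool are assumptions for illustration, not the paper's estimating equation.

% Hypothetical log-linear specification for firm i in year t.
% Y: output, K: capital, L: labor, AI: own AI investment;
% m_{ijt}: share of firm i's AI hires coming from firm j (mobility network);
% mu_i, tau_t: firm and year fixed effects.
\begin{equation}
\ln Y_{it} = \alpha \ln K_{it} + \beta \ln L_{it} + \gamma \ln AI_{it}
  + \delta \sum_{j \neq i} m_{ijt} \ln AI_{jt} + \mu_i + \tau_t + \varepsilon_{it}
\end{equation}

Here $\delta$ captures the productivity spillover from the AI investments of talent-supplying firms, the quantity the abstract reports as two to three times larger than the analogous IT coefficient.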
ifak e.V.
Why we think this paper is great for you:
This paper explores the significant potential of Generative AI to enhance productivity across the software development lifecycle. It provides a vision for its application in practical engineering tasks, showcasing AI as a powerful productivity tool.
Abstract
Generative AI (GenAI) has recently emerged as a groundbreaking force in
Software Engineering, capable of generating code, suggesting fixes, and
supporting quality assurance. While its use in coding tasks shows considerable
promise, applying GenAI across the entire Software Development Life Cycle
(SDLC) has not yet been fully explored. Critical uncertainties in areas such as
reliability, accountability, security, and data privacy demand deeper
investigation and coordinated action. The GENIUS project, comprising over 30
European industrial and academic partners, aims to address these challenges by
advancing AI integration across all SDLC phases. It focuses on GenAI's
potential, the development of innovative tools, and emerging research
challenges, actively shaping the future of software engineering. This vision
paper presents a shared perspective on the future of GenAI-based software
engineering, grounded in cross-sector dialogue and experience within the GENIUS
consortium, supported by an exploratory literature review. The paper explores
four central elements: (1) a structured overview of current challenges in GenAI
adoption across the SDLC; (2) a forward-looking vision outlining key
technological and methodological advances expected over the next five years;
(3) anticipated shifts in the roles and required skill sets of software
professionals; and (4) the contribution of GENIUS in realizing this
transformation through practical tools and industrial validation. By aligning
technical innovation with business relevance, this paper aims to inform both
research agendas and industrial strategies, providing a foundation for
reliable, scalable, and industry-ready GenAI solutions for software engineering
teams.
Princeton University, RTX
Why we think this paper is great for you:
While not directly about productivity, this paper discusses critical aspects of LLM reliability and biases. Understanding these limitations is crucial for the effective and trustworthy deployment of LLMs as productivity tools.
Abstract
Recent research has shown that hallucinations, omissions, and biases are
prevalent in everyday use-cases of LLMs. However, chatbots used in medical
contexts must provide consistent advice in situations where non-medical factors
are involved, such as when demographic information is present. In order to
understand the conditions under which medical chatbots fail to perform as
expected, we develop an infrastructure that 1) automatically generates queries
to probe LLMs and 2) evaluates answers to these queries using multiple
LLM-as-a-judge setups and prompts. For 1), our prompt creation pipeline samples
the space of patient demographics, histories, disorders, and writing styles to
create realistic questions that we subsequently use to prompt LLMs. In 2), our
evaluation pipeline provides hallucination and omission detection using
LLM-as-a-judge as well as agentic workflows, in addition to LLM-as-a-judge
treatment category detectors. As a baseline study, we perform two case studies
on inter-LLM agreement and the impact of varying the answering and evaluation
LLMs. We find that LLM annotators exhibit low agreement scores (average Cohen's
Kappa $\kappa=0.118$), and only specific (answering, evaluation) LLM pairs
yield statistically significant differences across writing styles, genders, and
races. We recommend that studies using LLM evaluation use multiple LLMs as
evaluators in order to avoid arriving at statistically significant but
non-generalizable results, particularly in the absence of ground-truth data. We
also suggest publishing inter-LLM agreement metrics for transparency. Our code
and dataset are available here:
https://github.com/BBN-E/medic-neurips-2025-demo.
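The low inter-annotator agreement the abstract reports (average Cohen's kappa of 0.118) can be checked conceptually in a few lines of Python; the verdicts below are invented, and scikit-learn's cohen_kappa_score is one standard implementation of the statistic.

from sklearn.metrics import cohen_kappa_score

# Hypothetical binary hallucination verdicts from two LLM-as-a-judge
# evaluators over the same set of medical-chatbot answers
# (0 = no hallucination flagged, 1 = hallucination flagged).
judge_a = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]
judge_b = [0, 0, 0, 1, 1, 0, 0, 1, 1, 0]

kappa = cohen_kappa_score(judge_a, judge_b)
print(f"Cohen's kappa: {kappa:.3f}")  # values near 0 indicate agreement barely above chance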
University of Lagos
Why we think this paper is great for you:
This paper examines "productivity" in an academic research context through scientometric analysis, offering a different perspective on how output and impact are measured. It provides a broader understanding of productivity metrics.
Abstract
This paper presents a scientometric analysis of research output from the
University of Lagos, focusing on the two decades spanning 2004 to 2023. Using
bibliometric data retrieved from the Web of Science, we examine trends in
publication volume, collaboration patterns, citation impact, and the most
prolific authors, departments, and research domains at the university. The
study reveals a consistent increase in research productivity, with the highest
publication output recorded in 2023. Health Sciences, Engineering, and Social
Sciences are identified as dominant fields, reflecting the university's
interdisciplinary research strengths. Collaborative efforts, both locally and
internationally, show a positive correlation with higher citation impact, with
the United States and the United Kingdom being the leading international
collaborators. Notably, open-access publications account for a significant
portion of the university's research output, enhancing visibility and citation
rates. The findings offer valuable insights into the university's research
performance over the past two decades, providing a foundation for strategic
planning and policy formulation to foster research excellence and global
impact.