🎯 Top Personalized Recommendations
Carnegie Mellon University
Why we think this paper is great for you:
This paper directly addresses how LLM agents can be optimized for productivity and personalization, exploring the key dimensions of effective real-world agent performance. It offers valuable insights into designing LLM agents that improve task completion.
Abstract
While existing work focuses primarily on task success, we argue that
effective real-world agents require optimizing three dimensions: productivity
(task completion), proactivity (asking essential questions), and
personalization (adapting to diverse user preferences). We introduce UserVille,
an interactive environment with LLM-based user simulators enabling diverse,
configurable user preferences. Leveraging UserVille, we introduce PPP, a
multi-objective reinforcement learning approach that jointly optimizes all
three dimensions: Productivity, Proactivity, and Personalization. Experiments
on software engineering and deep research tasks show that agents trained with
PPP achieve substantial improvements over strong baselines such as GPT-5 (+21.6
on average), demonstrating the ability to ask strategic clarifying questions,
adapt to unseen user preferences, and improve task success through better
interaction. This work demonstrates that explicitly optimizing for
user-centered interaction is critical for building practical and effective AI
agents.
AI Summary
- The PPP multi-objective reinforcement learning framework significantly improves agent performance across productivity, proactivity, and personalization dimensions, achieving an average +21.6 improvement over strong baselines like GPT-5. [3]
- The "increase-then-decrease" learning dynamic for interaction number, where agents initially ask more questions and then refine them to be more targeted and low-effort, is crucial for efficient agent-user collaboration. [3]
- Productivity: The agent's ability to successfully complete the underlying task. [3]
- Personalization: The agent's ability to adapt its communication style to individual user preferences (e.g., brevity, question format, tone). [3]
- LLM agents require explicit optimization for Productivity, Proactivity, and Personalization, not solely task success, to achieve effective real-world human-agent interaction. [2]
- Agents trained with PPP learn strategic interaction, distinguishing between precise and vague prompts to ask clarifying questions only when necessary, and improving question quality over time. [2]
- The USERVILLE environment provides a scalable solution for training and evaluating user-centered agent interaction by simulating diverse user preferences and transforming precise tasks into vague prompts. [2]
- Optimizing for user-centered interaction (proactivity and personalization) can lead to better task success, as agents learn to ask targeted, low-effort questions that address true blockers. [2]
- PPP-trained agents demonstrate strong generalization capabilities, successfully adapting to unseen user preferences, different LLM-based user simulators, and more complex downstream tasks. [2]
- Proactivity: The agent's skill in asking essential clarifying questions when a user's request is underspecified, while avoiding unnecessary queries. [1]
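The abstract does not specify how the three objectives are combined, so the following is a minimal, hypothetical Python sketch of a PPP-style scalarized reward. The signal names, weights, and linear combination are illustrative assumptions, not the paper's actual formulation.

from dataclasses import dataclass

@dataclass
class EpisodeSignals:
    # Hypothetical per-episode signals; all names are assumptions.
    task_success: float      # productivity: 1.0 if the underlying task was completed
    question_utility: float  # proactivity: share of questions that resolved real blockers
    question_cost: float     # proactivity: penalty for unnecessary or high-effort questions
    preference_match: float  # personalization: adherence to the simulated user's preferences

def ppp_reward(s: EpisodeSignals, w_prod: float = 1.0,
               w_pro: float = 0.5, w_pers: float = 0.5) -> float:
    # Scalarize the three objectives into one signal for a multi-objective RL trainer.
    proactivity = s.question_utility - s.question_cost
    return w_prod * s.task_success + w_pro * proactivity + w_pers * s.preference_match

# Example: a successful episode with mostly useful questions and a good style match.
print(ppp_reward(EpisodeSignals(1.0, 0.8, 0.1, 0.9)))  # 1.0 + 0.35 + 0.45 = 1.8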
Shanghai Jiao Tong University
Why we think this paper is great for you:
This paper evaluates the ability of LLM agents to use diverse tools for complex, real-world problems, highlighting their problem-solving competence in practical scenarios. It provides a strong focus on AI agents as productivity tools.
Abstract
Large language model (LLM) agents have exhibited strong problem-solving
competence across domains like research and coding. Yet, it remains
underexplored whether LLM agents can tackle compounding real-world problems
that require a diverse set of tools to complete. Given a broad, heterogeneous
tool repository, LLM agents must not only select appropriate tools based on
task planning analysis but also strategically schedule the execution order to
ensure efficiency. This paper introduces TPS-Bench to benchmark the ability of
LLM agents in solving such problems that demand Tool Planning and Scheduling.
TPS-Bench collects 200 compounding tasks of two difficulty levels, based on a
tool repository containing hundreds of model context protocol (MCP) tools. In
particular, each task is composed of multiple subtasks, such as web search, map
navigation, calendar checking, etc., and each subtask can be completed by a
basic tool. Our evaluation emphasizes both task completion rate and efficiency.
The empirical studies on popular closed-source and open-source LLMs indicate
that most models can perform reasonable tool planning, but differ in
scheduling. For example, GLM-4.5 achieves a leading task completion rate of 64.72% through extensive sequential tool calls, and hence suffers from significantly long execution times. By contrast, GPT-4o prioritizes parallel tool calls but achieves only a 45.08% completion rate. Since reinforcement learning (RL) can be a viable way to improve scheduling efficiency without compromising performance, we perform an initial study on Qwen3-1.7B and observe a 14% reduction in execution time alongside a 6% gain in task completion rate using merely 100 RL training samples. Our code is available at https://github.com/hanwenxu1/mcp-agent.
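The sequential-versus-parallel scheduling gap the abstract describes is easy to see in a toy example. The sketch below is illustrative only: the tool names and latencies are invented, and asyncio.sleep stands in for real MCP tool calls.

import asyncio
import time

async def call_tool(name: str, latency: float) -> str:
    # Stand-in for a real MCP tool invocation (e.g., web search, map navigation).
    await asyncio.sleep(latency)
    return f"{name}: done"

async def sequential(tools):
    # One tool call at a time, as a purely sequential scheduler would do.
    return [await call_tool(name, latency) for name, latency in tools]

async def parallel(tools):
    # Independent subtasks dispatched concurrently.
    return await asyncio.gather(*(call_tool(name, latency) for name, latency in tools))

tools = [("web_search", 1.0), ("map_navigation", 1.0), ("calendar_check", 1.0)]
for schedule in (sequential, parallel):
    start = time.perf_counter()
    asyncio.run(schedule(tools))
    print(f"{schedule.__name__}: {time.perf_counter() - start:.1f}s")
# sequential takes ~3.0s; parallel takes ~1.0s for the same three subtasks.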
UT Dallas, CKGSB
Why we think this paper is great for you:
This paper offers valuable insights into how AI knowledge disseminates and contributes to productivity gains within firms. It explores the organizational conditions that facilitate these productive spillovers, providing an economic perspective on AI's impact.
Abstract
Labor mobility is a critical source of technology acquisition for firms. This
paper examines how artificial intelligence (AI) knowledge is disseminated
across firms through labor mobility and identifies the organizational
conditions that facilitate productive spillovers. Using a comprehensive dataset
of over 460 million job records from Revelio Labs (2010 to 2023), we construct
an inter-firm mobility network of AI workers among over 16,000 U.S. companies.
Estimating a Cobb-Douglas production function, we find that firms benefit
substantially from the AI investments of other firms from which they hire AI
talents, with productivity spillovers two to three times larger than those
associated with traditional IT after accounting for labor scale. Importantly,
these spillovers are contingent on organizational context: hiring from flatter firms that use lean-startup methods more intensively generates significant productivity gains, whereas hiring from firms lacking these traits yields little benefit.
Mechanism tests indicate that "flat and lean" organizations cultivate more
versatile AI generalists who transfer richer knowledge across firms. These
findings reveal that AI spillovers differ fundamentally from traditional IT
spillovers: while IT spillovers primarily arise from scale and process
standardization, AI spillovers critically depend on the experimental and
integrative environments in which AI knowledge is produced. Together, these
results underscore the importance of considering both labor mobility and
organizational context in understanding the full impact of AI-driven
productivity spillovers.
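As a rough guide to the kind of specification the abstract describes, a log-linearized Cobb-Douglas production function with a mobility-weighted AI spillover term might look as follows; the variable names and the exact construction of the spillover pool are assumptions for illustration, not the paper's estimating equation.

% Hypothetical log-linear specification for firm i in year t.
% Y: output, K: capital, L: labor, AI: own AI investment;
% m_{ijt}: share of firm i's AI hires coming from firm j (mobility network);
% mu_i, tau_t: firm and year fixed effects.
\begin{equation}
\ln Y_{it} = \alpha \ln K_{it} + \beta \ln L_{it} + \gamma \ln AI_{it}
  + \delta \sum_{j \neq i} m_{ijt} \ln AI_{jt} + \mu_i + \tau_t + \varepsilon_{it}
\end{equation}

Here $\delta$ captures the productivity spillover from the AI investments of talent-supplying firms, the quantity the abstract reports as two to three times larger than the analogous IT coefficient.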
ifak e.V.
Why we think this paper is great for you:
This paper explores the significant potential of Generative AI to enhance productivity across the software development lifecycle. It provides a vision for its application in practical engineering tasks, showcasing AI as a powerful productivity tool.
Abstract
Generative AI (GenAI) has recently emerged as a groundbreaking force in
Software Engineering, capable of generating code, suggesting fixes, and
supporting quality assurance. While its use in coding tasks shows considerable
promise, applying GenAI across the entire Software Development Life Cycle
(SDLC) has not yet been fully explored. Critical uncertainties in areas such as
reliability, accountability, security, and data privacy demand deeper
investigation and coordinated action. The GENIUS project, comprising over 30
European industrial and academic partners, aims to address these challenges by
advancing AI integration across all SDLC phases. It focuses on GenAI's
potential, the development of innovative tools, and emerging research
challenges, actively shaping the future of software engineering. This vision
paper presents a shared perspective on the future of GenAI-based software
engineering, grounded in cross-sector dialogue and experience within the GENIUS
consortium, supported by an exploratory literature review. The paper explores
four central elements: (1) a structured overview of current challenges in GenAI
adoption across the SDLC; (2) a forward-looking vision outlining key
technological and methodological advances expected over the next five years;
(3) anticipated shifts in the roles and required skill sets of software
professionals; and (4) the contribution of GENIUS in realizing this
transformation through practical tools and industrial validation. By aligning
technical innovation with business relevance, this paper aims to inform both
research agendas and industrial strategies, providing a foundation for
reliable, scalable, and industry-ready GenAI solutions for software engineering
teams.
Princeton University, RTX
Why we think this paper is great for you:
While not directly about productivity, this paper discusses critical aspects of LLM reliability and biases. Understanding these limitations is crucial for the effective and trustworthy deployment of LLMs as productivity tools.
Abstract
Recent research has shown that hallucinations, omissions, and biases are
prevalent in everyday use-cases of LLMs. However, chatbots used in medical
contexts must provide consistent advice in situations where non-medical factors
are involved, such as when demographic information is present. In order to
understand the conditions under which medical chatbots fail to perform as
expected, we develop an infrastructure that 1) automatically generates queries
to probe LLMs and 2) evaluates answers to these queries using multiple
LLM-as-a-judge setups and prompts. For 1), our prompt creation pipeline samples
the space of patient demographics, histories, disorders, and writing styles to
create realistic questions that we subsequently use to prompt LLMs. In 2), our
evaluation pipeline provides hallucination and omission detection using
LLM-as-a-judge as well as agentic workflows, in addition to LLM-as-a-judge
treatment category detectors. As a baseline study, we perform two case studies
on inter-LLM agreement and the impact of varying the answering and evaluation
LLMs. We find that LLM annotators exhibit low agreement scores (average Cohen's
Kappa $\kappa=0.118$), and only specific (answering, evaluation) LLM pairs
yield statistically significant differences across writing styles, genders, and
races. We recommend that studies using LLM evaluation use multiple LLMs as
evaluators in order to avoid arriving at statistically significant but
non-generalizable results, particularly in the absence of ground-truth data. We
also suggest publishing inter-LLM agreement metrics for transparency. Our code
and dataset are available here:
https://github.com/BBN-E/medic-neurips-2025-demo.
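The low inter-annotator agreement the abstract reports (average Cohen's kappa of 0.118) can be checked conceptually in a few lines of Python; the verdicts below are invented, and scikit-learn's cohen_kappa_score is one standard implementation of the statistic.

from sklearn.metrics import cohen_kappa_score

# Hypothetical binary hallucination verdicts from two LLM-as-a-judge
# evaluators over the same set of medical-chatbot answers
# (0 = no hallucination flagged, 1 = hallucination flagged).
judge_a = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]
judge_b = [0, 0, 0, 1, 1, 0, 0, 1, 1, 0]

kappa = cohen_kappa_score(judge_a, judge_b)
print(f"Cohen's kappa: {kappa:.3f}")  # values near 0 indicate agreement barely above chance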
University of Lagos
Why we think this paper is great for you:
This paper examines "productivity" in an academic research context through scientometric analysis, offering a different perspective on how output and impact are measured. It provides a broader understanding of productivity metrics.
Abstract
This paper presents a scientometric analysis of research output from the
University of Lagos, focusing on the two decades spanning 2004 to 2023. Using
bibliometric data retrieved from the Web of Science, we examine trends in
publication volume, collaboration patterns, citation impact, and the most
prolific authors, departments, and research domains at the university. The
study reveals a consistent increase in research productivity, with the highest
publication output recorded in 2023. Health Sciences, Engineering, and Social
Sciences are identified as dominant fields, reflecting the university's
interdisciplinary research strengths. Collaborative efforts, both locally and
internationally, show a positive correlation with higher citation impact, with
the United States and the United Kingdom being the leading international
collaborators. Notably, open-access publications account for a significant
portion of the university's research output, enhancing visibility and citation
rates. The findings offer valuable insights into the university's research
performance over the past two decades, providing a foundation for strategic
planning and policy formulation to foster research excellence and global
impact.