Hi!

Your personalized paper recommendations for 17–21 November 2025.
🎯 Top Personalized Recommendations
USTC
Why we think this paper is great for you:
This paper directly addresses the crucial aspect of training powerful agents using large language models. You will find its exploration of reinforcement learning for end-to-end agent development highly pertinent to your work.
Abstract
Large Language Models (LLMs) are increasingly being explored for building Agents capable of active environmental interaction (e.g., via tool use) to solve complex problems. Reinforcement Learning (RL) is considered a key technology with significant potential for training such Agents; however, the effective application of RL to LLM Agents is still in its nascent stages and faces considerable challenges. Currently, this emerging field lacks in-depth exploration into RL approaches specifically tailored for the LLM Agent context, alongside a scarcity of flexible and easily extensible training frameworks designed for this purpose. To help advance this area, this paper first revisits and clarifies Reinforcement Learning methodologies for LLM Agents by systematically extending the Markov Decision Process (MDP) framework to comprehensively define the key components of an LLM Agent. Secondly, we introduce Agent-R1, a modular, flexible, and user-friendly training framework for RL-based LLM Agents, designed for straightforward adaptation across diverse task scenarios and interactive environments. We conducted experiments on Multihop QA benchmark tasks, providing initial validation for the effectiveness of our proposed methods and framework.
AI Summary
  • Agent-R1 introduces a modular and flexible training framework, distinguishing between `Tool` (atomic action executor) and `ToolEnv` (environment orchestrator/reward calculator), enabling easy integration of diverse external functionalities and task scenarios. [3]
  • The framework incorporates a dense reward structure, utilizing intermediate "process rewards" in addition to final outcome rewards, providing more frequent and granular feedback to guide agent learning effectively in multi-turn interactions. [3]
  • Agent-R1 employs an "Action Mask" to precisely delineate agent-generated tokens from environmental feedback, ensuring that policy optimization (Actor Loss) and advantage calculations are attributed only to the agent's learnable actions (see the sketch after this list). [3]
  • Empirical results on Multi-hop QA demonstrate that RL-trained agents using Agent-R1 significantly outperform strong baselines (Naive RAG and Base Tool Call) by approximately 2.5 times, validating the framework's efficacy in complex, interactive tasks. [3]
  • Ablation studies confirm the critical importance of both the "loss mask" and "advantage mask" for effective policy optimization, highlighting that precise credit assignment and gradient focusing on agent actions are essential for performance gains. [3]
  • Agent-R1 supports various RL algorithms (PPO, GRPO, RLOO, REINFORCE++ variants), showcasing its adaptability and robustness as a general-purpose framework for training LLM agents across different optimization strategies. [3]
  • The paper systematically extends the Markov Decision Process (MDP) framework for LLM Agents, explicitly defining how state, action, transition, and reward components adapt to multi-turn, interactive environments, which is crucial for principled RL application. [2]
  • LLM Agent (MDP Perspective): An LLM operating within an extended Markov Decision Process where the state space retains full multi-turn interaction history and environmental feedback, actions can trigger external tool use, state transitions incorporate environmental stochasticity, and rewards are dense, including intermediate process rewards. [2]
  • State Space (for LLM Agent): A comprehensive representation `s_t = (w_p, T_1, ..., T_k, T_{k+1}^{partial})` that includes the initial prompt, a history of complete interaction turns (agent actions + environmental feedback), and the partially generated sequence of the current turn. [2]
  • Action Space (for LLM Agent): While fundamentally token generation, specific sequences of tokens are interpreted as commands to invoke external tools or APIs, enabling active environmental intervention beyond mere text production. [2]
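For intuition, here is a minimal sketch (not the Agent-R1 API) of how a token-level action mask could confine the policy-gradient loss to agent-generated tokens. The tensor shapes and the plain REINFORCE-style surrogate are assumptions made for illustration.

```python
# Illustrative sketch: restrict the policy-gradient loss to agent-generated tokens,
# excluding tool/environment feedback tokens that appear in the same sequence.
import torch

def masked_policy_loss(logprobs, advantages, action_mask):
    """logprobs:    (batch, seq_len) log-probabilities of the sampled tokens
    advantages:  (batch, seq_len) per-token advantage estimates
    action_mask: (batch, seq_len) 1.0 for agent-generated tokens, 0.0 for tokens
                 injected by the environment (tool outputs, observations)."""
    # REINFORCE-style surrogate; a PPO variant would clip the probability ratio instead.
    per_token = -logprobs * advantages
    # Zero out environment tokens so gradients flow only through agent actions.
    masked = per_token * action_mask
    # Normalize by the number of agent tokens, not the full sequence length.
    return masked.sum() / action_mask.sum().clamp(min=1.0)
```

In a full setup the same mask would also gate advantage estimation, which is what the loss-mask/advantage-mask distinction in the ablation bullet above refers to.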
Amazon
Why we think this paper is great for you:
This research offers valuable insights into the dynamics of multi-agent systems powered by large language models. It will help you understand and analyze complex behaviors, including uncooperative ones, within these systems.
Abstract
This paper introduces a novel framework for simulating and analyzing how uncooperative behaviors can destabilize or collapse LLM-based multi-agent systems. Our framework includes two key components: (1) a game theory-based taxonomy of uncooperative agent behaviors, addressing a notable gap in the existing literature; and (2) a structured, multi-stage simulation pipeline that dynamically generates and refines uncooperative behaviors as agents' states evolve. We evaluate the framework via a collaborative resource management setting, measuring system stability using metrics such as survival time and resource overuse rate. Empirically, our framework achieves 96.7% accuracy in generating realistic uncooperative behaviors, validated by human evaluations. Our results reveal a striking contrast: cooperative agents maintain perfect system stability (100% survival over 12 rounds with 0% resource overuse), while any uncooperative behavior can trigger rapid system collapse within 1 to 7 rounds. These findings demonstrate that uncooperative agents can significantly degrade collective outcomes, highlighting the need for designing more resilient multi-agent systems.
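As a rough illustration of the two stability metrics named in this abstract (survival time and resource overuse rate), here is a hypothetical common-pool resource loop. The regeneration rule, round count, and `decide_harvest` interface are invented for the sketch and are not the paper's actual pipeline.

```python
# Hypothetical shared-resource loop illustrating survival time and overuse rate.
def run_simulation(agents, initial_stock=100.0, regen_rate=0.1, rounds=12):
    stock, overuse_rounds, survived = initial_stock, 0, 0
    for t in range(rounds):
        # Each (LLM-backed) agent decides how much to harvest this round.
        harvest = sum(agent.decide_harvest(stock) for agent in agents)
        if harvest > regen_rate * stock:      # taking more than the pool regenerates
            overuse_rounds += 1
        stock = max(0.0, stock - harvest) * (1.0 + regen_rate)
        if stock <= 0.0:                      # collapse: resource exhausted
            break
        survived = t + 1
    return {"survival_time": survived,
            "overuse_rate": overuse_rounds / (t + 1)}
```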
Meta
Why we think this paper is great for you:
This paper delves into the key factors that contribute to the success of AI agents in research settings. You will gain a deeper understanding of what makes an effective AI research agent and how to foster their capabilities.
Abstract
AI research agents offer the promise to accelerate scientific progress by automating the design, implementation, and training of machine learning models. However, the field is still in its infancy, and the key factors driving the success or failure of agent trajectories are not fully understood. We examine the role that ideation diversity plays in agent performance. First, we analyse agent trajectories on MLE-bench, a well-known benchmark to evaluate AI research agents, across different models and agent scaffolds. Our analysis reveals that different models and agent scaffolds yield varying degrees of ideation diversity, and that higher-performing agents tend to have increased ideation diversity. Further, we run a controlled experiment where we modify the degree of ideation diversity, demonstrating that higher ideation diversity results in stronger performance. Finally, we strengthen our results by examining additional evaluation metrics beyond the standard medal-based scoring of MLE-bench, showing that our findings still hold across other agent performance metrics.
Stanford University
Why we think this paper is great for you:
You will be interested in this paper's practical exploration of AI agents serving as authors and reviewers in scientific contexts. It provides a unique perspective on the capabilities and applications of agents in scientific research.
Abstract
There is growing interest in using AI agents for scientific research, yet fundamental questions remain about their capabilities as scientists and reviewers. To explore these questions, we organized Agents4Science, the first conference in which AI agents serve as both primary authors and reviewers, with humans as co-authors and co-reviewers. Here, we discuss the key learnings from the conference and their implications for human-AI collaboration in science.
The University of Edinburgh
Why we think this paper is great for you:
This paper introduces a comprehensive environment designed for testing and developing intelligent agents. It offers a valuable resource for evaluating agent performance across a wide range of challenges.
Abstract
We introduce Terra Nova, a new comprehensive challenge environment (CCE) for reinforcement learning (RL) research inspired by Civilization V. A CCE is a single environment in which multiple canonical RL challenges (e.g., partial observability, credit assignment, representation learning, enormous action spaces, etc.) arise simultaneously. Mastery therefore demands integrated, long-horizon understanding across many interacting variables. We emphasize that this definition excludes challenges that only aggregate unrelated tasks in independent, parallel streams (e.g., learning to play all Atari games at once). These aggregated multitask benchmarks primarily assess whether an agent can catalog and switch among unrelated policies rather than test an agent's ability to perform deep reasoning across many interacting challenges.
ulamai
Why we think this paper is great for you:
This paper presents a novel scale for measuring the progression of autonomous AI, which is fundamental to understanding the development of agents. It provides a structured framework for assessing their capabilities and advancement.
Abstract
We propose a Kardashev-inspired yet operational Autonomous AI (AAI) Scale that measures the progression from fixed robotic process automation (AAI-0) to full artificial general intelligence (AAI-4) and beyond. Unlike narrative ladders, our scale is multi-axis and testable. We define ten capability axes (Autonomy, Generality, Planning, Memory/Persistence, Tool Economy, Self-Revision, Sociality/Coordination, Embodiment, World-Model Fidelity, Economic Throughput) aggregated by a composite AAI-Index (a weighted geometric mean). We introduce a measurable Self-Improvement Coefficient κ (capability growth per unit of agent-initiated resources) and two closure properties (maintenance and expansion) that convert "self-improving AI" into falsifiable criteria. We specify OWA-Bench, an open-world agency benchmark suite that evaluates long-horizon, tool-using, persistent agents. We define level gates for AAI-0 through AAI-4 using thresholds on the axes, κ, and closure proofs. Synthetic experiments illustrate how present-day systems map onto the scale and how the delegability frontier (quality vs. autonomy) advances with self-improvement. We also prove a theorem that an AAI-3 agent becomes AAI-5 over time under sufficient conditions, formalizing the intuition that a "baby AGI" becomes a superintelligence.
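A small sketch of the aggregation described here: the AAI-Index as a weighted geometric mean over the ten named axes, plus κ as capability growth per unit of agent-initiated resources. The scores, weights, and function names are illustrative assumptions, not the paper's reference implementation.

```python
# Sketch: composite AAI-Index as a weighted geometric mean over ten capability axes.
import math

AXES = ["autonomy", "generality", "planning", "memory_persistence",
        "tool_economy", "self_revision", "sociality_coordination",
        "embodiment", "world_model_fidelity", "economic_throughput"]

def aai_index(scores, weights=None):
    """Weighted geometric mean of per-axis scores in (0, 1]."""
    if weights is None:
        weights = {a: 1.0 / len(AXES) for a in AXES}   # equal weights by default
    total_w = sum(weights.values())
    return math.exp(sum(weights[a] * math.log(scores[a]) for a in AXES) / total_w)

def self_improvement_coefficient(capability_gain, agent_initiated_resources):
    """Îș: capability growth per unit of agent-initiated resources."""
    return capability_gain / agent_initiated_resources

# Example with made-up axis scores:
scores = dict(zip(AXES, [0.6, 0.4, 0.5, 0.3, 0.7, 0.2, 0.5, 0.1, 0.4, 0.3]))
print(round(aai_index(scores), 3))
```

A geometric mean penalizes systems that are strong on a few axes but near zero on others, which matches the abstract's emphasis on gating levels across all axes rather than averaging them away.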
Princeton University
Why we think this paper is great for you:
This research explores the efficiency of agents in scientific discovery, framing it as a thermodynamic process. It offers a unique perspective on how agents acquire information within scientific automation.
Abstract
Scientific discovery can be framed as a thermodynamic process in which an agent invests physical work to acquire information about an environment under a finite work budget. Using established results about the thermodynamics of computing, we derive finite-budget bounds on information gain over rounds of sequential Bayesian learning. We also propose a metric of information-work efficiency, and compare unpartitioned and federated learning strategies under matched work budgets. The presented results offer guidance in the form of bounds and an information efficiency metric for efforts in scientific automation at large.
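For orientation, a Landauer-style version of the kind of finite-budget bound this abstract alludes to; the paper's exact bound and efficiency metric may take a different form.

```latex
% Landauer's principle: irreversibly acquiring/erasing one bit costs at least k_B T ln 2 of work.
% Under a total work budget W spent over R rounds of Bayesian updating, the cumulative
% information gain (in bits) would then be bounded by
\[ \sum_{r=1}^{R} \Delta I_r \;\le\; \frac{W}{k_B T \ln 2}, \]
% which suggests an information-work efficiency of the form
\[ \eta \;=\; \frac{k_B T \ln 2 \, \sum_{r} \Delta I_r}{W} \;\le\; 1. \]
```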
📝 Consider adding more interests!
You currently have 2 interests registered. Adding more interests will help us provide better and more diverse paper recommendations.

Add More Interests

We did not find much content matching your interests, so we have included some additional topics that are popular. Also be aware that if a topic is not present on arXiv, we will not be able to recommend it.

AI and Society
University of Copenhagen
Abstract
In this paper, we argue that current AI research operates on a spectrum between two different underlying conceptions of intelligence: Intelligence Realism, which holds that intelligence represents a single, universal capacity measurable across all systems, and Intelligence Pluralism, which views intelligence as diverse, context-dependent capacities that cannot be reduced to a single universal measure. Through an analysis of current debates in AI research, we demonstrate how the conceptions remain largely implicit yet fundamentally shape how empirical evidence gets interpreted across a wide range of areas. These underlying views generate fundamentally different research approaches across three areas. Methodologically, they produce different approaches to model selection, benchmark design, and experimental validation. Interpretively, they lead to contradictory readings of the same empirical phenomena, from capability emergence to system limitations. Regarding AI risk, they generate categorically different assessments: realists view superintelligence as the primary risk and search for unified alignment solutions, while pluralists see diverse threats across different domains requiring context-specific solutions. We argue that making explicit these underlying assumptions can contribute to a clearer understanding of disagreements in AI research.
Huawei
Abstract
As a capability arising from computation, how does AI differ fundamentally from the capabilities delivered by rule-based software programs? This paper examines the behavior of artificial intelligence (AI) from an engineering point of view to clarify its nature and limits. It argues that the rationality underlying humanity's impulse to pursue, articulate, and adhere to rules deserves to be valued and preserved, and that identifying where rule-based practical rationality ends is the first step toward acting on that awareness. Although the rules governing AI behavior are still hidden or only weakly observable, the paper proposes a methodology that makes it possible and practical to discriminate the behavior of AI models according to three types of decisions. This is a prerequisite for human responsibility with alternative possibilities when considering how and when to use AI, and a solid starting point for ensuring the soundness of AI systems for the well-being of humans, society, and the environment.
Research Automation with AI
Princeton University
Abstract
Scientific discovery can be framed as a thermodynamic process in which an agent invests physical work to acquire information about an environment under a finite work budget. Using established results about the thermodynamics of computing, we derive finite-budget bounds on information gain over rounds of sequential Bayesian learning. We also propose a metric of information-work efficiency, and compare unpartitioned and federated learning strategies under matched work budgets. The presented results offer guidance in the form of bounds and an information efficiency metric for efforts in scientific automation at large.
AGI: Artificial General Intelligence
ulamai
Abstract
We propose a Kardashev-inspired yet operational Autonomous AI (AAI) Scale that measures the progression from fixed robotic process automation (AAI-0) to full artificial general intelligence (AAI-4) and beyond. Unlike narrative ladders, our scale is multi-axis and testable. We define ten capability axes (Autonomy, Generality, Planning, Memory/Persistence, Tool Economy, Self-Revision, Sociality/Coordination, Embodiment, World-Model Fidelity, Economic Throughput) aggregated by a composite AAI-Index (a weighted geometric mean). We introduce a measurable Self-Improvement Coefficient κ (capability growth per unit of agent-initiated resources) and two closure properties (maintenance and expansion) that convert "self-improving AI" into falsifiable criteria. We specify OWA-Bench, an open-world agency benchmark suite that evaluates long-horizon, tool-using, persistent agents. We define level gates for AAI-0 through AAI-4 using thresholds on the axes, κ, and closure proofs. Synthetic experiments illustrate how present-day systems map onto the scale and how the delegability frontier (quality vs. autonomy) advances with self-improvement. We also prove a theorem that an AAI-3 agent becomes AAI-5 over time under sufficient conditions, formalizing the intuition that a "baby AGI" becomes a superintelligence.
The University of Edinburgh
Abstract
We introduce Terra Nova, a new comprehensive challenge environment (CCE) for reinforcement learning (RL) research inspired by Civilization V. A CCE is a single environment in which multiple canonical RL challenges (e.g., partial observability, credit assignment, representation learning, enormous action spaces, etc.) arise simultaneously. Mastery therefore demands integrated, long-horizon understanding across many interacting variables. We emphasize that this definition excludes challenges that only aggregate unrelated tasks in independent, parallel streams (e.g., learning to play all Atari games at once). These aggregated multitask benchmarks primarily assess whether an agent can catalog and switch among unrelated policies rather than test an agent's ability to perform deep reasoning across many interacting challenges.
Deep Learning
HFUT
Abstract
Weather forecasting is fundamentally challenged by the chaotic nature of the atmosphere, necessitating probabilistic approaches to quantify uncertainty. While traditional ensemble prediction (EPS) addresses this through computationally intensive simulations, recent advances in Bayesian Deep Learning (BDL) offer a promising but often disconnected alternative. We bridge these paradigms through a unified hybrid Bayesian Deep Learning framework for ensemble weather forecasting that explicitly decomposes predictive uncertainty into epistemic and aleatoric components, learned via variational inference and a physics-informed stochastic perturbation scheme modeling flow-dependent atmospheric dynamics, respectively. We further establish a unified theoretical framework that rigorously connects BDL and EPS, providing formal theorems that decompose total predictive uncertainty into epistemic and aleatoric components under the hybrid BDL framework. We validate our framework on the large-scale 40-year ERA5 reanalysis dataset (1979-2019) with 0.25° spatial resolution. Experimental results show that our method not only improves forecast accuracy and yields better-calibrated uncertainty quantification but also achieves superior computational efficiency compared to state-of-the-art probabilistic diffusion models. We commit to making our code open-source upon acceptance of this paper.
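The epistemic/aleatoric split mentioned here typically follows the law of total variance over posterior samples or ensemble members; a minimal sketch under that assumption (not necessarily the paper's exact theorem):

```python
import numpy as np

def decompose_uncertainty(member_means, member_vars):
    """member_means, member_vars: arrays of shape (n_members, ...) holding each
    posterior-sample/ensemble prediction (mean) and its predicted noise variance."""
    aleatoric = member_vars.mean(axis=0)   # average data noise across members
    epistemic = member_means.var(axis=0)   # spread of member means (model uncertainty)
    total = aleatoric + epistemic          # law of total variance
    return total, epistemic, aleatoric
```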
Tel Aviv University
Abstract
We analyze an ensemble-based approach for uncertainty quantification (UQ) in atomistic neural networks. This method generates an epistemic uncertainty signal without requiring changes to the underlying multi-headed regression neural network architecture, making it suitable for sealed or black-box models. We apply this method to molecular systems, specifically sodium (Na) and aluminum (Al), under various temperature conditions. By scaling the uncertainty signal, we account for heteroscedasticity in the data. We demonstrate the robustness of the scaled UQ signal for detecting out-of-distribution (OOD) behavior in several scenarios. This UQ signal also correlates with model convergence during training, providing an additional tool for optimizing the training process.
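A brief sketch of an ensemble-spread uncertainty signal with a simple scale calibration for heteroscedasticity; the calibration rule and threshold-based OOD flagging below are assumptions, not the paper's exact procedure.

```python
import numpy as np

def epistemic_signal(member_preds):
    """member_preds: (n_members, n_samples) predictions from independently trained heads/models."""
    return member_preds.std(axis=0)        # raw ensemble-spread uncertainty

def fit_scale(raw_signal, val_errors):
    """Scale factor so the signal tracks actual error magnitude on in-distribution data."""
    return np.mean(np.abs(val_errors)) / (np.mean(raw_signal) + 1e-12)

def scaled_signal(member_preds, scale):
    return scale * epistemic_signal(member_preds)

# Possible OOD flagging, with a threshold chosen on held-out in-distribution data:
# flag = scaled_signal(preds, scale) > threshold
```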