University of Vienna
AI Insights - The paper analyzes the optimization challenges that arise when training deep reinforcement learning agents with hyperbolic feature representations. [3]
- The paper proposes Hyper++, a hyperbolic PPO agent built from three components: a categorical value loss for stable critic training, feature regularization that keeps embedding norms bounded without clipping, and a more optimization-friendly formulation of hyperbolic network layers. [3]
- Experiments on ProcGen (with PPO) and Atari-5 (with Double DQN) show that Hyper++ learns stably, outperforms Euclidean and prior hyperbolic agents, and reduces wall-clock time by roughly 30% on ProcGen. [3]
- Hyperbolic space: A mathematical space with non-Euclidean geometry where distances between points are measured using a hyperbolic metric. [3]
- Gradient descent: An optimization algorithm used to minimize the loss function of a neural network by iteratively adjusting the model's parameters. [3]
- Trust region methods: A class of optimization algorithms that restrict each parameter update to a region around the current iterate within which a local model of the objective is trusted to be accurate. [3]
- Deep reinforcement learning: A subfield of machine learning that uses deep neural networks to learn policies for sequential decision-making tasks. [3]
- By keeping embedding norms bounded, the proposed agent avoids the destabilizing gradients and trust-region violations in PPO that hamper prior hyperbolic deep RL agents. [3]
- Hyperbolic spaces naturally capture hierarchical and relational structure in complex environments, which can improve the feature representations learned by neural networks. [2]
Abstract
The performance of reinforcement learning (RL) agents depends critically on the quality of the underlying feature representations. Hyperbolic feature spaces are well-suited for this purpose, as they naturally capture hierarchical and relational structure often present in complex RL environments. However, leveraging these spaces commonly faces optimization challenges due to the nonstationarity of RL. In this work, we identify key factors that determine the success and failure of training hyperbolic deep RL agents. By analyzing the gradients of core operations in the Poincaré Ball and Hyperboloid models of hyperbolic geometry, we show that large-norm embeddings destabilize gradient-based training, leading to trust-region violations in proximal policy optimization (PPO). Based on these insights, we introduce Hyper++, a new hyperbolic PPO agent that consists of three components: (i) stable critic training through a categorical value loss instead of regression; (ii) feature regularization guaranteeing bounded norms while avoiding the curse of dimensionality from clipping; and (iii) using a more optimization-friendly formulation of hyperbolic network layers. In experiments on ProcGen, we show that Hyper++ guarantees stable learning, outperforms prior hyperbolic agents, and reduces wall-clock time by approximately 30%. On Atari-5 with Double DQN, Hyper++ strongly outperforms Euclidean and hyperbolic baselines. We release our code at https://github.com/Probabilistic-and-Interactive-ML/hyper-rl .
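To make the gradient issue concrete, below is a minimal sketch (not the paper's code) of the Poincaré-ball distance for curvature 1, together with a purely illustrative finite-difference helper; all function names here are hypothetical. It shows how the gradient scale blows up as an embedding's norm approaches the boundary of the ball, which is the instability that Hyper++'s bounded-norm feature regularization is meant to avoid.

```python
# Illustrative sketch: the gradient scale of the Poincaré-ball distance grows
# rapidly as the embedding norm approaches 1 (the boundary of the ball).
import numpy as np

def poincare_distance(x, y):
    """Geodesic distance between points x, y inside the unit ball (curvature 1)."""
    sq_dist = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * sq_dist / denom)

def grad_norm_wrt_x(x, y, eps=1e-6):
    """Central finite-difference estimate of ||d d(x, y) / d x||."""
    g = np.zeros_like(x)
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        g[i] = (poincare_distance(x + dx, y) - poincare_distance(x - dx, y)) / (2 * eps)
    return np.linalg.norm(g)

y = np.array([0.1, 0.0])
for r in [0.5, 0.9, 0.99, 0.999]:              # embedding norm approaching the boundary
    x = np.array([r, 0.0])
    print(f"||x|| = {r:<5}  d = {poincare_distance(x, y):7.3f}  "
          f"||grad|| ~= {grad_norm_wrt_x(x, y):10.1f}")
```

Large gradient norms of this kind are what drive the trust-region violations in PPO that the paper identifies.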
Why are we recommending this paper?
Due to your Interest in: Deep Learning for Reinforcement Learning
This paper directly addresses the crucial role of feature representations in reinforcement learning, a key area of interest. Exploring hyperbolic spaces for RL aligns with the user's focus on optimizing agent performance within complex environments.
UIUC
AI Insights - The four adaptation paradigms in agentic AI are A1 (agent adaptation with tool-execution results as the signal), A2 (agent adaptation with the agent's own output as the signal), T1 (agent-agnostic tool adaptation), and T2 (agent-supervised tool adaptation, driven by the agent's output). [3]
- A1 methods use the actual outcomes of external tool invocations as feedback to refine an agent's behavior. [3]
- Recent A1 methods include Toolformer, TRICE, Gorilla, ToolAlpaca, and others, which have achieved state-of-the-art performance on various tasks such as question-answering, math reasoning, and web search. [3]
- The RLVR (Reinforcement Learning with Verifiable Rewards) framework is a key component of many recent A1 methods, allowing for more efficient learning and better generalization. [3]
- A2 methods focus on evaluating an agent's own outputs, rather than relying on tool execution results as feedback. [3]
- The development timeline of A1 methods shows a shift from earlier approaches based on SFT (supervised fine-tuning) and DPO (Direct Preference Optimization) to more recent RLVR-based methods. [3]
- Recent A1 methods have achieved state-of-the-art performance on various tasks, including question-answering, math reasoning, web search, and text-to-SQL. [3]
- The development timeline of A1 methods shows a rapid growth in research, with many new methods being proposed between 2023 and 2025. [2]
- T1 and T2 methods adapt the tools themselves, either independently of the agent (agent-agnostic) or under the agent's supervision, which is useful when the agent needs to interact with multiple tools or environments. [1]
Abstract
Cutting-edge agentic AI systems are built on foundation models that can be adapted to plan, reason, and interact with external tools to perform increasingly complex and specialized tasks. As these systems grow in capability and scope, adaptation becomes a central mechanism for improving performance, reliability, and generalization. In this paper, we unify the rapidly expanding research landscape into a systematic framework that spans both agent adaptations and tool adaptations. We further decompose these into tool-execution-signaled and agent-output-signaled forms of agent adaptation, as well as agent-agnostic and agent-supervised forms of tool adaptation. We demonstrate that this framework helps clarify the design space of adaptation strategies in agentic AI, makes their trade-offs explicit, and provides practical guidance for selecting or switching among strategies during system design. We then review the representative approaches in each category, analyze their strengths and limitations, and highlight key open challenges and future opportunities. Overall, this paper aims to offer a conceptual foundation and practical roadmap for researchers and practitioners seeking to build more capable, efficient, and reliable agentic AI systems.
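As a reading aid, here is a small, hypothetical Python rendering of the A1/A2/T1/T2 design space described above; the class and field names are invented for illustration and are not from the paper or its code.

```python
# Hypothetical encoding of the four adaptation paradigms: what is adapted
# (agent vs. tool) and which signal drives the adaptation.
from dataclasses import dataclass
from enum import Enum

class Target(Enum):
    AGENT = "agent"
    TOOL = "tool"

class Signal(Enum):
    TOOL_EXECUTION_RESULT = "tool-execution results"
    AGENT_OUTPUT = "the agent's own outputs"
    NONE = "no agent signal (agent-agnostic)"

@dataclass(frozen=True)
class AdaptationParadigm:
    name: str
    target: Target    # what gets adapted
    signal: Signal    # what feedback drives the adaptation

PARADIGMS = (
    AdaptationParadigm("A1", Target.AGENT, Signal.TOOL_EXECUTION_RESULT),
    AdaptationParadigm("A2", Target.AGENT, Signal.AGENT_OUTPUT),
    AdaptationParadigm("T1", Target.TOOL, Signal.NONE),          # agent-agnostic tool adaptation
    AdaptationParadigm("T2", Target.TOOL, Signal.AGENT_OUTPUT),  # agent-supervised tool adaptation
)

for p in PARADIGMS:
    print(f"{p.name}: adapt the {p.target.value}, driven by {p.signal.value}")
```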
Why are we recommending this paper?
Due to your Interest in: Agentic RL
This work centers on adaptation, a core component of agentic AI systems, which is a significant area of interest for the user. The paper's focus on improving performance through adaptation aligns directly with the user's interest in agentic RL.
Georgia Tech
AI Insights - The authors present three formulations of spectral representations: linear, energy-based, and Gaussian process-based, each with its own advantages and limitations. [3]
- PAC (Probably Approximately Correct): A guarantee that an algorithm will produce a solution that is close to the optimal one with high probability. [3]
- The paper discusses a framework for spectral representations in reinforcement learning (RL), which provides a unified approach to representing value functions and policies using kernel methods. [2]
Abstract
In real-world applications with large state and action spaces, reinforcement learning (RL) typically employs function approximations to represent core components like the policies, value functions, and dynamics models. Although powerful approximations such as neural networks offer great expressiveness, they often present theoretical ambiguities, suffer from optimization instability and exploration difficulty, and incur substantial computational costs in practice. In this paper, we introduce the perspective of spectral representations as a solution to address these difficulties in RL. Stemming from the spectral decomposition of the transition operator, this framework yields an effective abstraction of the system dynamics for subsequent policy optimization while also providing a clear theoretical characterization. We reveal how to construct spectral representations for transition operators that possess latent variable structures or energy-based structures, which implies different learning methods to extract spectral representations from data. Notably, each of these learning methods realizes an effective RL algorithm under this framework. We also provably extend this spectral view to partially observable MDPs. Finally, we validate these algorithms on over 20 challenging tasks from the DeepMind Control Suite, where they achieve performances comparable or superior to current state-of-the-art model-free and model-based baselines.
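For intuition, the following tabular sketch (a toy random MDP with a truncated SVD standing in for the learned spectral factorization; it is not the paper's algorithm) shows the core idea: factor the transition operator and reuse the resulting factors as a feature basis for value functions.

```python
# Toy illustration: spectral features from a transition matrix as a value basis.
import numpy as np

rng = np.random.default_rng(0)
n_states, k = 20, 5                                  # toy MDP size, number of spectral features

# Random row-stochastic matrix standing in for the (unknown) transition operator.
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)

# Spectral factorization via truncated SVD: P ~= U_k diag(s_k) V_k^T.
U, s, Vt = np.linalg.svd(P)
Phi = U[:, :k] * s[:k]                               # state features from the left factors

# Check how well the features represent a value function: compute the exact
# value V = (I - gamma * P)^{-1} r for a random reward, then least-squares
# project it onto the spectral basis Phi.
gamma = 0.95
r = rng.random(n_states)
V_exact = np.linalg.solve(np.eye(n_states) - gamma * P, r)
w, *_ = np.linalg.lstsq(Phi, V_exact, rcond=None)
rel_err = np.linalg.norm(Phi @ w - V_exact) / np.linalg.norm(V_exact)
print(f"relative value-approximation error with {k} spectral features: {rel_err:.4f}")
```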
Why are we recommending this paper?
Due to your Interest in: Reinforcement Learning
This paper investigates function approximation techniques, a fundamental aspect of reinforcement learning. The use of spectral representations is relevant to the user's interest in efficient representation learning for RL.
Sapienza University of Rome
AI Insights - The paper discusses a model-based reinforcement learning algorithm called QR-MAX, which is designed for discrete-action non-Markovian reward decision processes. [3]
- QR-MAX is shown to be PAC-MDP (Probably Approximately Correct in Markov Decision Processes), meaning that it can deliver an ε-optimal policy with high probability after a finite number of environment interactions. [3]
- The algorithm's performance is evaluated on several benchmark tasks, demonstrating its ability to learn optimal policies in complex environments. [3]
- The paper also discusses the extension of QR-MAX to continuous-state settings using bucketing, which allows for efficient discretization of the state space. [3]
- QR-MAX uses a combination of optimistic planning and value iteration to find an optimal policy in the Markov decision process (MDP) induced by the environment. [2]
- The algorithm maintains estimates of the transition dynamics and rewards and treats a state-action pair as known only once its visit count exceeds a threshold, ensuring that the estimates used for planning are within a certain margin of error. [1]
Abstract
Many practical decision-making problems involve tasks whose success depends on the entire system history, rather than on achieving a state with desired properties. Markovian Reinforcement Learning (RL) approaches are not suitable for such tasks, while RL with non-Markovian reward decision processes (NMRDPs) enables agents to tackle temporal-dependency tasks. This approach has long been known to lack formal guarantees on both (near-)optimality and sample efficiency. We contribute to solving both issues with QR-MAX, a novel model-based algorithm for discrete NMRDPs that factorizes Markovian transition learning from non-Markovian reward handling via reward machines. To the best of our knowledge, this is the first model-based RL algorithm for discrete-action NMRDPs that exploits this factorization to obtain PAC convergence to $\varepsilon$-optimal policies with polynomial sample complexity. We then extend QR-MAX to continuous state spaces with Bucket-QR-MAX, a SimHash-based discretiser that preserves the same factorized structure and achieves fast and stable learning without manual gridding or function approximation. We experimentally compare our method with modern state-of-the-art model-based RL approaches on environments of increasing complexity, showing a significant improvement in sample efficiency and increased robustness in finding optimal policies.
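To ground the optimistic, threshold-based scheme sketched in the insights above, here is a minimal R-MAX-style planner for a plain tabular MDP. It is not QR-MAX itself, which additionally factorizes non-Markovian rewards through a reward machine and, in Bucket-QR-MAX, discretizes continuous states with SimHash; the function name and array layout are illustrative.

```python
# R-MAX-style optimistic value iteration from visit counts (illustrative sketch).
import numpy as np

def rmax_value_iteration(counts, trans_counts, reward_sums,
                         m=10, r_max=1.0, gamma=0.95, iters=500):
    """counts: (S, A) visits; trans_counts: (S, A, S) next-state counts;
    reward_sums: (S, A) summed observed rewards."""
    known = counts >= m                                # threshold-based "known" test
    safe = np.maximum(counts, 1)
    P_hat = trans_counts / safe[..., None]             # empirical transition model
    R_hat = reward_sums / safe                         # empirical mean rewards
    V = np.zeros(counts.shape[0])
    for _ in range(iters):
        Q = R_hat + gamma * P_hat @ V                  # (S, A) Bellman backup
        Q = np.where(known, Q, r_max / (1.0 - gamma))  # optimism for unknown pairs
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V                         # greedy policy and state values
```

In the paper's setting, the same kind of optimism is presumably applied while transition learning stays Markovian and task progress is tracked through the reward machine, as the abstract describes.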
Why are we recommending this paper?
Due to your Interest in: Reinforcement Learning
The paper tackles the challenge of non-Markovian environments, a critical consideration for many RL applications. This directly addresses the user's interest in reinforcement learning and decision processes.
EPFL
Abstract
Reinforcement learning (RL) has enabled the training of large language model (LLM) agents to interact with the environment and to solve multi-turn long-horizon tasks. However, RL-trained agents often struggle in tasks that require active exploration and fail to efficiently adapt from trial-and-error experiences. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework to encourage exploration and long-term reward optimization; and (ii) in-context policy adaptation via reflection, allowing the agent to adapt its policy from task feedback signals without gradient updates. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19% performance gains on Sokoban, MineSweeper, and Webshop, respectively. Moreover, LaMer also demonstrates better generalization to more challenging or previously unseen tasks compared to RL-trained agents. Overall, our results demonstrate that Meta-RL provides a principled approach to induce exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.
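As a fully self-contained toy (not LaMer's implementation; the environment, policy, and feedback format are invented stand-ins), the sketch below illustrates component (ii): the agent improves across episodes only by carrying feedback from earlier attempts in its context, with no gradient updates.

```python
# Toy in-context adaptation: a "policy" improves across episodes using only
# textual feedback carried in its context (no weight updates).
HIDDEN = 11                                        # stands in for an unseen task parameter

def run_episode(guess: int) -> tuple[float, str]:
    """Toy environment: reward 1.0 on success, plus a textual hint as feedback."""
    if guess == HIDDEN:
        return 1.0, "correct"
    return 0.0, "too high" if guess > HIDDEN else "too low"

def policy(reflections: list[tuple[int, str]], low: int = 0, high: int = 15) -> int:
    """Stand-in for an LLM conditioning on its reflection history: narrow the search."""
    for past_guess, hint in reflections:
        if hint == "too high":
            high = min(high, past_guess - 1)
        elif hint == "too low":
            low = max(low, past_guess + 1)
    return (low + high) // 2

reflections: list[tuple[int, str]] = []            # carried across episodes
for episode in range(5):
    guess = policy(reflections)
    reward, hint = run_episode(guess)
    print(f"episode {episode}: guess={guess} reward={reward} feedback={hint}")
    if reward == 1.0:
        break
    reflections.append((guess, hint))              # "reflection" appended to the context
```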
Why we are recommending this paper?
Due to your Interest in: Agentic RL
This research focuses on exploration strategies within language agents, a key aspect of effective reinforcement learning. The paper's exploration techniques are highly relevant to the user's interest in improving agentic RL.
HI Iberia
Abstract
The development of nuclear fusion requires materials that can withstand extreme conditions. The IFMIF-DONES facility, a high-power particle accelerator, is being designed to qualify these materials. A critical testbed for its development is the MuVacAS prototype, which replicates the final segment of the accelerator beamline. Precise regulation of argon gas pressure within its ultra-high vacuum chamber is vital for this task. This work presents a fully data-driven approach for autonomous pressure control. A Deep Learning Surrogate Model, trained on real operational data, emulates the dynamics of the argon injection system. This high-fidelity digital twin then serves as a fast-simulation environment to train a Deep Reinforcement Learning agent. The results demonstrate that the agent successfully learns a control policy that maintains gas pressure within strict operational limits despite dynamic disturbances. This approach marks a significant step toward the intelligent, autonomous control systems required for the demanding next-generation particle accelerator facilities.
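A minimal sketch, assuming a gym-style interface, of how a learned surrogate of the argon-injection dynamics could be wrapped as a fast RL environment for pressure control; the dynamics below are a crude placeholder, and all class names, constants, and units are hypothetical rather than taken from the paper.

```python
# Hypothetical gym-style wrapper around a (placeholder) surrogate of the
# argon-injection dynamics, with a reward that penalizes setpoint deviation.
import numpy as np

class PressureEnv:
    """Toy environment: state = chamber pressure, action = valve opening in [0, 1]."""

    def __init__(self, target=1e-3, dt=0.1, leak=0.5, gain=2e-3, noise=1e-5):
        self.target, self.dt = target, dt
        self.leak, self.gain, self.noise = leak, gain, noise
        self.p = target

    def reset(self):
        self.p = self.target * np.random.uniform(0.5, 1.5)
        return np.array([self.p])

    def step(self, action):
        # Placeholder dynamics: the valve injects gas, pumping removes it, plus a
        # small disturbance. A neural surrogate trained on operational data would
        # replace this update in the setting the paper describes.
        inflow = self.gain * float(np.clip(action, 0.0, 1.0))
        self.p += self.dt * (inflow - self.leak * self.p) + self.noise * np.random.randn()
        reward = -abs(self.p - self.target) / self.target      # penalize setpoint deviation
        done = not (0.0 < self.p < 10.0 * self.target)         # outside operational limits
        return np.array([self.p]), reward, done, {}

env = PressureEnv()
obs = env.reset()
for _ in range(5):
    action = 0.25 if obs[0] < env.target else 0.0              # naive stand-in controller
    obs, reward, done, _ = env.step(action)
    print(f"pressure = {obs[0]:.2e}  reward = {reward:+.3f}")
```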
Why are we recommending this paper?
Due to your Interest in: Deep Learning for Reinforcement Learning