Hi!

Your personalized paper recommendations for 12 to 16 January, 2026.
Surge AI
AI Insights
  • Even the best models fail on a substantial portion of tasks, although newer releases show significant improvements over their predecessors. [3]
  • Common-sense reasoning refers to the ability of an agent to understand and apply general knowledge and experience to make decisions and solve problems. [3]
  • Failure modes cluster predictably by capability level, with weaker models struggling on fundamentals and stronger models limited by common-sense reasoning. [2]
Abstract
The advancement of large language model (LLM) based agents has shifted AI evaluation from single-turn response assessment to multi-step task completion in interactive environments. We present an empirical study evaluating frontier AI models on 150 workplace tasks within a realistic e-commerce RL environment from Surge. Our analysis reveals an empirically-derived hierarchy of agentic capabilities that models must master for real-world deployment: (1) tool use, (2) planning and goal formation, (3) adaptability, (4) groundedness, and (5) common-sense reasoning. Even the best-performing models fail approximately 40% of the tasks, with failures clustering predictably along this hierarchy. Weaker models struggle with fundamental tool use and planning, whereas stronger models primarily fail on tasks requiring contextual inference beyond explicit instructions. We introduce a task-centric design methodology for RL environments that emphasizes diversity and domain expert contributions, provide detailed failure analysis, and discuss implications for agent development. Our findings suggest that while current frontier models can demonstrate coherent multi-step behavior, substantial capability gaps remain before achieving human-level task completion in realistic workplace settings.
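The multi-step evaluation setup described in the abstract can be sketched as a simple harness: the agent repeatedly picks tool calls in an interactive environment until the task ends, and each episode is scored pass/fail. Everything below (the environment API, the toy task, the agent) is a hypothetical stand-in to illustrate the evaluation loop, not Surge's actual framework.

```python
def run_episode(agent_policy, env, max_steps=20):
    """Roll out one multi-step task: the agent picks tool calls until done."""
    obs = env.reset()
    for _ in range(max_steps):
        action = agent_policy(obs)
        obs, done, success = env.step(action)
        if done:
            return success
    return False  # ran out of steps: task failed

class ToyLookupEnv:
    """Hypothetical e-commerce-style task: find the cheapest item, then buy it."""
    def __init__(self, prices):
        self.prices = prices
    def reset(self):
        return {"tools": ["list_prices", "buy"], "prices": None}
    def step(self, action):
        if action == ("list_prices",):
            return {"tools": ["buy"], "prices": self.prices}, False, False
        if action[0] == "buy":
            success = action[1] == min(self.prices, key=self.prices.get)
            return {}, True, success
        return {}, True, False  # malformed tool call: immediate failure

def competent_agent(obs):
    if obs["prices"] is None:
        return ("list_prices",)   # step 1: gather information via a tool
    cheapest = min(obs["prices"], key=obs["prices"].get)
    return ("buy", cheapest)      # step 2: act on what was observed

env = ToyLookupEnv({"mug": 7.0, "cap": 5.5, "pen": 1.2})
pass_rate = sum(run_episode(competent_agent, env) for _ in range(10)) / 10
print(pass_rate)  # 1.0
```

Failures at different rungs of the paper's hierarchy show up naturally in such a harness: a malformed tool call fails at rung (1), while buying before listing prices fails at rung (2).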
Why are we recommending this paper?
Due to your Interest in Agentic RL

This paper directly addresses agentic reinforcement learning by evaluating advanced models in realistic, multi-step environments, aligning with your interest in sophisticated agentic capabilities. The focus on e-commerce RL tasks provides a practical context for exploring agentic behaviors.
Renmin University of China
AI Insights
  • The paper proposes Sparse-RL, a method for RL training of large language models on complex reasoning tasks under sparse (KV-cache-compressed) rollouts. [2]
  • KV cache compression: a technique that reduces memory usage in large language models by compressing the stored key-value attention states. [3]
  • Compression can cause information loss and produce anomalous sequences, i.e., rollouts inconsistent with the model's expected behavior. [3]
  • Sparse-RL applies KV cache compression during rollouts to reduce computational costs. [3]
  • Results show that Sparse-RL outperforms other state-of-the-art methods in both accuracy and efficiency. [3]
  • The method is evaluated on several benchmarks, including arithmetic reasoning, algebraic manipulation, and logical deduction. [1]
Abstract
Reinforcement Learning (RL) has become essential for eliciting complex reasoning capabilities in Large Language Models (LLMs). However, the substantial memory overhead of storing Key-Value (KV) caches during long-horizon rollouts acts as a critical bottleneck, often prohibiting efficient training on limited hardware. While existing KV compression techniques offer a remedy for inference, directly applying them to RL training induces a severe policy mismatch, leading to catastrophic performance collapse. To address this, we introduce Sparse-RL, a framework that enables stable RL training under sparse rollouts. We show that instability arises from a fundamental policy mismatch among the dense old policy, the sparse sampler policy, and the learner policy. To mitigate this issue, Sparse-RL incorporates Sparsity-Aware Rejection Sampling and Importance-based Reweighting to correct the off-policy bias introduced by compression-induced information loss. Experimental results show that Sparse-RL reduces rollout overhead compared to dense baselines while preserving performance. Furthermore, Sparse-RL inherently implements sparsity-aware training, significantly enhancing model robustness during sparse inference deployment.
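The off-policy correction described in the abstract can be illustrated with the standard importance-weighting idea: when rollouts come from a sparse (KV-compressed) sampler policy rather than the learner policy, each token's loss is reweighted by the learner/sampler probability ratio, and sequences whose overall ratio drifts too far are rejected. This is a generic sketch of importance-based reweighting with rejection sampling, not the authors' actual Sparse-RL implementation; the clipping and rejection thresholds and the log-probabilities are made up.

```python
import math

def sparse_rl_weights(logp_learner, logp_sampler, clip=2.0, reject_ratio=8.0):
    """Per-token importance weights for sequences sampled under a sparse policy.

    logp_learner / logp_sampler: per-token log-probs of the same sampled tokens
    under the learner policy and the (KV-compressed) sampler policy.
    Returns (weights, accepted): clipped per-token ratios, plus whether the
    whole sequence passes rejection sampling.
    """
    ratios = [math.exp(l - s) for l, s in zip(logp_learner, logp_sampler)]
    # Reject sequences whose overall mismatch is too large (anomalous rollouts).
    seq_ratio = math.exp(sum(logp_learner) - sum(logp_sampler))
    accepted = (1.0 / reject_ratio) <= seq_ratio <= reject_ratio
    # Clip per-token ratios to bound the variance of the reweighted gradient.
    weights = [min(r, clip) for r in ratios]
    return weights, accepted

# A well-matched rollout is kept; a badly mismatched one is rejected.
w, ok = sparse_rl_weights([-1.0, -0.5], [-1.1, -0.6])
print(ok)   # True
w2, ok2 = sparse_rl_weights([-1.0, -0.5], [-4.0, -3.5])
print(ok2)  # False
```

The accepted sequences' weights would then multiply the per-token policy-gradient loss, so that gradients are taken as if the data had come from the learner policy.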
Why are we recommending this paper?
Due to your Interest in Reinforcement Learning

Given your interest in deep learning for reinforcement learning, this paper tackles a significant bottleneck, the memory overhead of rollouts, that often limits RL training of LLMs, offering a potential solution for scaling up your work.
Vrije Universiteit Brussel
Paper visualization
AI Insights
  • The paper discusses the use of Deep Reinforcement Learning (DRL) to solve a stochastic multiplicative growth model, which is a non-ergodic environment. [3]
  • The authors propose an Actor-Critic architecture to learn a policy that generalizes across all probabilities p of gaining from the portfolio. [3]
  • They show that the actor-critic model can approximate the full functional form of the optimal policy when trained in a path-dependent setting. [3]
  • Actor-Critic architecture: A type of deep reinforcement learning algorithm that combines the benefits of actor and critic methods. [3]
  • The actor learns a policy to select actions, while the critic learns a value function to estimate the expected return. [3]
  • The Actor-Critic architecture is shown to be a suitable approach for learning policies in such environments. [3]
  • However, the authors note that this global policy-learning task is harder to stabilize, requiring more training episodes and longer game lengths to converge, and they identify it as a direction for further research. [2]
Abstract
Reinforcement Learning (RL) remains a central optimisation framework in machine learning. Although RL agents can converge to optimal solutions, the definition of "optimality" depends on the environment's statistical properties. The Bellman equation, central to most RL algorithms, is formulated in terms of expected values of future rewards. However, when ergodicity is broken, long-term outcomes depend on the specific trajectory rather than on the ensemble average. In such settings, the ensemble average diverges from the time-average growth experienced by individual agents, with expected-value formulations yielding systematically suboptimal policies. Prior studies demonstrated that traditional RL architectures fail to recover the true optimum in non-ergodic environments. We extend this analysis to deep RL implementations and show that these, too, produce suboptimal policies under non-ergodic dynamics. Introducing explicit time dependence into the learning process can correct this limitation. By allowing the network's function approximation to incorporate temporal information, the agent can estimate value functions consistent with the process's intrinsic growth rate. This improvement does not require altering the environmental feedback, such as reward transformations or modified objective functions, but arises naturally from the agent's exposure to temporal trajectories. Our results contribute to the growing body of research on reinforcement learning methods for non-ergodic systems.
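The ensemble/time-average divergence at the heart of this abstract can be reproduced with the classic multiplicative coin-toss game (the multipliers below are illustrative, not necessarily the paper's exact setup): expected wealth grows every step, yet almost every individual trajectory shrinks, so an agent optimizing the ensemble average is misled.

```python
import math, random

def multiplicative_game(up=1.5, down=0.6, p=0.5, steps=1000, seed=0):
    """One trajectory of a multiplicative coin-toss game; returns log-wealth."""
    rng = random.Random(seed)
    log_wealth = 0.0
    for _ in range(steps):
        log_wealth += math.log(up if rng.random() < p else down)
    return log_wealth

# Ensemble average grows: E[factor] = 0.5*1.5 + 0.5*0.6 = 1.05 > 1.
ensemble_growth = math.log(0.5 * 1.5 + 0.5 * 0.6)
# Time-average growth rate is negative: a typical trajectory decays.
time_growth = 0.5 * math.log(1.5) + 0.5 * math.log(0.6)
print(ensemble_growth > 0, time_growth < 0)  # True True

# Simulated trajectories track the time average, not the ensemble average.
avg_rate = sum(multiplicative_game(seed=s) for s in range(50)) / (50 * 1000)
print(avg_rate)
```

An expected-value Bellman backup would see the 1.05 ensemble factor and keep playing; a time-aware agent sees the negative per-step log growth that individual trajectories actually experience.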
Why are we recommending this paper?
Due to your Interest in Deep Learning for Reinforcement Learning

This paper explores fundamental challenges in reinforcement learning, specifically addressing non-ergodic environments, which are crucial for understanding and optimizing RL algorithms – a core area of your interests.
Luleå University of Technology
Paper visualization
AI Insights
  • Graph-based policy: A type of policy that uses a graph structure to represent the environment and make decisions. [3]
  • Multi-agent reinforcement learning: A type of reinforcement learning where multiple agents learn together to achieve a common goal. [3]
  • The framework demonstrates stable learning, effective cooperation, and balanced workload distribution across agents under suitable reward shaping. [3]
  • Future work will improve performance in target acquisition, investigate the framework's efficacy with larger teams, and pursue real-world deployment of the learned policies. [3]
  • The ablation studies considered only three reward terms, so it is unclear how additional reward terms would affect the framework's performance. [3]
  • The paper presents a multi-agent reinforcement learning framework for target acquisition in unknown environments. [2]
  • Safety filter: A mechanism that ensures safe behavior by filtering out actions that could lead to collisions or other unsafe outcomes. [1]
Abstract
This paper introduces a decentralized multi-agent reinforcement learning framework enabling structurally heterogeneous teams of agents to jointly discover and acquire randomly located targets in environments characterized by partial observability, communication constraints, and dynamic interactions. Each agent's policy is trained with the Multi-Agent Proximal Policy Optimization algorithm and employs a Graph Attention Network encoder that integrates simulated range-sensing data with communication embeddings exchanged among neighboring agents, enabling context-aware decision-making from both local sensing and relational information. In particular, this work introduces a unified framework that integrates graph-based communication and trajectory-aware safety through safety filters. The architecture is supported by a structured reward formulation designed to encourage effective target discovery and acquisition, collision avoidance, and de-correlation between the agents' communication vectors by promoting informational orthogonality. The effectiveness of the proposed reward function is demonstrated through a comprehensive ablation study. Moreover, simulation results demonstrate safe and stable task execution, confirming the framework's effectiveness.
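The Graph Attention Network encoder mentioned in the abstract can be illustrated in miniature: each agent scores its neighbors' communication embeddings, normalizes the scores with a softmax, and aggregates an attention-weighted sum. This is a generic single-head GAT aggregation sketch (the dimensions, weights, and LeakyReLU slope are illustrative), not the authors' architecture.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_aggregate(h_self, h_neighbors, w, a):
    """One graph-attention aggregation step (single head, no bias).

    h_self: this agent's feature vector; h_neighbors: embeddings received
    from communicating neighbors; w: shared weight matrix (list of rows);
    a: attention vector over the concatenated pair [W h_i || W h_j].
    """
    def matvec(m, v):
        return [sum(mi * vi for mi, vi in zip(row, v)) for row in m]
    z_self = matvec(w, h_self)
    scores, zs = [], []
    for h_j in h_neighbors:
        z_j = matvec(w, h_j)
        zs.append(z_j)
        scores.append(leaky_relu(sum(ai * xi for ai, xi in zip(a, z_self + z_j))))
    # Softmax over neighbors -> attention coefficients.
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    alpha = [e / sum(exps) for e in exps]
    # Context-aware embedding: attention-weighted sum of neighbor messages.
    return [sum(al * z[k] for al, z in zip(alpha, zs)) for k in range(len(z_self))]

# Two neighbors, 2-d features; identity weights, uniform attention vector.
out = gat_aggregate([1.0, 0.0], [[0.0, 1.0], [1.0, 1.0]],
                    w=[[1.0, 0.0], [0.0, 1.0]], a=[0.25, 0.25, 0.25, 0.25])
print(out)
```

In the decentralized setting described above, each agent would run this aggregation locally over whichever neighbors are within communication range, then feed the result to its MAPPO policy head.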
Why are we recommending this paper?
Due to your Interest in Agentic RL

The paper’s focus on decentralized multi-agent reinforcement learning with communication regularization aligns well with your interest in agentic RL, particularly the coordination aspects of complex systems.
Technion Israel Institute of Technology
AI Insights
  • The paper addresses reinforcement learning with multi-step lookahead information through adaptive batching policies (ABPs), which adjust their batch lengths based on the agent's current state and the available information. [3]
  • The problem is formalized as finding an ABP that maximizes cumulative reward over all batches, subject to constraints on the batch lengths and the available lookahead information. [2]
  • Lookahead information: the rewards and transitions along all possible trajectories/policies; it is generated by sampling from the corresponding distribution and then pruning information on unreachable states, so that it matches what is observed at the beginning of each batch. [3]
  • The analysis restricts attention to the adaptive batching policy space, where lookahead information is observed only at the beginning of a batch, after the new batch length is chosen; no additional lookahead is used mid-batch. [3]
  • Regret bound: a measure of the gap between the expected cumulative reward of an ABP and that of an optimal policy; the paper derives such bounds for ABPs. [3]
Abstract
We study tabular reinforcement learning problems with multiple steps of lookahead information. Before acting, the learner observes $\ell$ steps of future transition and reward realizations: the exact state the agent would reach and the rewards it would collect under any possible course of action. While it has been shown that such information can drastically boost the value, finding the optimal policy is NP-hard, and it is common to apply one of two tractable heuristics: processing the lookahead in chunks of predefined sizes ('fixed batching policies'), and model predictive control. We first illustrate the problems with these two approaches and propose utilizing the lookahead in adaptive (state-dependent) batches; we refer to such policies as adaptive batching policies (ABPs). We derive the optimal Bellman equations for these strategies and design an optimistic regret-minimizing algorithm that enables learning the optimal ABP when interacting with unknown environments. Our regret bounds are order-optimal up to a potential factor of the lookahead horizon $\ell$, which can usually be considered a small constant.
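The "chunked" use of lookahead can be made concrete: within one batch, the learner exhaustively scores every action sequence against the revealed rewards, so the planning cost grows as n_actions ** ell, which is one reason the lookahead is processed in bounded batches at all. The reward function below is a made-up toy; an adaptive batching policy would additionally make the chosen batch length state-dependent rather than fixed.

```python
import itertools

def plan_batch(reward_fn, n_actions, ell):
    """With exact ell-step lookahead, exhaustively score every action
    sequence of length ell and return the best one with its total reward.
    reward_fn(seq) gives the exact revealed reward of a sequence."""
    best = max(itertools.product(range(n_actions), repeat=ell), key=reward_fn)
    # Cost of this search: n_actions ** ell sequences scored per batch.
    return best, reward_fn(best)

# Hypothetical revealed rewards: action 1 pays off only if repeated for the
# whole batch, which a one-step-greedy learner without lookahead would miss.
def revealed(seq):
    if all(a == 1 for a in seq):
        return 2 * len(seq)
    return sum(1 - a for a in seq)

seq, value = plan_batch(revealed, n_actions=2, ell=3)
print(seq, value)  # (1, 1, 1) 6
```

A fixed batching policy would call `plan_batch` with the same `ell` every time; the paper's ABPs instead choose the batch length before each call, trading off deeper commitment against observing fresh lookahead sooner.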
Why are we recommending this paper?
Due to your Interest in Reinforcement Learning

This research investigates multi-step lookahead information, a technique directly relevant to improving the efficiency and effectiveness of reinforcement learning algorithms, aligning with your interest in deep learning for RL.