Hi!

Your personalized paper recommendations for 02 to 06 February 2026.
Tencent
AI Insights
  • However, these algorithms may struggle with long-horizon tasks due to the curse of dimensionality and the need for large amounts of data. (ML: 0.95)
  • Policy Gradient Methods: A class of reinforcement learning algorithms that use the policy gradient theorem to optimize an agent's policy. (ML: 0.91)
  • The paper assumes that the reward function is known and does not consider the case where it is unknown or partially observable. (ML: 0.87)
  • This method is designed for long-horizon tasks, where traditional policy gradient methods may struggle. (ML: 0.86)
  • The policy gradient theorem states that the gradient of the expected return with respect to the policy parameters equals the expected value of the product of the advantage function and the gradient of the log-policy. (ML: 0.86)
  • Traditional reinforcement learning algorithms can struggle with these types of games because they do not take into account the long-term consequences of each move. (ML: 0.86)
  • Monte Carlo Tree Search (MCTS): A tree search algorithm that uses Monte Carlo methods to estimate the value of nodes in the search tree. (ML: 0.86)
  • Imagine you're playing a game like 2048 or Sokoban, where you have to make a series of moves to achieve your goal. (ML: 0.86)
  • Previous work on reinforcement learning has focused on developing algorithms for short-horizon tasks, such as Q-learning and policy gradient methods. (ML: 0.84)
  • The paper presents a novel approach called Monte Carlo Policy Optimization (MC-PPO) that combines policy optimization with Monte Carlo tree search to better handle these types of games. (ML: 0.83)
  • MCTS is often used in games such as Go, Poker, and Chess. (ML: 0.82)
  • The results show that MC-PPO outperforms the baseline methods on both 2048 and Sokoban variants, demonstrating its superior robustness and generalization. (ML: 0.57)
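The policy-gradient statement in the bullets above can be written compactly; in standard notation the theorem reads

```latex
\nabla_\theta J(\theta)
  \;=\; \mathbb{E}_{s,a \sim \pi_\theta}\!\left[\, A^{\pi_\theta}(s,a)\,\nabla_\theta \log \pi_\theta(a \mid s) \,\right],
```

where $A^{\pi_\theta}$ is the advantage function and the expectation is over states and actions sampled from the current policy $\pi_\theta$.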
Abstract
Existing Large Language Model (LLM) agents struggle in interactive environments requiring long-horizon planning, primarily due to compounding errors when simulating future states. To address this, we propose ProAct, a framework that enables agents to internalize accurate lookahead reasoning through a two-stage training paradigm. First, we introduce Grounded LookAhead Distillation (GLAD), where the agent undergoes supervised fine-tuning on trajectories derived from environment-based search. By compressing complex search trees into concise, causal reasoning chains, the agent learns the logic of foresight without the computational overhead of inference-time search. Second, to further refine decision accuracy, we propose the Monte-Carlo Critic (MC-Critic), a plug-and-play auxiliary value estimator designed to enhance policy-gradient algorithms like PPO and GRPO. By leveraging lightweight environment rollouts to calibrate value estimates, MC-Critic provides a low-variance signal that facilitates stable policy optimization without relying on expensive model-based value approximation. Experiments on both stochastic (e.g., 2048) and deterministic (e.g., Sokoban) environments demonstrate that ProAct significantly improves planning accuracy. Notably, a 4B parameter model trained with ProAct outperforms all open-source baselines and rivals state-of-the-art closed-source models, while demonstrating robust generalization to unseen environments. The codes and models are available at https://github.com/GreatX3/ProAct
Why are we recommending this paper?
Due to your Interest in Agentic RL

This paper directly addresses the challenges of long-horizon planning in interactive environments, a core area of interest within agentic reinforcement learning. The ProAct framework offers a solution to compounding errors, aligning with your focus on robust agent design.
Kyushu University
Paper visualization
AI Insights
  • Agentic-PRs are rejected more often than Human-PRs. (ML: 0.97)
  • Most rejected PRs lack clear reviewer feedback.
  • Simple heuristics can filter out Unknown PRs and retain Non-Unknown PRs.
  • Agentic-PR: a pull request submitted by an autonomous coding agent.
  • Human-PR: a pull request submitted by a human developer.
  • AIDev dataset: a collection of GitHub pull requests involving agentic coding agents.
  • The study highlights the challenges faced by agentic coding agents in software development.
  • Future work should investigate implicit signals that may shape rejection decisions.
  • The study relies on manual coding of PR discussions.
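The "simple heuristics" insight can be illustrated with a toy filter. This is a hypothetical sketch, not the authors' actual heuristics; the field names (`review_comments`, `closing_comment`) are illustrative and do not reflect the AIDev schema:

```python
# Hypothetical sketch: flag a rejected PR as "Unknown" (rejection reason
# undeterminable) when it carries no explicit reviewer feedback, and retain
# the rest for analysis. Field names here are assumptions for illustration.
def is_unknown_rejection(pr: dict) -> bool:
    """Return True if a rejected PR has no recoverable rejection signal."""
    has_review_comments = bool(pr.get("review_comments"))
    has_close_comment = bool(pr.get("closing_comment"))
    return not (has_review_comments or has_close_comment)

rejected_prs = [
    {"id": 1, "review_comments": ["tests fail"], "closing_comment": ""},
    {"id": 2, "review_comments": [], "closing_comment": ""},
]
retained = [pr for pr in rejected_prs if not is_unknown_rejection(pr)]
print([pr["id"] for pr in retained])  # -> [1]
```
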
Abstract
Agentic coding -- software development workflows in which autonomous coding agents plan, implement, and submit code changes with minimal human involvement -- is rapidly gaining traction. Prior work has shown that Pull Requests (PRs) produced using coding agents (Agentic-PRs) are accepted less often than PRs that are not labeled as agentic (Human-PRs). The rejection reasons for a single agent (Claude Code) have been explored, but a comparison of how rejection reasons differ between Agentic-PRs generated by different agents has not yet been performed. This comparison is important since different coding agents are often used for different purposes, which can lead to agent-specific failure patterns. In this paper, we inspect 654 rejected PRs from the AIDev dataset covering five coding agents, as well as a human baseline. Our results show that seven rejection modes occur only in Agentic-PRs, including distrust of AI-generated code. We also observe agent-specific patterns (e.g., automated withdrawal of inactive PRs by Devin), reflecting differences in how agents are configured and used in practice. Notably, a large proportion of rejected PRs (67.9%) lack explicit reviewer feedback, making their rejection reasons difficult to determine. To mitigate this issue, we propose a set of heuristics that reduce the proportion of such cases, offering a practical preprocessing step for future studies of PR rejection in agentic coding.
Why are we recommending this paper?
Due to your Interest in Agentic RL

This research investigates Agentic-PRs, a key application of autonomous coding agents, and their acceptance rates – a critical factor in successful agentic workflows. Understanding the reasons for rejection is vital for improving agentic development strategies.
Princeton University
AI Insights
  • However, there are still challenges to be addressed, such as improving sample efficiency, generalizing to new environments, and scaling up to complex tasks. (ML: 0.98)
  • Generalization: RL models may not generalize well to new environments or tasks, requiring retraining from scratch. (ML: 0.98)
  • Sample efficiency: RL methods often require large amounts of data to learn effective policies, which can be time-consuming and expensive. (ML: 0.97)
  • Imagine you're trying to learn how to play a video game. (ML: 0.96)
  • Reinforcement learning (RL) is a subfield of machine learning that involves training agents to take actions in an environment to maximize a reward signal. (ML: 0.96)
  • That's basically what reinforcement learning is – it's a way for computers to learn from experience and get better at doing things. (ML: 0.96)
  • Scalability: As tasks become more complex, RL methods can struggle to scale up, leading to increased computational costs. (ML: 0.95)
  • As you play more, you start to notice patterns and figure out which actions lead to the best rewards (like getting extra lives or points). (ML: 0.94)
  • RL has been successful in various applications, including robotics, game playing, and autonomous driving. (ML: 0.91)
  • Recent papers have explored various aspects of RL, including model-based and model-free methods, offline RL, and transfer learning. (ML: 0.91)
  • Some notable results include the development of new algorithms for improving sample efficiency and generalization, as well as applications in robotics, game playing, and autonomous driving. (ML: 0.89)
  • Model-based RL: A type of RL where the agent learns a model of the environment and uses it to plan its actions. (ML: 0.89)
  • Model-free RL: A type of RL where the agent learns directly from experience without learning a model of the environment. (ML: 0.87)
  • You start by making random moves and seeing what happens. (ML: 0.83)
  • RL has made significant progress in recent years, with many state-of-the-art results achieved through model-based and model-free methods. (ML: 0.65)
Abstract
How does the amount of compute available to a reinforcement learning (RL) policy affect its learning? Can policies using a fixed number of parameters still benefit from additional compute? The standard RL framework does not provide a language to answer these questions formally. Empirically, deep RL policies are often parameterized as neural networks with static architectures, conflating the amount of compute and the number of parameters. In this paper, we formalize compute-bounded policies and prove that policies which use more compute can solve problems and generalize to longer-horizon tasks that are outside the scope of policies with less compute. Building on prior work in algorithmic learning and model-free planning, we propose a minimal architecture that can use a variable amount of compute. Our experiments complement our theory. On a set of 31 different tasks spanning online and offline RL, we show that $(1)$ this architecture achieves stronger performance simply by using more compute, and $(2)$ stronger generalization on longer-horizon test tasks compared to standard feedforward networks or deep residual networks using up to 5 times more parameters.
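The core architectural idea (a fixed-parameter policy that can spend a variable amount of compute) can be sketched minimally. This is not the paper's architecture, just an assumed weight-tied residual block applied a variable number of times, so iteration count controls compute while the parameter count stays constant:

```python
import numpy as np

# Minimal sketch (an assumption, not the paper's model): one shared
# parameter block applied k times. k sets the compute budget; the number
# of parameters (W, b) is fixed regardless of k.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(16, 16))  # the single shared weight matrix
b = np.zeros(16)

def forward(x: np.ndarray, k: int) -> np.ndarray:
    """Apply the same weight-tied residual update k times."""
    h = x
    for _ in range(k):
        h = h + np.tanh(h @ W + b)  # more iterations = more compute
    return h

x = rng.normal(size=16)
out_fast = forward(x, k=1)   # low-compute evaluation
out_slow = forward(x, k=10)  # same parameters, 10x the compute
print(out_fast.shape, out_slow.shape)
```

A recurrent depth knob like `k` is what lets a single trained network trade inference time for performance, which is the behavior the abstract studies.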
Why are we recommending this paper?
Due to your Interest in Reinforcement Learning

This paper tackles fundamental questions about the relationship between compute and reinforcement learning policy performance. Exploring the impact of increased compute resources is highly relevant to optimizing RL algorithms.
Hunan University
Paper visualization
AI Insights
  • The method may not perform well on problems with high-dimensional spaces or large numbers of variables. (ML: 0.94)
  • The authors propose a novel approach that combines the strengths of HC and GH to solve complex optimization problems efficiently. (ML: 0.86)
  • The paper presents a framework for solving optimization problems using homotopy continuation (HC) and Gaussian homotopy (GH), two powerful methods for finding global optima. (ML: 0.80)
  • The results show that the proposed method achieves state-of-the-art performance on these challenging problems, outperforming existing methods in both efficiency and accuracy. (ML: 0.79)
  • Gaussian homotopy (GH): a method for finding global optima that uses a Gaussian kernel to smooth the objective function and then applies HC to the smoothed problem. (ML: 0.78)
  • The proposed method requires careful tuning of hyperparameters, which can be time-consuming and challenging. (ML: 0.78)
  • Homotopy continuation (HC): a method for finding roots of systems of equations by tracing the solution path as a parameter is varied. (ML: 0.71)
  • The proposed method is demonstrated on three challenging problems: robust optimization via GNC, global optimization via GH, and polynomial root-finding via HC. (ML: 0.69)
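The Gaussian-homotopy idea described in the bullets can be illustrated on a toy 1-D problem. This is an assumed sketch, not the paper's NPC solver: smooth a multimodal objective with a Gaussian of scale `sigma`, descend on the smoothed surrogate, then shrink `sigma` and continue from the previous minimizer:

```python
import numpy as np

# Toy Gaussian-homotopy sketch (illustrative assumption, not the paper's
# method). Large sigma washes out local minima; annealing sigma traces a
# path from the easy smoothed problem back to the original one.
def f(x):
    return x ** 2 + 2.0 * np.sin(5.0 * x)  # multimodal; global min near -0.30

def smoothed_grad(x, sigma, rng, n=2000):
    # Monte-Carlo estimate of d/dx E[f(x + sigma*u)], u ~ N(0, 1)
    u = rng.normal(size=n)
    return np.mean(f(x + sigma * u) * u) / sigma

rng = np.random.default_rng(0)
x = 3.0  # deliberately start far from the global basin
for sigma in (2.0, 1.0, 0.5, 0.1):  # continuation schedule: coarse to fine
    for _ in range(200):
        x -= 0.02 * smoothed_grad(x, sigma, rng)
print("final x:", x)
```

Plain gradient descent started at `x = 3.0` would stall in a local minimum of `f`; the annealed smoothing path steers the iterate into the global basin near `x ≈ -0.30`.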
Abstract
The Homotopy paradigm, a general principle for solving challenging problems, appears across diverse domains such as robust optimization, global optimization, polynomial root-finding, and sampling. Practical solvers for these problems typically follow a predictor-corrector (PC) structure, but rely on hand-crafted heuristics for step sizes and iteration termination, which are often suboptimal and task-specific. To address this, we unify these problems under a single framework, which enables the design of a general neural solver. Building on this unified view, we propose Neural Predictor-Corrector (NPC), which replaces hand-crafted heuristics with automatically learned policies. NPC formulates policy selection as a sequential decision-making problem and leverages reinforcement learning to automatically discover efficient strategies. To further enhance generalization, we introduce an amortized training mechanism, enabling one-time offline training for a class of problems and efficient online inference on new instances. Experiments on four representative homotopy problems demonstrate that our method generalizes effectively to unseen instances. It consistently outperforms classical and specialized baselines in efficiency while demonstrating superior stability across tasks, highlighting the value of unifying homotopy methods into a single neural framework.
Why are we recommending this paper?
Due to your Interest in Reinforcement Learning

The paper's focus on the Predictor-Corrector framework, a common approach in optimization, aligns with your interest in robust and efficient learning methods. Applying RL to this established paradigm is a promising direction.
ShanghaiTech University
AI Insights
  • The paper also discusses the use of sample quantiles as a way to estimate the true quantile of a distribution. (ML: 0.96)
  • Quantile loss function: A measure of the difference between an estimate and the true quantile of a distribution. (ML: 0.94)
  • The authors show that the sample quantile is a minimizer of the quantile loss function and that it converges to the true quantile as the number of samples increases. (ML: 0.93)
  • Sample quantile: An estimator of the true quantile of a distribution, obtained by minimizing the quantile loss function. (ML: 0.92)
  • They show that the anchor is a consistent estimator of the true quantile and that it has good performance in practice. (ML: 0.86)
  • Integral-Consistent Discretization: A discretization scheme that eliminates the endpoint bias present in standard Euler discretization. (ML: 0.81)
  • The paper presents a quantitative analysis of the endpoint bias present in standard Euler discretization under different schedules and numbers of steps. (ML: 0.80)
  • Endpoint bias: The systematic discrepancy introduced by standard Euler discretization when approximating the continuous integral of a non-linear drift schedule. (ML: 0.79)
  • The results show that the Euler method requires a large number of steps to suppress bias to negligible levels, while the proposed Integral-Consistent Discretization mathematically eliminates this error. (ML: 0.75)
  • The paper proposes the use of a Generalized Ornstein-Uhlenbeck (GOU) bridge for continuous-time reinforcement learning. (ML: 0.71)
  • Generalized Ornstein-Uhlenbeck (GOU) bridge: A stochastic process used to model the behavior of an agent in a continuous-time environment. (ML: 0.62)
  • This scheme is based on the properties of the GOU bridge and allows for more accurate modeling of the agent's behavior. (ML: 0.58)
  • The authors provide a detailed analysis of the anchor proposed in the main text, focusing on its formal definition and asymptotic properties. (ML: 0.48)
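The sample-quantile claim in the bullets is easy to verify numerically: the pinball (quantile) loss at level tau is minimized by the empirical tau-quantile of the sample. A minimal check:

```python
import numpy as np

# The pinball loss L_tau(q) = mean((tau - 1{x < q}) * (x - q)).
# Its minimizer over q is the empirical tau-quantile of the sample,
# which is the property the insights above refer to.
def pinball_loss(q: float, x: np.ndarray, tau: float) -> float:
    diff = x - q
    return float(np.mean(np.where(diff >= 0, tau * diff, (tau - 1.0) * diff)))

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
tau = 0.9

# Brute-force minimize over a fine grid and compare with np.quantile.
grid = np.linspace(-3.0, 3.0, 2001)
losses = [pinball_loss(q, x, tau) for q in grid]
q_star = grid[int(np.argmin(losses))]
print(q_star, np.quantile(x, tau))  # the two values nearly coincide
```
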
Abstract
Recent advances in diffusion-based reinforcement learning (RL) methods have demonstrated promising results in a wide range of continuous control tasks. However, existing works in this field focus on the application of diffusion policies while leaving the diffusion critics unexplored. In fact, since policy optimization fundamentally relies on the critic, accurate value estimation is far more important than policy expressiveness. Furthermore, given the stochasticity of most reinforcement learning tasks, it has been confirmed that the critic is more appropriately depicted with a distributional model. Motivated by these points, we propose a novel distributional RL method with Diffusion Bridge Critics (DBC). DBC directly models the inverse cumulative distribution function (CDF) of the Q value. This allows us to accurately capture the value distribution and prevents it from collapsing into a trivial Gaussian distribution owing to the strong distribution-matching capability of the diffusion bridge. Moreover, we further derive an analytic integral formula to address discretization errors in DBC, which is essential in value estimation. To our knowledge, DBC is the first work to employ the diffusion bridge model as the critic. Notably, DBC is also a plug-and-play component and can be integrated into most existing RL frameworks. Experimental results on MuJoCo robot control benchmarks demonstrate the superiority of DBC compared with previous distributional critic models.
Why are we recommending this paper?
Due to your Interest in Deep Learning for Reinforcement Learning

This work explores diffusion-based reinforcement learning, a rapidly developing area with potential for continuous control tasks – a key area of interest within deep reinforcement learning.
Lexsi Labs
AI Insights
  • Policy optimization: The process of finding a policy that maximizes the expected return or reward. (ML: 0.94)
  • Bregman divergence: A measure of the difference between two probability distributions, used for regularization in policy optimization. (ML: 0.89)
  • The authors demonstrate that their method, GBMPO, outperforms state-of-the-art methods on code generation and mathematical reasoning tasks. (ML: 0.89)
  • GBMPO variants show improved performance over the base model on both benchmarks, with pass@1 gains of 1.5-2.2 points on MBPP and 1.6-2.0 points on MBPP+. The gap between pass@1 and pass@10 reveals the deterministic nature of the learned policies discussed in the main results. (ML: 0.83)
  • The paper presents a novel approach to policy optimization using Bregman divergences, which allows for more flexible regularization of the policy. (ML: 0.78)
  • GBMPO (Group-Based Mirror Policy Optimization): A novel approach to policy optimization using Bregman divergences, which allows for more flexible regularization of the policy. (ML: 0.69)
  • GRPO shows a large gap (5.0 points on MBPP, 3.7 on MBPP+), indicating substantial benefit from sampling multiple solutions. (ML: 0.64)
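The Bregman-divergence family mentioned above contains both regularizers contrasted in the abstract. With generator phi, D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>; negative entropy recovers KL(p||q) on probability vectors, and the squared norm recovers squared L2. A quick numerical check:

```python
import numpy as np

# Bregman divergence D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>.
def bregman(phi, grad_phi, p, q):
    return phi(p) - phi(q) - float(np.dot(grad_phi(q), p - q))

neg_entropy = lambda p: float(np.sum(p * np.log(p)))   # phi for KL
neg_entropy_grad = lambda p: np.log(p) + 1.0
sq_norm = lambda p: 0.5 * float(np.dot(p, p))          # phi for squared L2
sq_norm_grad = lambda p: p

# p and q must both sum to 1 for the KL identity to hold exactly.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

kl_direct = float(np.sum(p * np.log(p / q)))
print(bregman(neg_entropy, neg_entropy_grad, p, q), kl_direct)          # equal
print(bregman(sq_norm, sq_norm_grad, p, q), 0.5 * np.sum((p - q) ** 2))  # equal
```

This is the sense in which the abstract's "L2 in probability space" and KL regularizers are instances of one divergence family indexed by the mirror map phi.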
Abstract
Policy optimization methods like Group Relative Policy Optimization (GRPO) and its variants have achieved strong results on mathematical reasoning and code generation tasks. Despite extensive exploration of reward processing strategies and training dynamics, all existing group-based methods exclusively use KL divergence for policy regularization, leaving the choice of divergence function unexplored. We introduce Group-Based Mirror Policy Optimization (GBMPO), a framework that extends group-based policy optimization to flexible Bregman divergences, including hand-designed alternatives (L2 in probability space) and learned neural mirror maps. On GSM8K mathematical reasoning, hand-designed ProbL2-GRPO achieves 86.7% accuracy, improving +5.5 points over the Dr. GRPO baseline. On MBPP code generation, neural mirror maps reach 60.1-60.8% pass@1, with random initialization already capturing most of the benefit. While evolutionary strategies meta-learning provides marginal accuracy improvements, its primary value lies in variance reduction ($\pm$0.2 versus $\pm$0.6) and efficiency gains (15% shorter responses on MBPP), suggesting that random initialization of neural mirror maps is sufficient for most practical applications. These results establish divergence choice as a critical, previously unexplored design dimension in group-based policy optimization for LLM reasoning.
Why are we recommending this paper?
Due to your Interest in Deep Learning for Reinforcement Learning