Hi!

Your personalized paper recommendations for 26–30 January 2026.
LianjiaTech
Rate paper: 👍 👎 ♥ Save
Paper visualization
Rate image: 👍 👎
AI Insights
  • ASTRA's ability to learn effective tool discrimination is crucial for its success, and the model's design encourages this through irrelevant-tool mixing during reinforcement learning. (ML: 0.95)👍👎
  • ASTRA: A fully automated end-to-end framework that combines scalable data synthesis with verifiable reinforcement learning to improve tool-use behavior in language-model agents. (ML: 0.93)👍👎
  • τ2-Bench: A benchmark for evaluating agentic tool-use behavior in dialogue systems. (ML: 0.92)👍👎
  • ACEBench: A benchmark for evaluating agentic tool-use behavior in dialogue systems. (ML: 0.92)👍👎
  • BFCL-MT: A benchmark for evaluating agentic tool-use behavior in dialogue systems. (ML: 0.91)👍👎
  • RL: Reinforcement Learning, the training objective used to fine-tune ASTRA after SFT. (ML: 0.89)👍👎
  • The model's performance on agentic benchmarks such as BFCL-MT, τ2-Bench, and ACEBench is competitive with higher-parameter open-source and closed-source models. (ML: 0.87)👍👎
  • SFT: Supervised Fine-Tuning, the initial training stage used to instill broad tool-use competence before reinforcement learning. (ML: 0.87)👍👎
  • The F1-style trajectory reward used in ASTRA's training process jointly optimizes task completion and interaction efficiency, leading to robust multi-step performance. (ML: 0.87)👍👎
Abstract
Large language models (LLMs) are increasingly used as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either supervised fine-tuning (SFT) or reinforcement learning (RL), and struggle with stable long-horizon, multi-turn learning. To address these challenges, we introduce ASTRA, a fully automated end-to-end framework for training tool-augmented language model agents via scalable data synthesis and verifiable reinforcement learning. ASTRA integrates two complementary components. First, a pipeline that leverages the static topology of tool-call graphs synthesizes diverse, structurally grounded trajectories, instilling broad and transferable tool-use competence. Second, an environment synthesis framework that captures the rich, compositional topology of human semantic reasoning converts decomposed question-answer traces into independent, code-executable, and rule-verifiable environments, enabling deterministic multi-turn RL. Based on this method, we develop a unified training methodology that integrates SFT with online RL using trajectory-level rewards to balance task completion and interaction efficiency. Experiments on multiple agentic tool-use benchmarks demonstrate that ASTRA-trained models achieve state-of-the-art performance at comparable scales, approaching closed-source systems while preserving core reasoning ability. We release the full pipelines, environments, and trained models at https://github.com/LianjiaTech/astra.
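The abstract describes an F1-style trajectory reward that jointly optimizes task completion and interaction efficiency. A minimal sketch of that idea follows; the paper does not give the formula here, so the mapping of efficiency to precision and completion to recall, and the helper name `trajectory_reward`, are illustrative assumptions:

```python
def trajectory_reward(useful_calls: int, total_calls: int,
                      completed: int, required: int) -> float:
    """F1-style trajectory-level reward (illustrative sketch).

    Precision ~ interaction efficiency: fraction of tool calls that were useful.
    Recall    ~ task completion: fraction of required subtasks completed.
    """
    if total_calls == 0 or required == 0:
        return 0.0
    precision = useful_calls / total_calls
    recall = completed / required
    if precision + recall == 0:
        return 0.0
    # Harmonic mean: high only when the agent both finishes the task
    # and avoids redundant tool calls.
    return 2 * precision * recall / (precision + recall)
```

Because the harmonic mean punishes whichever factor is lower, a trajectory that completes the task with many wasted calls scores no better than an efficient one that leaves subtasks unfinished, which matches the stated goal of balancing completion against efficiency.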
Why are we recommending this paper?
Due to your Interest in Agentic RL

This paper directly addresses the training of robust tool-using agents, a critical area given your interest in reinforcement learning and deep learning for agents. The focus on automated data synthesis and verifiable environments aligns well with developing sophisticated agentic RL systems.
University of California
Rate paper: 👍 👎 ♥ Save
AI Insights
  • While there are many challenges associated with applying RL in SE, the potential benefits make it an exciting area of research and development. (ML: 0.98)👍👎
  • RL models may not generalize well to new environments or tasks without sufficient training data. (ML: 0.98)👍👎
  • The application of RL in SE is rapidly evolving, with researchers pushing the boundaries of what can be achieved through this approach. (ML: 0.97)👍👎
  • Reward design is a critical aspect of RL in SE, with researchers investigating different reward functions and mechanisms to encourage desirable behavior in software development processes. (ML: 0.97)👍👎
  • Reward design is a critical aspect of RL in SE, but designing effective reward functions can be challenging. (ML: 0.97)👍👎
  • Reinforcement Learning (RL): A subfield of machine learning that involves training agents to take actions in an environment to maximize a cumulative reward. (ML: 0.97)👍👎
  • Researchers are exploring various techniques including deep reinforcement learning, transfer learning, and multi-task learning to improve the efficiency and effectiveness of RL in SE. (ML: 0.95)👍👎
  • Reinforcement learning is being increasingly applied in software engineering for tasks such as code generation, repair, and summarization. (ML: 0.94)👍👎
  • Software Engineering (SE): The application of engineering principles to the design, development, testing, and maintenance of software systems. (ML: 0.94)👍👎
Abstract
Reinforcement learning is increasingly used for code-centric tasks. These tasks include code generation, summarization, understanding, repair, testing, and optimization. This trend is growing faster with large language models and autonomous agents. A key challenge is how to design reward signals that make sense for software. In many RL problems, the reward is a clear number. In software, this is often not possible. The goal is rarely a single numeric objective. Instead, rewards are usually proxies. Common proxies check if the code compiles, passes tests, or satisfies quality metrics. Many reward designs have been proposed for code-related tasks. However, the work is scattered across areas and papers. There is no single survey that brings these approaches together and shows the full landscape of reward design for RL in software. In this survey, we provide the first systematic and comprehensive review of reward engineering for RL in software tasks. We focus on existing methods and techniques. We structure the literature along three complementary dimensions, summarizing the reward-design choices within each. We conclude with challenges and recommendations in the reward design space for SE tasks.
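The abstract notes that software rewards are usually proxies, such as whether the code compiles or passes tests. A minimal sketch of one such composite proxy reward follows; the weighting (0.2 for parsing, 0.8 scaled by test pass rate) is an illustrative assumption, not a scheme from the survey:

```python
def proxy_reward(code: str, tests: list[str]) -> float:
    """Composite proxy reward for a candidate Python snippet:
    does it parse, and how many assertion-style tests pass?"""
    try:
        compiled = compile(code, "<candidate>", "exec")
    except SyntaxError:
        return 0.0  # non-parsing code earns nothing
    namespace: dict = {}
    exec(compiled, namespace)  # define the candidate's functions
    passed = 0
    for t in tests:
        try:
            exec(t, namespace)
            passed += 1
        except AssertionError:
            pass
    # 0.2 for compiling plus 0.8 scaled by pass rate (assumed weights)
    return 0.2 + 0.8 * (passed / len(tests) if tests else 1.0)
```

A real reward for RL training would sandbox execution and add timeouts; the sketch only shows how several binary proxies combine into one scalar signal.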
Why are we recommending this paper?
Due to your Interest in Reinforcement Learning

Given your interest in reinforcement learning for software tasks, this paper’s exploration of reward signal design is highly relevant. It tackles a fundamental challenge in applying RL to code-centric problems.
University of Alberta
Rate paper: 👍 👎 ♥ Save
AI Insights
  • The study focuses solely on Java projects, which may not be representative of other programming languages or ecosystems. (ML: 0.98)👍👎
  • Code agents primarily perform annotation related edits rather than structural refactoring. (ML: 0.96)👍👎
  • Agents introduced code smells across pull requests at a rate statistically indistinguishable from human developers, with the exception of Cursor. (ML: 0.96)👍👎
  • Code smells: Defects or weaknesses in the design of software that can make it harder to maintain and modify. (ML: 0.94)👍👎
  • While code agents can autonomously perform refactorings, their refactoring commits remain limited in their ability to improve code quality, introducing code smells at a similar rate as developers. (ML: 0.94)👍👎
  • Agentic refactorings have limitations in improving code quality. (ML: 0.92)👍👎
  • Agentic refactorings: Refactorings performed by autonomous coding agents. (ML: 0.92)👍👎
  • The study highlights the need for future research into agentic refactoring strategies that not only automate changes but also improve code quality through a more diverse and structural set of refactorings. (ML: 0.91)👍👎
  • Pull requests: A feature of version control systems like Git, allowing multiple developers to review and discuss changes before they are merged into a shared codebase. (ML: 0.89)👍👎
  • Agentic refactorings are often different from developer refactorings in terms of refactoring actions. (ML: 0.87)👍👎
Abstract
Software development agents such as Claude Code, GitHub Copilot, Cursor Agent, Devin, and OpenAI Codex are being increasingly integrated into developer workflows. While prior work has evaluated agent capabilities for code completion and task automation, there is little work investigating how these agents perform Java refactoring in practice, the types of changes they make, and their impact on code quality. In this study, we present the first analysis of agentic refactoring pull requests in Java, comparing them to developer refactorings across 86 projects per group. Using RefactoringMiner and DesigniteJava 3.0, we identify refactoring types and detect code smells before and after refactoring commits. Our results show that agent refactorings are dominated by annotation changes (the 5 most common refactoring types done by agents are annotation related), in contrast to the diverse structural improvements typical of developers. Despite these differences in refactoring types, we find Cursor to be the only model to show a statistically significant increase in refactoring smells.
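The study's key finding, that agent refactorings are dominated by annotation changes rather than structural improvements, rests on tallying refactoring types detected by RefactoringMiner. A small sketch of that tally follows; the type names mirror RefactoringMiner's vocabulary, but the annotation-type set and sample data here are illustrative, not the study's:

```python
from collections import Counter

# A subset of RefactoringMiner's annotation-related refactoring types
# (assumed representative; the study's exact set is not shown here).
ANNOTATION_TYPES = {
    "Add Method Annotation", "Remove Method Annotation",
    "Modify Method Annotation", "Add Class Annotation",
    "Remove Class Annotation",
}

def tally(refactorings: list[str]) -> dict[str, int]:
    """Split detected refactoring types into annotation-related
    vs. structural, the comparison at the heart of the study."""
    counts = Counter(
        "annotation" if r in ANNOTATION_TYPES else "structural"
        for r in refactorings
    )
    return {"annotation": counts["annotation"],
            "structural": counts["structural"]}
```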
Why are we recommending this paper?
Due to your Interest in Agentic RL

This research investigates the practical behavior of software agents, a key area within agentic RL. The empirical study offers valuable insights into how these agents are currently being utilized.
Meta FAIR
Rate paper: 👍 👎 ♥ Save
Paper visualization
Rate image: 👍 👎
AI Insights
  • For example, some methods use techniques like hierarchical planning or learning to reduce the complexity of the search space. (ML: 0.97)👍👎
  • Search algorithm: An algorithm used to find the best policy or action given a set of possible policies or actions. (ML: 0.93)👍👎
  • A key challenge in MBRL is the difficulty of searching over the space of possible actions and policies. (ML: 0.90)👍👎
  • Model-based reinforcement learning (MBRL): A type of reinforcement learning that uses a model of the environment to plan and make decisions. (ML: 0.90)👍👎
  • Model-based reinforcement learning (MBRL) has been shown to be effective in various tasks, but it often relies on search algorithms that can be computationally expensive and difficult to scale. (ML: 0.88)👍👎
  • This is because the number of possible actions and policies grows exponentially with the size of the state space, making it impractical to explore all possibilities. (ML: 0.88)👍👎
  • Value function: A function that estimates the expected return or reward for taking a particular action in a given state. (ML: 0.87)👍👎
  • Despite these challenges, researchers have made significant progress in developing more efficient and effective search algorithms for MBRL. (ML: 0.80)👍👎
  • The development of more efficient search algorithms is crucial for the success of MBRL in real-world applications, where computational resources are often limited. (ML: 0.78)👍👎
  • Recent work has focused on developing more efficient search algorithms for MBRL, such as using neural networks to approximate the value function or using techniques like Monte Carlo tree search to reduce the computational cost. (ML: 0.74)👍👎
  • The difficulty of search in MBRL is often attributed to the curse of dimensionality, which makes it challenging to represent and explore high-dimensional state spaces. (ML: 0.69)👍👎
Abstract
This paper investigates search in model-based reinforcement learning (RL). Conventional wisdom holds that long-term predictions and compounding errors are the primary obstacles for model-based RL. We challenge this view, showing that search is not a plug-and-play replacement for a learned policy. Surprisingly, we find that search can harm performance even when the model is highly accurate. Instead, we show that mitigating distribution shift matters more than improving model or value function accuracy. Building on this insight, we identify key techniques for enabling effective search, achieving state-of-the-art performance across multiple popular benchmark domains.
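The failure mode the paper highlights can be seen in even the simplest form of decision-time search. The sketch below (a generic one-step lookahead, not the paper's algorithm) picks the action maximizing r + γ·V(s′) under a learned model; note that V is queried on model-predicted states s′, which may lie outside the distribution V was trained on, which is the distribution shift the paper argues matters more than model accuracy:

```python
def one_step_search(state, actions, model, value_fn, gamma=0.99):
    """Greedy one-step lookahead with a learned model.

    model(s, a) -> (next_state, reward); value_fn(s) -> estimated return.
    The value function is evaluated on *model-generated* states, so even
    an accurate model can steer it into poorly-estimated regions.
    """
    best_action, best_q = None, float("-inf")
    for a in actions:
        next_state, reward = model(state, a)
        q = reward + gamma * value_fn(next_state)
        if q > best_q:
            best_action, best_q = a, q
    return best_action, best_q
```

Used with toy deterministic functions (e.g. a 1-D "move toward zero" task), the search picks the action whose predicted successor has the highest backed-up value.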
Why are we recommending this paper?
Due to your Interest in Reinforcement Learning

This paper challenges conventional approaches to model-based RL, focusing on the difficulty of search – a core component of many RL algorithms. Understanding these limitations is crucial for developing more effective RL systems.
Zhejiang University
Rate paper: 👍 👎 ♥ Save
AI Insights
  • Reinforcement learning: A type of machine learning where an agent learns to take actions in an environment to maximize a reward signal. (ML: 0.97)👍👎
  • The paper presents a novel approach to developing artificial mechanics intuitions from extremely small data. (ML: 0.92)👍👎
  • Artificial mechanics intuitions: The ability of artificial systems to understand and reason about physical phenomena in a way that is similar to humans. (ML: 0.92)👍👎
  • The proposed method involves training a neural network to predict the behavior of a system based on its input parameters, and then using this model to optimize the system's design or operation. (ML: 0.90)👍👎
  • Training the neural network can still demand substantial computational resources, even though the method needs only a handful of observations. (ML: 0.88)👍👎
  • The proposed method has the potential to revolutionize the field of artificial intelligence by enabling machines to learn and reason about complex physical phenomena in a way that is similar to humans. (ML: 0.88)👍👎
  • Physics-based models: Mathematical models that describe the behavior of physical systems, such as mechanical systems. (ML: 0.88)👍👎
  • The approach may not be suitable for all types of physical systems, particularly those with complex or nonlinear dynamics. (ML: 0.86)👍👎
  • The authors propose using reinforcement learning to learn physics-based models of mechanical systems, which can be used for tasks such as predicting the behavior of complex systems and optimizing their performance. (ML: 0.85)👍👎
  • The approach can be used for a wide range of applications, including robotics, materials science, and engineering design optimization. (ML: 0.76)👍👎
Abstract
Humans can infer accurate mechanical outcomes from only a few observations, a capability known as mechanics intuition. The mechanisms behind such data-efficient learning remain unclear. Here, we propose a reinforcement learning framework in which an agent encodes continuous physical observation parameters into its state and is trained via episodic switching across closely related observations. With merely two or three observations, the agent acquires robust mechanics intuition that generalizes accurately over wide parameter ranges, substantially beyond the training data, as demonstrated on the brachistochrone and a large-deformation elastic plate. We explain this generalization through a unified theoretical view: it emerges when the learned value function enforces Bellman consistency across neighboring task parameters, rendering the Bellman residual stationary with respect to physical variations. This induces a smooth policy that captures a low-dimensional solution manifold underlying the continuum of tasks. Our work establishes episodic switching as a principled route to artificial mechanics intuition and offers a theoretical link to similar generalization abilities in biological learners.
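The two training ingredients the abstract names, encoding the continuous physical parameter into the agent's state and switching among closely related observations across episodes, can be sketched as below. The round-robin switching rule and helper names are illustrative assumptions; the paper's exact schedule is not given here:

```python
import itertools

def episodic_switching(params, num_episodes):
    """Yield (episode, parameter) pairs, switching the physical
    observation parameter every episode (plain round-robin here)."""
    cycle = itertools.cycle(params)
    for episode in range(num_episodes):
        yield episode, next(cycle)

def augmented_state(obs, param):
    """Append the continuous physical parameter to the observation,
    so a single value function spans the whole family of tasks and
    can stay Bellman-consistent across neighboring parameters."""
    return (*obs, param)
```

An RL training loop would draw each episode's parameter from `episodic_switching` and feed `augmented_state` observations to the agent; per the paper, generalization emerges when the learned value function's Bellman residual becomes stationary with respect to these physical variations.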
Why are we recommending this paper?
Due to your Interest in Deep Learning for Reinforcement Learning

The focus on data-efficient learning and mimicking human intuition in physical systems aligns with your interest in reinforcement learning. This work explores a novel approach to learning complex behaviors from limited data.
Technical University of Munich
Rate paper: 👍 👎 ♥ Save
AI Insights
  • The authors discuss the implications of their work for the development of more robust and reliable pre-trained language models. (ML: 0.97)👍👎
  • Total-variation distance: A measure of the largest difference in probability assigned by two distributions to any event. (ML: 0.94)👍👎
  • KL-penalty: A penalty term in the GRPO objective function that bounds the average per-prompt KL divergence between the fine-tuned policy and the original policy. (ML: 0.88)👍👎
  • Forbidden tokens: Tokens that are not allowed to be emitted by a language model. (ML: 0.87)👍👎
  • The authors propose a novel approach called PURGE for unlearning forbidden tokens in pre-trained language models. (ML: 0.85)👍👎
  • They also show that the average task loss of GRPO converges to that of the optimal retain-only model at the standard O(1/√T) rate. (ML: 0.85)👍👎
  • The authors propose a novel approach for unlearning forbidden tokens in pre-trained language models, which is effective in reducing leakage while preserving performance on retained tasks. (ML: 0.84)👍👎
  • The authors prove that GRPO converges to a policy that has zero probability of emitting forbidden tokens in the limit as the number of iterations increases, under certain assumptions. (ML: 0.69)👍👎
  • The authors provide an empirical evaluation of PURGE on several benchmarks and demonstrate its effectiveness in reducing forbidden token leakage while preserving performance on retained tasks. (ML: 0.63)👍👎
  • They build on GRPO, an extension of the Proximal Policy Optimization (PPO) algorithm, modifying its objective with a penalty term to reduce the probability of emitting forbidden tokens. (ML: 0.60)👍👎
  • The GRPO-based objective converges to a policy with zero probability of emitting forbidden tokens and has a standard O(1/√T) convergence rate. (ML: 0.59)👍👎
  • The empirical evaluation demonstrates the effectiveness of PURGE on several benchmarks, showing that it can reduce forbidden token leakage by up to 90% while preserving performance on retained tasks. (ML: 0.59)👍👎
  • GRPO: Group Relative Policy Optimization, a PPO-style algorithm whose objective is modified here with a penalty term to reduce the probability of emitting forbidden tokens. (ML: 0.57)👍👎
Abstract
During pretraining, LLMs inadvertently memorize sensitive or copyrighted data, posing significant compliance challenges under legal frameworks like the GDPR and the EU AI Act. Fulfilling these mandates demands techniques that can remove information from a deployed model without retraining from scratch. Existing unlearning approaches attempt to address this need, but often leak the very data they aim to erase, sacrifice fluency and robustness, or depend on costly external reward models. We introduce PURGE (Policy Unlearning through Relative Group Erasure), a novel method grounded in the Group Relative Policy Optimization framework that formulates unlearning as a verifiable problem. PURGE uses an intrinsic reward signal that penalizes any mention of forbidden concepts, allowing safe and consistent unlearning. Our approach reduces token usage per target by up to a factor of 46 compared with SotA methods, while improving fluency by 5.48 percent and adversarial robustness by 12.02 percent over the base model. On the Real World Knowledge Unlearning (RWKU) benchmark, PURGE achieves 11 percent unlearning effectiveness while preserving 98 percent of original utility. PURGE shows that framing LLM unlearning as a verifiable task enables more reliable, efficient, and scalable forgetting, suggesting a promising new direction for unlearning research that combines theoretical guarantees, improved safety, and practical deployment efficiency.
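Two pieces of the approach described above can be sketched concretely: an intrinsic reward that penalizes any emission of a forbidden token (with a KL penalty keeping the fine-tuned policy near the original), and GRPO's group-relative advantage. The ±1 reward values and β coefficient are illustrative assumptions, not the paper's constants:

```python
from statistics import mean, pstdev

def purge_reward(tokens, forbidden, kl_divergence, beta=0.1):
    """PURGE-style intrinsic reward sketch: any forbidden token makes
    the completion negative; a KL penalty discourages drifting far
    from the original policy (assumed values, not the paper's)."""
    base = -1.0 if any(t in forbidden for t in tokens) else 1.0
    return base - beta * kl_divergence

def group_relative_advantages(rewards):
    """GRPO's defining step: standardize rewards within a group of
    sampled completions for the same prompt, replacing a learned
    value baseline."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]
```

In training, each prompt's sampled completions would be scored with `purge_reward`, standardized by `group_relative_advantages`, and used in a PPO-style clipped update; the reward is "verifiable" in the abstract's sense because forbidden-token membership is checked deterministically.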
Why are we recommending this paper?
Due to your Interest in Deep Learning for Reinforcement Learning