🎯 Top Personalized Recommendations
NovaSky AI
Why we think this paper is great for you:
This paper directly addresses efficient reinforcement learning training for multi-turn agents, specifically within the context of LLMs. It offers insights into practical frameworks for developing advanced RL agents.
Abstract
We introduce SkyRL-Agent, a framework for efficient, multi-turn, long-horizon agent training and evaluation. It provides efficient asynchronous dispatching, lightweight tool integration, and flexible backend interoperability, enabling seamless use with existing RL frameworks such as SkyRL-train, VeRL, and Tinker.
Using SkyRL-Agent, we train SA-SWE-32B, a software engineering agent trained from Qwen3-32B (24.4% Pass@1) purely with reinforcement learning. We introduce two key components: an optimized asynchronous pipeline dispatcher that achieves a 1.55x speedup over naive asynchronous batching, and a tool-enhanced training recipe leveraging an AST-based search tool to facilitate code navigation, boost rollout Pass@K, and improve training efficiency. Together, these optimizations enable SA-SWE-32B to reach 39.4% Pass@1 on SWE-Bench Verified with more than 2x cost reduction compared to prior models reaching similar performance. Despite being trained solely on SWE tasks, SA-SWE-32B generalizes effectively to other agentic tasks, including Terminal-Bench, BrowseComp-Plus, and WebArena. We further demonstrate SkyRL-Agent's extensibility through case studies on deep research, computer use, and memory agents, each trained using a different training backend.
AI Summary
- Train specialized agents (e.g., SA-SWE-32B) purely with RL to achieve state-of-the-art performance (39.4% Pass@1 on SWE-Bench Verified) with over 2x cost reduction compared to prior models reaching similar performance. [3]
- Implement an optimized asynchronous pipeline dispatcher to achieve a 1.55x speedup in multi-turn RL agent training by overlapping CPU-bound and GPU-bound operations, maintaining ~90% GPU utilization. [2]
- Adopt a transition-based data representation to mitigate inference-training engine mismatch, guarantee token-level fidelity, and enable greater algorithmic flexibility for diverse RL algorithms beyond mask-based approaches. [2]
- Design a modular framework with a tool-centric task interface, fine-grained dispatcher, and backend bridge to enable seamless integration of new tasks/tools and flexible interoperability with various RL training systems. [2]
- Utilize contextual hints during multi-turn RL training to help agents recover from failed actions and re-enter valid trajectories, substantially improving trajectory quality and stabilizing rollout collection. [2]
- Demonstrate that agents trained on specific long-horizon, multi-tool tasks (like SWE) can effectively generalize to other agentic benchmarks (Terminal-Bench, BrowseComp-Plus, WebArena), indicating transferable tool-use competence. [2]
- SkyRL-Agent: A framework for efficient, multi-turn, long-horizon agent training and evaluation, providing asynchronous dispatching, lightweight tool integration, and flexible backend interoperability. [2]
- Asynchronous Pipeline Dispatcher: An optimized scheduling strategy within SkyRL-Agent that decomposes rollouts into stages (initialization, agent run, reward calculation) and overlaps CPU-bound and GPU-bound operations to maximize hardware utilization (see the dispatcher sketch after this list). [2]
- Tool-enhanced Training Recipe: A methodology that integrates advanced tools, such as an AST-based search tool with contextual hints, into the agent's environment to facilitate navigation and improve learning efficiency in RL training. [2]
- Leverage an AST-based search tool within a tool-enhanced training recipe to significantly improve code navigation, boost rollout Pass@K, and enhance sample efficiency in software engineering tasks (a minimal search-tool sketch also follows this list). [1]
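To make the dispatcher idea above concrete, here is a minimal sketch of a pipelined rollout loop, assuming Python's asyncio and purely hypothetical stage functions (init_env, run_agent, compute_reward); it illustrates how CPU-bound and GPU-bound stages can overlap, not SkyRL-Agent's actual API.

```python
# Minimal pipelined-dispatcher sketch (illustrative only, not SkyRL-Agent's API).
import asyncio
import time

async def init_env(task):
    # CPU-bound setup (e.g., clone repo, build sandbox) offloaded to a worker thread
    await asyncio.to_thread(time.sleep, 0.1)   # stand-in for real setup work
    return {"task": task}

async def run_agent(env):
    # GPU-bound multi-turn generation; awaiting the inference engine yields the
    # event loop so other rollouts' CPU-bound stages can proceed in parallel
    await asyncio.sleep(0.1)                   # stand-in for model calls
    return {"env": env, "trajectory": []}

async def compute_reward(rollout):
    # CPU-bound verification (e.g., run unit tests) in a worker thread
    await asyncio.to_thread(time.sleep, 0.1)
    return {"reward": 0.0, **rollout}

async def dispatch(tasks, max_inflight=8):
    sem = asyncio.Semaphore(max_inflight)      # bound concurrent GPU work
    async def pipeline(task):
        env = await init_env(task)             # init overlaps with other rollouts
        async with sem:
            rollout = await run_agent(env)
        return await compute_reward(rollout)   # reward overlaps with later rollouts
    return await asyncio.gather(*(pipeline(t) for t in tasks))

# results = asyncio.run(dispatch(my_tasks))
```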
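And here is a hedged illustration of what an AST-based code-search tool can look like, using Python's standard ast module; the function name and return format are assumptions for illustration, not the tool shipped with the training recipe.

```python
# Illustrative AST-based search: find definitions whose name matches a query,
# returning line spans so an agent can jump straight to the relevant code.
import ast

def search_definitions(path: str, query: str):
    with open(path, "r", encoding="utf-8") as f:
        tree = ast.parse(f.read(), filename=path)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if query.lower() in node.name.lower():
                hits.append({
                    "name": node.name,
                    "kind": type(node).__name__,
                    "lines": (node.lineno, node.end_lineno),
                })
    return hits

# Example: search_definitions("models.py", "forward")
```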
NYCU
Why we think this paper is great for you:
This research focuses on designing human-like reinforcement learning agents, a key area within agentic RL. It explores novel methods for agent design, which is highly relevant to your interest in advanced RL agent development.
Abstract
Human-like agents have long been one of the goals in the pursuit of artificial intelligence. Although reinforcement learning (RL) has achieved superhuman performance in many domains, relatively little attention has been paid to designing human-like RL agents. As a result, many reward-driven RL agents often exhibit unnatural behaviors compared to humans, raising concerns about both interpretability and trustworthiness. To achieve human-like behavior in RL, this paper first formulates human-likeness as trajectory optimization, where the objective is to find an action sequence that closely aligns with human behavior while also maximizing rewards, and adapts classic receding-horizon control to human-like learning as a tractable and efficient implementation. To this end, we introduce Macro Action Quantization (MAQ), a human-like RL framework that distills human demonstrations into macro actions via a Vector-Quantized VAE. Experiments on the D4RL Adroit benchmarks show that MAQ significantly improves human-likeness, increasing trajectory similarity scores and achieving the highest human-likeness rankings among all RL agents in the human evaluation study. Our results also demonstrate that MAQ can be easily integrated into various off-the-shelf RL algorithms, opening a promising direction for learning human-like RL agents. Our code is available at https://rlg.iis.sinica.edu.tw/papers/MAQ.
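For intuition on the macro-action idea, the sketch below shows the codebook-lookup step of quantizing an action chunk, assuming a codebook already learned by a VQ-VAE; shapes and names are illustrative and not MAQ's actual implementation.

```python
# Rough sketch of the codebook-lookup step behind macro-action quantization.
import numpy as np

def quantize_macro_action(action_chunk: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map a length-k chunk of continuous actions to its nearest macro action.

    action_chunk: (k, action_dim) segment of a demonstration or rollout
    codebook:     (num_codes, k, action_dim) macro actions learned by a VQ-VAE
    """
    flat = action_chunk.reshape(1, -1)                # (1, k*action_dim)
    codes = codebook.reshape(codebook.shape[0], -1)   # (num_codes, k*action_dim)
    dists = np.linalg.norm(codes - flat, axis=1)      # L2 distance to each code
    return codebook[np.argmin(dists)]                 # nearest macro action

# At decision time, a policy would choose among codebook indices (discrete macro
# actions) and execute the selected k-step sequence, in the spirit of
# receding-horizon control.
```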
Northwestern Polytechnic
Why we think this paper is great for you:
You will find this paper highly relevant as it extends a reinforcement learning method originally developed for fine-tuning large language models to representation learning models. This directly connects to your interest in deep learning applied to reinforcement learning.
Abstract
Group Relative Policy Optimization (GRPO), a reinforcement learning method used to fine-tune large language models (LLMs), has proved its effectiveness in practical applications such as DeepSeek-R1. This raises the question of whether GRPO can be generalized to representation learning models. In this paper, we propose Group Relative Policy Optimization for Representation Model (GRPO-RM) and investigate the performance of a GRPO-like policy in post-training representation models. Specifically, our method establishes a predefined output set to functionally replace token-sequence sampling in LLMs, thereby generating an output group, which is essential for the probability-driven optimization of GRPO. In addition, a specialized reward function is designed to accommodate the properties of representation models. Extensive experiments are conducted on various real-world datasets to validate the effectiveness of our proposed method.
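As a quick reminder of the group-relative machinery this paper builds on, here is a minimal sketch of the standard GRPO advantage computation over an output group; the representation-model-specific reward design is the paper's contribution and is not reproduced here.

```python
# Minimal sketch of group-relative advantages (generic GRPO math).
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: (group_size,) rewards for the candidates in one output group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# For a representation model, the "group" could be scores over a predefined
# output set (e.g., candidate assignments), each scored by a task-specific
# reward; the normalized advantages then weight the log-probability gradient
# exactly as in GRPO for LLMs.
rewards = np.array([0.2, 0.9, 0.5, 0.1])
print(group_relative_advantages(rewards))
```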
Shanghai AI Laboratory
Why we think this paper is great for you:
This paper demonstrates the application of reinforcement learning to complex, science-grade reasoning problems, leveraging large language models. It showcases the power of RL in challenging domains.
Abstract
Recent progress in large language models (LLMs) has moved the frontier from puzzle-solving to science-grade reasoning: the kind needed to tackle problems whose answers must stand against nature, not merely fit a rubric. Physics is the sharpest test of this shift: it binds symbols to reality in a fundamental way and serves as the cornerstone of most modern technologies. In this work, we advance physics research by developing large language models with exceptional physics reasoning capabilities, especially in solving Olympiad-level physics problems. We introduce P1, a family of open-source physics reasoning models trained entirely through reinforcement learning (RL). Among them, P1-235B-A22B is the first open-source model with Gold-medal performance at the latest International Physics Olympiad (IPhO 2025), and it wins 12 gold medals out of 13 international/regional physics competitions in 2024/2025. P1-30B-A3B also surpasses almost all other open-source models on IPhO 2025, earning a silver medal. Further equipped with the agentic framework PhysicsMinions, P1-235B-A22B+PhysicsMinions achieves overall No.1 on IPhO 2025 and obtains the highest average score across the 13 physics competitions. Beyond physics, the P1 models also perform strongly on other reasoning tasks such as math and coding, demonstrating the strong generalizability of the P1 series.
Princeton University
Why we think this paper is great for you:
This work applies reinforcement learning to a practical problem of optimal execution in financial markets. It provides a concrete example of how reinforcement learning can be used to solve real-world optimization challenges.
Abstract
We investigate the use of Reinforcement Learning for the optimal execution of meta-orders, where the objective is to execute incrementally large orders while minimizing implementation shortfall and market impact over an extended period of time. Departing from traditional parametric approaches to price dynamics and impact modeling, we adopt a model-free, data-driven framework. Since policy optimization requires counterfactual feedback that historical data cannot provide, we employ the Queue-Reactive Model to generate realistic and tractable limit order book simulations that encompass transient price impact, and nonlinear and dynamic order flow responses. Methodologically, we train a Double Deep Q-Network agent on a state space comprising time, inventory, price, and depth variables, and evaluate its performance against established benchmarks. Numerical simulation results show that the agent learns a policy that is both strategic and tactical, adapting effectively to order book conditions and outperforming standard approaches across multiple training configurations. These findings provide strong evidence that model-free Reinforcement Learning can yield adaptive and robust solutions to the optimal execution problem.
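For readers new to the agent architecture, the snippet below sketches the standard Double DQN target such an agent optimizes, with an illustrative state vector of time, inventory, price, and depth features; it is generic Double DQN math, not the paper's exact implementation.

```python
# Hedged sketch of the Double DQN target (generic; state layout is illustrative).
import numpy as np

def double_dqn_target(reward, next_state, done, q_online, q_target, gamma=0.99):
    """q_online / q_target: callables mapping a state to a vector of Q-values,
    one per action (e.g., child-order sizes or aggressiveness levels)."""
    if done:
        return reward
    best_action = int(np.argmax(q_online(next_state)))          # chosen by online net
    return reward + gamma * q_target(next_state)[best_action]   # evaluated by target net

# A state in the paper's spirit: remaining time, remaining inventory,
# a price feature, and order-book depth features.
state = np.array([0.5, 0.3, 0.01, 1.2])
```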