University of Massachusetts
Abstract
Policy evaluation is often a prerequisite for deploying safety- and
performance-critical systems. Existing evaluation approaches frequently suffer
from high variance due to limited data and long-horizon tasks, or high bias due
to unequal support or inaccurate environmental models. We posit that these
challenges arise, in part, from the standard reinforcement learning (RL)
paradigm of policy learning without explicit consideration of evaluation. As an
alternative, we propose evaluation-aware reinforcement learning (EvA-RL), in
which a policy is trained to maximize expected return while simultaneously
minimizing expected evaluation error under a given value prediction scheme --
in other words, being "easy" to evaluate. We formalize a framework for EvA-RL
and design an instantiation that enables accurate policy evaluation,
conditioned on a small number of rollouts in an assessment environment that can
be different than the deployment environment. However, our theoretical analysis
and empirical results show that there is often a tradeoff between evaluation
accuracy and policy performance when using a fixed value-prediction scheme
within EvA-RL. To mitigate this tradeoff, we extend our approach to co-learn an
assessment-conditioned state-value predictor alongside the policy. Empirical
results across diverse discrete and continuous action domains demonstrate that
EvA-RL can substantially reduce evaluation error while maintaining competitive
returns. This work lays the foundation for a broad new class of RL methods that
treat reliable evaluation as a first-class principle during training.
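As a rough illustration of the combined objective described above, the sketch below adds a mean-squared evaluation-error penalty to a REINFORCE-style policy-gradient loss. The function name eva_rl_loss, the weighting coefficient lam, and the use of returns-to-go as prediction targets are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def eva_rl_loss(log_probs: torch.Tensor,
                returns: torch.Tensor,
                value_preds: torch.Tensor,
                lam: float = 0.5) -> torch.Tensor:
    """Hypothetical evaluation-aware policy-gradient loss for one rollout.

    log_probs   : (T,) log pi(a_t | s_t)
    returns     : (T,) observed returns-to-go G_t
    value_preds : (T,) estimates from a (possibly co-learned) value predictor
    lam         : weight trading off return maximization vs. evaluation error
    """
    # Standard REINFORCE surrogate: maximize expected return.
    pg_loss = -(log_probs * returns).mean()

    # Evaluation-error penalty: also favor behavior whose value the predictor
    # can estimate accurately, i.e. a policy that is "easy" to evaluate.
    eval_error = (value_preds - returns).pow(2).mean()

    return pg_loss + lam * eval_error
```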
AI Insights - EvA-RL embeds a learned value predictor into the policy-gradient loss, penalizing evaluation error during training.
- A behavioral encoding captures trajectory statistics, enabling accurate performance estimation in a separate assessment environment (one possible encoding is sketched after this list).
- When the predictor's predictability coefficient exceeds 0.8, EvA-RL outperforms vanilla policy gradients in return.
- Predictor training adds computational overhead, so hyper-parameter tuning is critical for the speed-accuracy balance.
- Co-learning the policy with an assessment-conditioned value network mitigates the return-evaluation tradeoff.
- Deep Reinforcement Learning: A Brief Introduction offers foundational concepts that complement EvA-RL's framework.
- Proximal Policy Optimization provides a stable baseline against which to benchmark EvA-RL's performance gains.
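Building on the insights about behavioral encodings and the co-learned, assessment-conditioned value predictor, the sketch below assumes the encoding is a simple mean of per-step features pooled over a few assessment-environment rollouts. The class name, the pooling choice, and the MLP architecture are assumptions for illustration; the paper's actual encoding of trajectory statistics may differ.

```python
import torch
import torch.nn as nn

class AssessmentConditionedValuePredictor(nn.Module):
    """Hypothetical predictor of a policy's state values, conditioned on a
    behavioral encoding computed from a few assessment-environment rollouts."""

    def __init__(self, state_dim: int, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    @staticmethod
    def behavioral_encoding(rollouts: list) -> torch.Tensor:
        # rollouts: list of (T_i, feat_dim) tensors of per-step features
        # (e.g. state, action, reward) gathered in the assessment environment.
        # Mean-pooling is one simple way to summarize trajectory statistics.
        return torch.cat(rollouts, dim=0).mean(dim=0)

    def forward(self, states: torch.Tensor, encoding: torch.Tensor) -> torch.Tensor:
        # states: (B, state_dim); encoding: (feat_dim,), shared across the batch.
        enc = encoding.unsqueeze(0).expand(states.shape[0], -1)
        return self.net(torch.cat([states, enc], dim=-1)).squeeze(-1)
```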
Tencent, The Chinese University of Hong Kong
Abstract
The growing disparity between the exponential scaling of computational
resources and the finite growth of high-quality text data now constrains
conventional scaling approaches for large language models (LLMs). To address
this challenge, we introduce Reinforcement Learning on Pre-Training data
(RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast
to prior approaches that scale training primarily through supervised learning,
RLPT enables the policy to autonomously explore meaningful trajectories to
learn from pre-training data and improve its capability through reinforcement
learning (RL). While existing RL strategies such as reinforcement learning from
human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR)
rely on human annotation for reward construction, RLPT eliminates this
dependency by deriving reward signals directly from pre-training data.
Specifically, it adopts a next-segment reasoning objective, rewarding the
policy for accurately predicting subsequent text segments conditioned on the
preceding context. This formulation allows RL to be scaled on pre-training
data, encouraging the exploration of richer trajectories across broader
contexts and thereby fostering more generalizable reasoning skills. Extensive
experiments on both general-domain and mathematical reasoning benchmarks across
multiple models validate the effectiveness of RLPT. For example, when applied
to Qwen3-4B-Base, RLPT yields absolute improvements of 3.0, 5.1, 8.1,
6.0, 6.6, and 5.3 points on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and
AIME25, respectively. The results further demonstrate favorable scaling
behavior, suggesting strong potential for continued gains with more compute. In
addition, RLPT provides a solid foundation, extending the reasoning boundaries
of LLMs and enhancing RLVR performance.
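To make the next-segment reasoning objective concrete, the sketch below splits a pre-training document into (context, next-segment) pairs and scores a generated continuation by its textual overlap with the true next segment. The segment length, the use of difflib.SequenceMatcher as the scorer, and the helper names are assumptions; the paper's reward may rely on a different verifier.

```python
from difflib import SequenceMatcher

def make_pairs(document: str, segment_len: int = 200):
    """Yield (context, next_segment) pairs from one pre-training document."""
    segments = [document[i:i + segment_len]
                for i in range(0, len(document), segment_len)]
    for i in range(1, len(segments)):
        yield "".join(segments[:i]), segments[i]

def next_segment_reward(generated: str, reference: str) -> float:
    """Score a generated continuation against the segment that actually
    follows the context in the corpus; no human annotation is required."""
    return SequenceMatcher(None, generated, reference).ratio()

# Usage sketch: for each (context, target) pair, sample a continuation from
# the policy, score it with next_segment_reward, and update the policy with
# any standard RL algorithm.
```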
AI Insights - RLPT turns pre-training corpora into a self-rewarding playground, letting the model chase its own next-segment predictions.
- By forgoing human annotators, RLPT slashes reward-engineering costs while keeping the signal fresh and data-rich.
- The next-segment objective turns every token into a mini-quiz, pushing the policy to explore longer, richer trajectories.
- On Qwen3-4B-Base, RLPT lifts benchmark scores by up to 8.1 absolute points (on GPQA-Diamond), showing that self-derived rewards alone can drive large gains.
- RLPT's scaling curve is steep: more compute yields more reasoning power, hinting at larger gains for future models.
- The framework dovetails with RLVR, with RLPT's self-derived rewards providing a foundation that further enhances verifiable-reward fine-tuning.
- RLPT lets LLMs learn from their own pre-training data, sparking curiosity-driven exploration without external guidance.