Papers from 15 to 19 September, 2025

Here are your personalized paper recommendations, sorted by relevance.
Deep Learning for Reinforcement Learning
University of North Carolina
Abstract
We develop a deep learning algorithm for approximating functional rational expectations equilibria of dynamic stochastic economies in the sequence space. We use deep neural networks to parameterize equilibrium objects of the economy as a function of truncated histories of exogenous shocks. We train the neural networks to fulfill all equilibrium conditions along simulated paths of the economy. To illustrate the performance of our method, we solve three economies of increasing complexity: the stochastic growth model, a high-dimensional overlapping generations economy with multiple sources of aggregate risk, and finally an economy where households and firms face uninsurable idiosyncratic risk, shocks to aggregate productivity, and shocks to idiosyncratic and aggregate volatility. Furthermore, we show how to design practical neural policy function architectures that guarantee monotonicity of the predicted policies, facilitating the use of the endogenous grid method to simplify parts of our algorithm.
AI Insights
  • Equilibrium objects are parameterized as functions of truncated exogenous shock histories, enabling temporal learning.
  • Training enforces all equilibrium conditions along simulated paths, avoiding iterative solves.
  • Monotonicity is guaranteed by neural architectures, simplifying endogenous grid use.
  • The method scales to high‑dimensional overlapping‑generations models with multiple aggregate risks.
  • A test case adds idiosyncratic risk, productivity shocks, and stochastic volatility, showing robustness.
  • The paper reviews numerical, projection, and machine‑learning methods, positioning deep learning as a unifying framework.
  • Suggested readings: Judd’s Numerical Methods in Economics and Druedahl & Ropke’s endogenous grid papers.
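The abstract and insights above note that the policy networks are designed to be monotone so the endogenous grid method can be applied. A common way to obtain such a guarantee is to constrain the weights along the monotone input's path to be non-negative and use non-decreasing activations. The sketch below (PyTorch; the class name, layer sizes, and input names are illustrative assumptions, not the paper's actual architecture) shows the idea for a consumption policy that is non-decreasing in cash-on-hand for any fixed shock history.

```python
# Hedged sketch: enforce monotonicity in one input (m) by keeping the weights
# on that path non-negative (via softplus) and using monotone activations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonePolicyNet(nn.Module):
    """Policy c(m, z): non-decreasing in m (e.g., cash-on-hand),
    unrestricted in the truncated shock-history features z."""

    def __init__(self, n_shock_features: int, hidden: int = 64):
        super().__init__()
        # Raw parameters for the monotone path; softplus keeps them >= 0.
        self.w_m1 = nn.Parameter(torch.randn(hidden, 1) * 0.1)
        self.w_m2 = nn.Parameter(torch.randn(1, hidden) * 0.1)
        # Unconstrained path for the exogenous shock-history features.
        self.z_net = nn.Sequential(
            nn.Linear(n_shock_features, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden),
        )
        self.b1 = nn.Parameter(torch.zeros(hidden))
        self.b2 = nn.Parameter(torch.zeros(1))

    def forward(self, m: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # Non-negative weights + non-decreasing activations (tanh, softplus)
        # make the output non-decreasing in m for any fixed z.
        h = torch.tanh(m @ F.softplus(self.w_m1).T + self.z_net(z) + self.b1)
        out = h @ F.softplus(self.w_m2).T + self.b2
        return F.softplus(out)  # consumption stays positive

# Illustrative usage on a batch of simulated states.
policy = MonotonePolicyNet(n_shock_features=8)
m = torch.rand(32, 1)    # e.g., cash-on-hand
z = torch.randn(32, 8)   # truncated history of exogenous shocks
c = policy(m, z)         # predicted consumption, monotone in m
```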
National University of
Abstract
Autonomous robot exploration (ARE) is the process of a robot autonomously navigating and mapping an unknown environment. Recent Reinforcement Learning (RL)-based approaches typically formulate ARE as a sequential decision-making problem defined on a collision-free informative graph. However, these methods often demonstrate limited reasoning ability over graph-structured data. Moreover, due to the insufficient consideration of robot motion, the resulting RL policies are generally optimized to minimize travel distance, while neglecting time efficiency. To overcome these limitations, we propose GRATE, a Deep Reinforcement Learning (DRL)-based approach that leverages a Graph Transformer to effectively capture both local structure patterns and global contextual dependencies of the informative graph, thereby enhancing the model's reasoning capability across the entire environment. In addition, we deploy a Kalman filter to smooth the waypoint outputs, ensuring that the resulting path is kinodynamically feasible for the robot to follow. Experimental results demonstrate that our method exhibits better exploration efficiency (up to 21.5% in distance and 21.3% in time to complete exploration) than state-of-the-art conventional and learning-based baselines in various simulation benchmarks. We also validate our planner in real-world scenarios.
AI Insights
  • The graph transformer backbone is built upon Graph Attention Networks, enabling efficient message passing across nodes.
  • Training the model demands a large, diverse dataset of exploration trajectories, which can be costly to collect.
  • The inference pipeline is computationally heavy, limiting its deployment in strict real‑time robotic systems.
  • The authors employ a Soft Actor‑Critic policy to balance exploration and exploitation under uncertainty.
  • Ablation studies show that removing the transformer attention drops exploration efficiency by ~12%.
  • The approach was benchmarked on both simulated and physical indoor environments, demonstrating robustness to sensor noise.
  • Future work proposes distilling the transformer into a lightweight policy network for edge deployment.
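The abstract states that a Kalman filter smooths the waypoints emitted by the policy so the resulting path is easier for the robot to follow. The sketch below uses a constant-velocity motion model with illustrative noise covariances; the paper does not publish its exact filter settings, so treat this as one plausible realization rather than the authors' implementation.

```python
# Hedged sketch: constant-velocity Kalman filter over 2D waypoints.
# dt, process_noise, and meas_noise are illustrative values.
import numpy as np

def smooth_waypoints(waypoints: np.ndarray, dt: float = 1.0,
                     process_noise: float = 0.05, meas_noise: float = 0.5) -> np.ndarray:
    """Filter a (T, 2) array of noisy waypoints into a smoother path."""
    # State: [x, y, vx, vy]; measurement: [x, y].
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
    Q = process_noise * np.eye(4)
    R = meas_noise * np.eye(2)

    x = np.array([waypoints[0, 0], waypoints[0, 1], 0.0, 0.0])
    P = np.eye(4)
    smoothed = [waypoints[0]]

    for z in waypoints[1:]:
        # Predict with the constant-velocity model.
        x = F @ x
        P = F @ P @ F.T + Q
        # Update, treating the next raw waypoint as a measurement.
        y = z - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ y
        P = (np.eye(4) - K @ H) @ P
        smoothed.append(x[:2].copy())
    return np.array(smoothed)

# Illustrative usage: jittery policy outputs -> smoother, more followable path.
raw = np.cumsum(np.random.randn(20, 2) * 0.3, axis=0)
path = smooth_waypoints(raw)
```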
Agentic RL
Harbin Institute of Technology
Abstract
Large language models (LLMs) have demonstrated strong capabilities in language understanding and reasoning, yet they remain limited when tackling real-world tasks that require up-to-date knowledge, precise operations, or specialized tool use. To address this, we propose Tool-R1, a reinforcement learning framework that enables LLMs to perform general, compositional, and multi-step tool use by generating executable Python code. Tool-R1 supports integration of user-defined tools and standard libraries, with variable sharing across steps to construct coherent workflows. An outcome-based reward function, combining LLM-based answer judgment and code execution success, guides policy optimization. To improve training efficiency, we maintain a dynamic sample queue to cache and reuse high-quality trajectories, reducing the overhead of costly online sampling. Experiments on the GAIA benchmark show that Tool-R1 substantially improves both accuracy and robustness, achieving about 10% gain over strong baselines, with larger improvements on complex multi-step tasks. These results highlight the potential of Tool-R1 for enabling reliable and efficient tool-augmented reasoning in real-world applications. Our code will be available at https://github.com/YBYBZhang/Tool-R1.
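The abstract describes an outcome-based reward that mixes LLM-based answer judgment with code-execution success, plus a dynamic sample queue that caches high-quality trajectories to cut online-sampling cost. The sketch below is a hedged reading of those two components; the 0.8/0.2 weighting, the judge interface, and the queue policy are assumptions for illustration, not Tool-R1's released code.

```python
# Hedged sketch of an outcome-based reward and a dynamic trajectory cache.
import random
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Tuple

@dataclass
class Trajectory:
    question: str
    code_steps: List[str]   # generated Python snippets, one per tool-use step
    final_answer: str
    executed_ok: bool       # did every step run without raising?

def outcome_reward(traj: Trajectory,
                   llm_judge: Callable[[str, str], float],
                   w_answer: float = 0.8, w_exec: float = 0.2) -> float:
    """Mix an LLM-based answer score in [0, 1] with binary code-execution success."""
    answer_score = llm_judge(traj.question, traj.final_answer)
    exec_score = 1.0 if traj.executed_ok else 0.0
    return w_answer * answer_score + w_exec * exec_score

class SampleQueue:
    """Cache high-reward trajectories for reuse, reducing costly online sampling."""

    def __init__(self, maxlen: int = 512, threshold: float = 0.7):
        self.buffer: Deque[Tuple[Trajectory, float]] = deque(maxlen=maxlen)
        self.threshold = threshold

    def maybe_add(self, traj: Trajectory, reward: float) -> None:
        # Only keep trajectories whose outcome reward clears the quality bar.
        if reward >= self.threshold:
            self.buffer.append((traj, reward))

    def sample_batch(self, k: int) -> List[Tuple[Trajectory, float]]:
        return random.sample(list(self.buffer), min(k, len(self.buffer)))
```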
Minerva CQ, California, USA
Abstract
Despite advances in AI for contact centers, customer experience (CX) continues to suffer from high average handling time (AHT), low first-call resolution, and poor customer satisfaction (CSAT). A key driver is the cognitive load on agents, who must navigate fragmented systems, troubleshoot manually, and frequently place customers on hold. Existing AI-powered agent-assist tools are often reactive, driven by static rules, simple prompting, or retrieval-augmented generation (RAG) without deeper contextual reasoning. We introduce Agentic AI: goal-driven, autonomous, tool-using systems that proactively support agents in real time. Unlike conventional approaches, Agentic AI identifies customer intent, triggers modular workflows, maintains evolving context, and adapts dynamically to conversation state. This paper presents a case study of Minerva CQ, a real-time Agent Assist product deployed in voice-based customer support. Minerva CQ integrates real-time transcription, intent and sentiment detection, entity recognition, contextual retrieval, dynamic customer profiling, and partial conversational summaries, enabling proactive workflows and continuous context-building. Deployed in live production, Minerva CQ acts as an AI co-pilot, delivering measurable improvements in agent efficiency and customer experience across multiple deployments.
AI Insights
  • Minerva CQ’s agentic AI auto‑triggers tools mid‑conversation, turning calls into dynamic workflows.
  • Real‑time summarization keeps a rolling context, letting agents skip manual KB lookups.
  • Intent, sentiment, and entities are treated as first‑class signals, cutting cognitive load and AHT by up to 30%.
  • Selective retrieval pulls only the most relevant docs, preventing the overload that plagues RAG assistants.
  • Deployments show a 15% lift in first‑call resolution, proving proactive co‑piloting beats reactive rules.
  • The study spotlights a shift to “context‑aware co‑piloting,” where AI shapes outcomes in real time.
  • For deeper dives, read “Generative AI at scale: The productivity effects of real‑world agent assist” and EMNLP 2023 Industry Track.
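The core pattern in this entry is that a detected intent proactively triggers a modular workflow while a rolling context (entities, partial summaries) accumulates across turns. The sketch below illustrates that dispatch pattern with invented intent labels, a hypothetical workflow registry, and stub NLU callables; Minerva CQ's production pipeline is proprietary and certainly more elaborate.

```python
# Hedged sketch: intent-triggered modular workflows over a rolling context.
from typing import Callable, Dict, Optional

WORKFLOWS: Dict[str, Callable[[dict], str]] = {}

def workflow(intent: str):
    """Register a handler for a detected customer intent."""
    def register(fn: Callable[[dict], str]) -> Callable[[dict], str]:
        WORKFLOWS[intent] = fn
        return fn
    return register

@workflow("billing_dispute")
def billing_dispute(context: dict) -> str:
    # In production this step would call contextual retrieval, profile lookup, etc.
    invoice = context.get("entities", {}).get("invoice_id", "<unknown>")
    return f"Suggest reviewing invoice {invoice} with the customer."

def on_utterance(turn: str, context: dict,
                 detect_intent: Callable[[str], str],
                 extract_entities: Callable[[str], dict]) -> Optional[str]:
    """Update the rolling context, then proactively trigger the matching workflow."""
    context.setdefault("summary", []).append(turn)                     # partial conversational summary
    context.setdefault("entities", {}).update(extract_entities(turn))  # entity recognition
    handler = WORKFLOWS.get(detect_intent(turn))                       # intent detection
    return handler(context) if handler else None

# Illustrative usage with stub models standing in for real NLU components.
ctx: dict = {}
hint = on_utterance("I was charged twice on invoice 991", ctx,
                    detect_intent=lambda t: "billing_dispute",
                    extract_entities=lambda t: {"invoice_id": "991"})
```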
Reinforcement Learning
Karlsruhe Institute of Technology
Abstract
Reinforcement learning agents in complex game environments often suffer from sparse rewards, training instability, and poor sample efficiency. This paper presents a hybrid training approach that combines offline imitation learning with online reinforcement learning for a 2D shooter game agent. We implement a multi-head neural network with separate outputs for behavioral cloning and Q-learning, unified by shared feature extraction layers with attention mechanisms. Initial experiments using pure deep Q-Networks exhibited significant instability, with agents frequently reverting to poor policies despite occasional good performance. To address this, we developed a hybrid methodology that begins with behavioral cloning on demonstration data from rule-based agents, then transitions to reinforcement learning. Our hybrid approach achieves consistently above 70% win rate against rule-based opponents, substantially outperforming pure reinforcement learning methods which showed high variance and frequent performance degradation. The multi-head architecture enables effective knowledge transfer between learning modes while maintaining training stability. Results demonstrate that combining demonstration-based initialization with reinforcement learning optimization provides a robust solution for developing game AI agents in complex multi-agent environments where pure exploration proves insufficient.
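The abstract's multi-head design, shared feature extraction with attention feeding a behavioral-cloning head and a Q-learning head, can be sketched compactly. The class name, layer sizes, feature-wise attention, and loss weighting below are illustrative assumptions; in the actual method the Q-learning loss would be computed on online transitions rather than the demonstration batch shown here, and the bc_weight would be annealed when switching from imitation to RL.

```python
# Hedged sketch: shared encoder with attention, plus BC and Q-learning heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridAgentNet(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        # Simple feature-wise attention over the shared representation.
        self.attn = nn.Sequential(nn.Linear(hidden, hidden), nn.Sigmoid())
        self.bc_head = nn.Linear(hidden, n_actions)  # action logits for imitation
        self.q_head = nn.Linear(hidden, n_actions)   # Q-values for RL fine-tuning

    def forward(self, obs: torch.Tensor):
        h = self.encoder(obs)
        h = h * self.attn(h)
        return self.bc_head(h), self.q_head(h)

def hybrid_losses(net, obs, actions, q_targets, bc_weight: float = 1.0):
    """Phase 1 trains mostly the BC loss on demonstrations; phase 2 anneals
    bc_weight toward 0 and trains the Q head on environment interaction."""
    logits, q_values = net(obs)
    bc_loss = F.cross_entropy(logits, actions)
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    q_loss = F.mse_loss(q_taken, q_targets)
    return bc_weight * bc_loss + q_loss

# Illustrative shapes.
net = HybridAgentNet(obs_dim=32, n_actions=6)
obs = torch.randn(16, 32)
actions = torch.randint(0, 6, (16,))
targets = torch.randn(16)
loss = hybrid_losses(net, obs, actions, targets)
```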
Qiyuan Lab, Beijing, China
Abstract
Reinforcement learning agent development traditionally requires extensive expertise and lengthy iterations, often resulting in high failure rates and limited accessibility. This paper introduces Agent², a novel agent-generates-agent framework that achieves fully automated RL agent design through intelligent LLM-driven generation. The system autonomously transforms natural language task descriptions and environment code into comprehensive, high-performance reinforcement learning solutions without human intervention. Agent² features a revolutionary dual-agent architecture. The Generator Agent serves as an autonomous AI designer that analyzes tasks and generates executable RL agents, while the Target Agent is the resulting automatically generated RL agent. The framework decomposes RL development into two distinct stages: MDP modeling and algorithmic optimization, enabling more targeted and effective agent generation. Built on the Model Context Protocol, Agent² provides a unified framework that standardizes intelligent agent creation across diverse environments and algorithms, while incorporating adaptive training management and intelligent feedback analysis for continuous improvement. Extensive experiments on a wide range of benchmarks, including MuJoCo, MetaDrive, MPE, and SMAC, demonstrate that Agent² consistently outperforms manually designed solutions across all tasks, achieving up to 55% performance improvement and substantial gains on average. By enabling truly end-to-end, closed-loop automation, this work establishes a new paradigm in which intelligent agents design and optimize other agents, marking a fundamental breakthrough for automated AI systems.
AI Insights
  • Dual‑agent design splits RL into MDP modeling and algorithmic optimization for targeted synthesis.
  • Agent² uses the Model Context Protocol to standardize environment‑to‑policy translation across simulators.
  • Adaptive training and real‑time feedback let the Generator Agent refine policies autonomously.
  • Benchmarks (MuJoCo, MetaDrive, MPE, SMAC) show up to 55% gains over hand‑crafted baselines.
  • Performance hinges on LLM fidelity; language errors can produce suboptimal policies.
  • High computational cost limits use to offline or high‑performance clusters, not edge devices.
  • Read “Survey on LLM‑Enhanced RL” for a taxonomy and “Tianshou” for a modular RL library.
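The abstract implies a closed loop in which the Generator Agent first models the MDP, then selects an algorithm, trains the resulting Target Agent, and refines the design from feedback. The sketch below captures that loop in schematic form; the prompts, config handling, and the llm/train_and_eval interfaces are assumptions for illustration, and the real system builds on the Model Context Protocol rather than raw strings.

```python
# Hedged sketch of a generate -> train -> analyze -> refine loop.
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class AgentSpec:
    mdp_design: str    # stage 1: observation/action/reward modeling
    algo_config: str   # stage 2: algorithm choice and hyperparameters (kept as text here)

def generate_agent(task: str, env_code: str,
                   llm: Callable[[str], str], feedback: str = "") -> AgentSpec:
    """Generator Agent: design the MDP first, then the algorithm, conditioned on feedback."""
    mdp = llm(f"Design the MDP (observations, actions, reward) for:\n{task}\n"
              f"Environment code:\n{env_code}\nPrevious feedback:\n{feedback}")
    algo = llm(f"Given this MDP design:\n{mdp}\n"
               f"Propose an RL algorithm and hyperparameters.")
    return AgentSpec(mdp_design=mdp, algo_config=algo)

def closed_loop(task: str, env_code: str,
                llm: Callable[[str], str],
                train_and_eval: Callable[[AgentSpec], Tuple[float, str]],
                rounds: int = 3) -> AgentSpec:
    """Generate a Target Agent, train and evaluate it, and feed diagnostics back."""
    feedback, best_spec, best_score = "", None, float("-inf")
    for _ in range(rounds):
        spec = generate_agent(task, env_code, llm, feedback)
        score, report = train_and_eval(spec)   # e.g., mean episodic return + diagnostics
        if score > best_score:
            best_spec, best_score = spec, score
        feedback = f"Last score: {score:.2f}. Diagnostics: {report}"
    return best_spec
```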