Abstract
Reward machines are an established tool for dealing with reinforcement
learning problems in which rewards are sparse and depend on complex sequences
of actions. However, existing algorithms for learning reward machines assume an
overly idealized setting where rewards have to be free of noise. To overcome
this practical limitation, we introduce a novel type of reward machines, called
stochastic reward machines, and an algorithm for learning them. Our algorithm,
based on constraint solving, learns minimal stochastic reward machines from the
explorations of a reinforcement learning agent. This algorithm can easily be
paired with existing reinforcement learning algorithms for reward machines and
is guaranteed to converge to an optimal policy in the limit. We demonstrate the
effectiveness of our algorithm in two case studies and show that it outperforms
both existing methods and a naive approach for handling noisy reward functions.
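To make the notion concrete, the sketch below models a stochastic reward machine as a finite automaton whose event-labeled transitions emit rewards drawn from a distribution (Gaussian here) rather than a fixed value. The class name, the Gaussian noise model, and the example events are illustrative assumptions, not the paper's exact formalization or code.

```python
import random

class StochasticRewardMachine:
    """Minimal illustrative sketch (hypothetical API, not the paper's code):
    finite states, event-labeled transitions, and a reward *distribution*
    (mean, std) per transition instead of a single deterministic reward."""

    def __init__(self, initial_state):
        self.initial_state = initial_state
        self.state = initial_state
        # (state, event) -> (next_state, reward_mean, reward_std)
        self.delta = {}

    def add_transition(self, state, event, next_state, mean, std):
        self.delta[(state, event)] = (next_state, mean, std)

    def step(self, event):
        """Advance on a high-level event and sample a noisy reward."""
        next_state, mu, sigma = self.delta.get(
            (self.state, event), (self.state, 0.0, 0.0))
        self.state = next_state
        return random.gauss(mu, sigma)

    def reset(self):
        self.state = self.initial_state

# Example: a noisy reward of roughly 1 is paid only after observing
# the event "key" followed by the event "door".
m = StochasticRewardMachine("u0")
m.add_transition("u0", "key", "u1", 0.0, 0.05)
m.add_transition("u1", "door", "u2", 1.0, 0.10)
print(m.step("key"), m.step("door"))
```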
Abstract
The curse of dimensionality renders Reinforcement Learning (RL) impractical
in many real-world settings with exponentially large state and action spaces.
Yet, many environments exhibit exploitable structure that can accelerate
learning. To formalize this idea, we study RL in Block Markov Decision
Processes (BMDPs). BMDPs model problems with large observation spaces, but
where transition dynamics are fully determined by latent states. Recent
advances in clustering methods have enabled the efficient recovery of this
latent structure. However, a regret analysis that exploits these techniques to
determine their impact on learning performance has remained open. We address
this gap by providing a regret analysis that explicitly leverages clustering,
demonstrating that accurate latent-state estimation can indeed speed up
learning.
Concretely, this paper analyzes a two-phase RL algorithm for BMDPs that first
learns the latent structure through random exploration and then switches to an
optimism-guided strategy adapted to the uncovered structure. This algorithm
achieves a regret of $O(\sqrt{T}+n)$ on a large class of BMDPs amenable
to clustering. Here, $T$ denotes the number of time steps, $n$ is the
cardinality of the observation space, and the Landau notation $O(\cdot)$ hides
constants and polylogarithmic factors. This improves on the best prior
bound, $O(\sqrt{T}+n^2)$, especially when $n$ is large. Moreover, we prove that
no algorithm can achieve lower regret uniformly on this same class of BMDPs.
This establishes that, on this class, the algorithm achieves asymptotic
optimality.
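As a concrete illustration of the two-phase scheme described above, the following sketch first explores uniformly at random, clusters the collected observations into estimated latent states, and then runs optimistic (UCB-style) value iteration on the estimated latent MDP. The environment interface, the clustering subroutine, the bonus term, and the phase lengths are assumptions made for illustration; the paper's exact procedure and constants may differ.

```python
import numpy as np

def two_phase_bmdp(env, cluster, n_latent, n_actions,
                   explore_steps, n_episodes, horizon):
    """Illustrative sketch (hypothetical interfaces):
    env     -- reset() -> obs, step(a) -> (obs, reward, done)
    cluster -- maps a list of observations to a decoder obs -> latent index,
               standing in for the paper's clustering subroutine."""
    # Phase 1: uniform random exploration to collect observations.
    observations, o = [], env.reset()
    for _ in range(explore_steps):
        o, _, done = env.step(np.random.randint(n_actions))
        observations.append(o)
        if done:
            o = env.reset()
    phi = cluster(observations, n_latent)  # estimated decoding obs -> latent state

    # Phase 2: optimism-guided learning on the estimated latent MDP.
    counts = np.ones((n_latent, n_actions, n_latent))  # smoothed transition counts
    rew = np.zeros((n_latent, n_actions))              # cumulative rewards
    for episode in range(n_episodes):
        p_hat = counts / counts.sum(axis=2, keepdims=True)
        r_hat = rew / counts.sum(axis=2)
        bonus = np.sqrt(np.log(episode + 2) / counts.sum(axis=2))  # UCB-style bonus
        # Finite-horizon optimistic value iteration on the latent model.
        q = np.zeros((horizon + 1, n_latent, n_actions))
        for h in range(horizon - 1, -1, -1):
            q[h] = r_hat + bonus + p_hat @ q[h + 1].max(axis=1)
        o = env.reset()
        for h in range(horizon):
            s = phi(o)
            a = int(np.argmax(q[h, s]))
            o2, r, done = env.step(a)
            counts[s, a, phi(o2)] += 1
            rew[s, a] += r
            if done:
                break
            o = o2
    return counts, rew
```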
AI Insights
- Sharp martingale-difference concentration bounds give high-probability control of value-function gaps.
- Random exploration is quantified by Azuma–Hoeffding inequalities (stated after this list), ensuring accurate latent-state clustering.
- The optimism-guided phase uses a regret decomposition that isolates the effect of clustering errors.
- A novel spectral-norm bound on random matrices arising from observation-to-latent maps underpins the analysis.
- The paper references Boucheron, Lugosi, and Massart's "Concentration Inequalities" and Tropp's matrix tail bounds.
- Definitions: a "concentration inequality" bounds the probability that a random quantity deviates far from its mean; a "martingale difference sequence" is the sequence of increments of a martingale.
- The results show clustering reduces regret from $O(\sqrt{T}+n^2)$ to the optimal $O(\sqrt{T}+n)$, a dramatic speed-up.
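For reference, the Azuma–Hoeffding inequality invoked in the list above is the standard concentration bound for martingales with bounded increments; the form below is the textbook statement, not a result specific to this paper. If $(X_t)_{t \ge 0}$ is a martingale with $|X_t - X_{t-1}| \le c_t$ almost surely, then for every $\varepsilon > 0$,
$$\Pr\bigl(|X_T - X_0| \ge \varepsilon\bigr) \;\le\; 2\exp\!\left(-\frac{\varepsilon^2}{2\sum_{t=1}^{T} c_t^2}\right).$$
Applied to the martingale differences arising during the random-exploration phase, bounds of this form give the high-probability control over latent-state clustering mentioned in the list.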