Hi j34nc4rl0+rl_topics,

Here are our personalized paper recommendations for you, sorted by relevance.
Agentic RL
NUS, CUHK, OPPO, NTU
Abstract
Large Language Model (LLM)-based agentic systems, often comprising multiple models, complex tool invocations, and orchestration protocols, substantially outperform monolithic agents. Yet this very sophistication amplifies their fragility, making them more prone to system failure. Pinpointing the specific agent or step responsible for an error within long execution traces defines the task of agentic system failure attribution. Current state-of-the-art reasoning LLMs, however, remain strikingly inadequate for this challenge, with accuracy generally below 10%. To address this gap, we propose AgenTracer, the first automated framework for annotating failed multi-agent trajectories via counterfactual replay and programmed fault injection, producing the curated dataset TracerTraj. Leveraging this resource, we develop AgenTracer-8B, a lightweight failure tracer trained with multi-granular reinforcement learning, capable of efficiently diagnosing errors in verbose multi-agent interactions. On the Who&When benchmark, AgenTracer-8B outperforms giant proprietary LLMs like Gemini-2.5-Pro and Claude-4-Sonnet by up to 18.18%, setting a new standard in LLM agentic failure attribution. More importantly, AgenTracer-8B delivers actionable feedback to off-the-shelf multi-agent systems like MetaGPT and MaAS with 4.8-14.2% performance gains, empowering self-correcting and self-evolving agentic AI.
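As a rough illustration of the annotation recipe described above, here is a minimal Python sketch of labeling error steps via programmed fault injection and a counterfactual replay check. The Step class, the fault model, and all function names are illustrative assumptions, not AgenTracer's actual code.

```python
import copy
import random
from dataclasses import dataclass

@dataclass
class Step:
    agent: str   # which agent produced this step
    action: str  # tool call or message emitted at this step

def inject_fault(trajectory, step_idx):
    """Corrupt one step of an otherwise successful trajectory (illustrative fault model)."""
    faulty = copy.deepcopy(trajectory)
    faulty[step_idx].action = "<corrupted> " + faulty[step_idx].action
    return faulty

def label_by_fault_injection(trajectory, replay_fn, is_success_fn):
    """The injected step becomes the ground-truth error step if replaying the
    corrupted trajectory flips a success into a failure (counterfactual check)."""
    step_idx = random.randrange(len(trajectory))
    faulty = inject_fault(trajectory, step_idx)
    outcome = replay_fn(faulty)          # re-run the multi-agent system on the faulty trajectory
    if not is_success_fn(outcome):
        return {"trajectory": faulty,
                "error_agent": faulty[step_idx].agent,
                "error_step": step_idx}
    return None                          # the fault was benign; resample another step
```

Examples labeled this way can then supervise a lightweight tracer model such as AgenTracer-8B.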
AI Insights
  • AgenTracer’s curated 2,170‑trajectory set spans coding, math reasoning, and general agentic tasks, offering a rare multi‑domain benchmark for failure attribution.
  • The dataset includes 1,288 error‑step annotations (TracerTraj‑2.5K), enabling fine‑grained analysis of where agents go wrong.
  • Two specialized prompts—Analyzer Agent and Attack Expert—guide models to pinpoint critical errors and counterfactual failure points.
  • Benchmarks such as MBPP+, KodCode, Blackjack, GSM8K, GAIA, and MetaGPT provide diverse testbeds for evaluating attribution accuracy.
  • AgenTracer‑8B’s reinforcement‑learning training yields 18% higher accuracy than Gemini‑2.5‑Pro and Claude‑4‑Sonnet on the Who&When benchmark.
  • The framework’s counterfactual replay and fault‑injection techniques can be adapted to any multi‑agent system, boosting performance by up to 14% on MetaGPT.
  • For deeper dives, consult “Multi‑Agent Systems: A Modern Approach” and the original AgenTracer paper, which detail the dataset construction and RL training pipeline.
September 03, 2025
Carnegie Mellon University
Abstract
Large Language Model (LLM) agents are increasingly deployed for complex, multi-step software engineering (SWE) tasks. However, their trajectories often contain costly inefficiencies, such as redundant exploration, looping, and failure to terminate once a solution is reached. Prior work has largely treated these errors in a post-hoc manner, diagnosing failures only after execution. In this paper, we introduce SWE-PRM, an inference-time Process Reward Model (PRM) that intervenes during execution to detect and course-correct trajectory-level errors. Our PRM design leverages a taxonomy of common inefficiencies and delivers lightweight, interpretable feedback without modifying the underlying policy. On SWE-bench Verified, closed-source PRMs improve resolution from 40.0% to 50.6% (+10.6 p.p.), with the largest gains on medium and hard tasks. Among feedback strategies, taxonomy-guided PRMs outperform unguided or explicit action-prescriptive variants, increasing success rate while reducing trajectory length. These benefits come at an acceptable added inference cost of as low as $0.2, making PRMs a practical and scalable mechanism for improving SWE agents' reliability and efficiency.
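To make the idea concrete, below is a minimal sketch of an inference-time PRM loop that checks the growing trajectory against an inefficiency taxonomy and appends corrective feedback without modifying the policy. The agent/prm interfaces and the taxonomy labels are assumptions loosely based on the abstract, not SWE-PRM's implementation.

```python
# Hypothetical taxonomy of trajectory-level inefficiencies, following the abstract's examples.
TAXONOMY = ["redundant_exploration", "looping", "failure_to_terminate"]

def run_with_prm(agent, prm, task, max_steps=50):
    """Inference-time loop: the PRM inspects the trajectory after every step and,
    when it flags an inefficiency, its feedback is appended to the agent's context.
    The underlying agent policy itself is never modified."""
    trajectory = []
    for _ in range(max_steps):
        step = agent.act(task, trajectory)          # next tool call / edit (assumed interface)
        trajectory.append(step)
        verdict = prm.judge(trajectory, TAXONOMY)   # e.g. {"flag": "looping", "advice": "..."}
        if verdict.get("flag"):
            trajectory.append({"role": "prm_feedback", "content": verdict["advice"]})
        if step.get("done"):
            break
    return trajectory
```

The design choice highlighted by the paper is that taxonomy-guided feedback outperforms both unguided critiques and explicit action prescriptions.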
AI Insights
  • CLAUDE‑Sonnet consistently tops other models in the benchmark table, achieving higher accuracy and lower error rates across most tasks.
  • Integrating CLAUDE‑Sonnet into SWE‑PRM yields a noticeable performance boost, especially on medium‑to‑hard problems.
  • The benchmark table underscores that model choice matters: each architecture shines on specific tasks, revealing no one‑size‑fits‑all.
  • Literature reviews emphasize that pre‑trained language models remain the backbone of modern NLP, driving advances in sequence‑to‑sequence tasks.
  • CLAUDE‑Sonnet’s success in machine translation and text classification demonstrates its versatility beyond software‑engineering prompts.
  • The study’s comparison is limited to a handful of models, so results may not generalize to other architectures or datasets.
  • A key takeaway is that lightweight, taxonomy‑guided PRMs can correct agent drift without altering the underlying policy, keeping inference costs minimal.
September 02, 2025
Reinforcement Learning
Stanford University
Abstract
Existing agents for solving tasks such as ML engineering rely on prompting powerful language models. As a result, these agents do not improve with more experience. In this paper, we show that agents backed by weaker models that improve via reinforcement learning (RL) can outperform agents backed by much larger, but static models. We identify two major challenges with RL in this setting. First, actions can take a variable amount of time (e.g., executing code for different solutions), which leads to asynchronous policy gradient updates that favor faster but suboptimal solutions. To tackle variable-duration actions, we propose duration-aware gradient updates in a distributed asynchronous RL framework to amplify high-cost but high-reward actions. Second, using only test split performance as a reward provides limited feedback. A program that is nearly correct is treated the same as one that fails entirely. To address this, we propose environment instrumentation to offer partial credit, distinguishing almost-correct programs from those that fail early (e.g., during data loading). Environment instrumentation uses a separate static language model to insert print statements into an existing program to log the agent's experimental progress, from which partial credit can be extracted as reward signals for learning. Our experimental results on MLEBench suggest that performing gradient updates on a much smaller model (Qwen2.5-3B) trained with RL outperforms prompting a much larger model (Claude-3.5-Sonnet) with agent scaffolds, by an average of 22% across 12 Kaggle tasks.
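A minimal sketch of the two ideas, assuming a REINFORCE-style objective and hypothetical milestone markers in the instrumented logs; this is a guess at the spirit of duration-aware updates and partial credit, not the paper's exact formulation.

```python
import re
import torch

def duration_aware_pg_loss(log_probs, rewards, durations, baseline=0.0):
    """REINFORCE-style loss where each action's advantage is scaled by its relative
    wall-clock duration, so slow-but-high-reward actions are not drowned out by
    fast, mediocre ones (illustrative weighting scheme)."""
    log_probs = torch.stack(log_probs)                           # (T,) per-action log-probs
    weights = torch.tensor(durations) / max(sum(durations), 1e-8)
    advantages = torch.tensor(rewards) - baseline
    return -(weights * advantages * log_probs).sum()

def partial_credit_from_logs(log_text):
    """Extract partial credit from print-statement instrumentation: count how many
    pipeline milestones (hypothetical markers) the program reached before failing."""
    milestones = ["DATA_LOADED", "MODEL_TRAINED", "PREDICTIONS_WRITTEN"]
    reached = sum(1 for m in milestones if re.search(m, log_text))
    return reached / len(milestones)
```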
AI Insights
  • The agent achieved a 0.66 score in 115 s on the random‑acts‑of‑pizza task, showing a fast‑but‑effective policy.
  • On the learning‑agency‑lab‑automated‑essay‑scoring‑2 task it reached 0.73 in 281 s, proving it can trade speed for higher reward.
  • Print‑statement instrumentation lets the agent log intermediate states, turning opaque code into a readable trace.
  • Partial‑credit rewards differentiate near‑correct programs from early failures, sharpening the learning signal.
  • The RL‑trained Qwen2.5‑3B outperforms prompted Claude‑3.5‑Sonnet by 22% on average across 12 Kaggle benchmarks.
  • The agent’s solutions generalize across tasks, adapting to varying cost constraints without manual tuning.
  • These results illustrate that a lightweight model, when guided by duration‑aware gradients and fine‑grained rewards, can surpass larger static baselines.
September 01, 2025
Chinese University of Hong Kong
Abstract
In this note, we reflect on several fundamental connections among widely used post-training techniques. We clarify some intimate connections and equivalences between reinforcement learning with human feedback, reinforcement learning with internal feedback, and test-time scaling (particularly soft best-of-$N$ sampling), while also illuminating intrinsic links between diffusion guidance and test-time scaling. Additionally, we introduce a resampling approach for alignment and reward-directed diffusion models, sidestepping the need for explicit reinforcement learning techniques.
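For reference, soft best-of-$N$ sampling can be sketched in a few lines: draw N candidates from the base model and resample one with probability proportional to exp(reward/temperature), which recovers hard best-of-N as the temperature goes to zero. The sketch below assumes generic sample_fn and reward_fn callables and is not tied to the note's notation.

```python
import math
import random

def soft_best_of_n(sample_fn, reward_fn, n=8, temperature=1.0):
    """Soft best-of-N: draw n candidates from the base model, then resample one with
    probability proportional to exp(reward / temperature). As temperature -> 0 this
    recovers ordinary (hard) best-of-N; larger temperatures stay closer to the base
    distribution, which is what links it to KL-regularized alignment objectives."""
    candidates = [sample_fn() for _ in range(n)]
    rewards = [reward_fn(c) for c in candidates]
    m = max(rewards)                                   # subtract max for numerical stability
    weights = [math.exp((r - m) / temperature) for r in rewards]
    return random.choices(candidates, weights=weights, k=1)[0]
```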
AI Insights
  • Soft best‑of‑N sampling in test‑time scaling is equivalent to importance‑weighted RL, uniting exploration and calibration.
  • Diffusion guidance is reframed as test‑time scaling, linking score‑based generation with reward shaping.
  • A resampling scheme for reward‑directed diffusion removes explicit Q‑learning while keeping alignment guarantees.
  • Closed‑form estimators for classifier probabilities stabilize test‑time scaling without extra data.
  • Classifier‑free guidance is cast as a predictor‑corrector fitting neatly into the framework.
  • The paper references recent RLHF surveys and soft‑best‑of‑N work, anchoring its ideas in current research.
  • Experiments show the resampling method matches diffusion baselines, suggesting easy deployment.
September 04, 2025
Deep Learning for Reinforcement Learning
National University of Singapore
Abstract
Lane change decision-making for autonomous vehicles is a complex but high-reward behavior. In this paper, we propose a hybrid-input deep reinforcement learning (DRL) algorithm, which realizes abstract lane change decisions and lane change actions for autonomous vehicles within traffic flow. Firstly, a surrounding-vehicle trajectory prediction method is proposed to reduce the risk that the future behavior of surrounding vehicles poses to the ego vehicle, and the prediction results are fed into the reinforcement learning model as additional information. Secondly, to comprehensively leverage environmental information, the model extracts features from high-dimensional images and low-dimensional sensor data simultaneously. The fusion of surrounding-vehicle trajectory predictions and multi-modal information is used as the state space of the reinforcement learning model to improve the rationality of lane change decisions. Finally, we integrate reinforcement learning macro decisions with end-to-end vehicle control to achieve a holistic lane change process. Experiments were conducted within the CARLA simulator, and the results demonstrate that the utilization of a hybrid state space significantly enhances the safety of vehicle lane change decisions.
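A minimal PyTorch sketch of the hybrid state encoding, assuming a camera image, a low-dimensional sensor vector, and flattened predicted neighbor trajectories as inputs; the layer sizes and fusion scheme are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HybridStateEncoder(nn.Module):
    """Fuses a camera image, low-dimensional sensor readings, and predicted neighbor
    trajectories into one state vector for a DRL policy (illustrative sizes)."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(                      # image branch, e.g. 3x84x84 input
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fuse = nn.LazyLinear(out_dim)             # infers the concatenated input size

    def forward(self, image, sensors, pred_traj):
        img_feat = self.cnn(image)                     # (B, F_img)
        x = torch.cat([img_feat, sensors, pred_traj], dim=1)
        return torch.relu(self.fuse(x))                # fused state for the policy head
```

The fused embedding would then feed the lane-change policy, with the predicted trajectories also informing a proximity penalty in the reward.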
AI Insights
  • Ablation study shows removing trajectory predictions lowers lane‑change safety by 18 %.
  • Multi‑head attention aligns image features with sensor data, focusing on road markings.
  • Reward penalizes proximity to predicted trajectories, promoting conservative yet efficient maneuvers.
  • Dense traffic tests cut collision risk by 12 % versus rule‑based baselines.
  • Model runs at 30 Hz on an RTX 2080, enabling real‑time deployment.
  • Future work envisions federated learning to personalize lane‑change strategies across fleets.
  • For deeper insight, consult “Motion Planning among Dynamic, Decision‑Making Agents with Deep Reinforcement Learning” and “Lane Change Decision‑Making through Deep Reinforcement Learning with Rule‑Based Constraints.”
September 01, 2025