NUS, CUHK, OPPO, NTU
Abstract
Large Language Model (LLM)-based agentic systems, often comprising multiple
models, complex tool invocations, and orchestration protocols, substantially
outperform monolithic agents. Yet this very sophistication amplifies their
fragility, making them more prone to system failure. Pinpointing the specific
agent or step responsible for an error within long execution traces defines the
task of agentic system failure attribution. Current state-of-the-art reasoning
LLMs, however, remain strikingly inadequate for this challenge, with accuracy
generally below 10%. To address this gap, we propose AgenTracer, the first
automated framework for annotating failed multi-agent trajectories via
counterfactual replay and programmed fault injection, producing the curated
dataset TracerTraj. Leveraging this resource, we develop AgenTracer-8B, a
lightweight failure tracer trained with multi-granular reinforcement learning,
capable of efficiently diagnosing errors in verbose multi-agent interactions.
On the Who&When benchmark, AgenTracer-8B outperforms giant proprietary LLMs
like Gemini-2.5-Pro and Claude-4-Sonnet by up to 18.18%, setting a new standard
in LLM agentic failure attribution. More importantly, AgenTracer-8B delivers
actionable feedback to off-the-shelf multi-agent systems like MetaGPT and MaAS
with 4.8-14.2% performance gains, empowering self-correcting and self-evolving
agentic AI.
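The counterfactual-replay idea is easy to see in miniature: re-run the failed trajectory with a single step corrected and check whether the outcome flips from failure to success; the earliest step whose correction flips the outcome is the decisive error. Below is a minimal, self-contained sketch on a toy additive task; the Step/run/attribute_failure names and the toy task are illustrative assumptions, not AgenTracer's actual pipeline or API.

```python
# Hedged sketch of counterfactual replay for failure attribution.
# All names and the toy task are illustrative, not AgenTracer's API.
from dataclasses import dataclass

@dataclass
class Step:
    agent: str   # which agent emitted this step
    value: int   # the step's (possibly faulty) intermediate result

def run(steps):
    """Toy 'execution': the task's final answer is the sum of step values."""
    return sum(s.value for s in steps)

def attribute_failure(steps, oracle, target):
    """Return (agent, index) of the earliest step whose counterfactual
    correction flips the run from failure to success, else None."""
    for k in range(len(steps)):
        patched = steps[:k] + [Step(steps[k].agent, oracle[k])] + steps[k + 1:]
        if run(patched) == target:
            return steps[k].agent, k
    return None

# Programmed fault injection: start from a correct trace, corrupt step 1.
oracle = [2, 3, 5]                                    # ground-truth step values
trace = [Step("planner", 2), Step("solver", 7), Step("verifier", 5)]
print(attribute_failure(trace, oracle, target=sum(oracle)))  # ('solver', 1)
```

The same loop doubles as programmed fault injection: corrupting one step of a known-good trace yields a failed trajectory whose ground-truth error label is known by construction.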
AI Insights
- AgenTracer’s curated 2,170‑trajectory set spans coding, math reasoning, and general agentic tasks, offering a rare multi‑domain benchmark for failure attribution.
- The dataset includes 1,288 error‑step annotations (TracerTraj‑2.5K), enabling fine‑grained analysis of where agents go wrong.
- Two specialized prompts—Analyzer Agent and Attack Expert—guide models to pinpoint critical errors and counterfactual failure points.
- Benchmarks such as MBPP+, KodCode, Blackjack, GSM8K, GAIA, and MetaGPT provide diverse testbeds for evaluating attribution accuracy.
- AgenTracer‑8B’s multi‑granular reinforcement‑learning training yields up to 18.18% higher accuracy than Gemini‑2.5‑Pro and Claude‑4‑Sonnet on the Who&When benchmark (see the reward sketch after this list).
- The framework’s counterfactual replay and fault‑injection techniques can be adapted to off‑the‑shelf multi‑agent systems, boosting performance by up to 14.2% on MetaGPT.
- For deeper dives, consult “Multi‑Agent Systems: A Modern Approach” and the original AgenTracer paper, which detail the dataset construction and RL training pipeline.
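The "multi‑granular" training signal is not detailed here; one plausible reading is a composite reward that grants coarse credit for naming the right agent and fine credit for the exact step. The sketch below is a hedged guess at such a reward, not AgenTracer‑8B's actual design.

```python
# Hedged sketch of a multi-granular attribution reward: coarse credit for
# the right agent, fine credit for the exact step (an assumption about the
# design, not AgenTracer-8B's actual reward).
def attribution_reward(pred_agent, pred_step, gold_agent, gold_step,
                       w_agent=0.5, w_step=0.5):
    r_agent = float(pred_agent == gold_agent)   # agent-level granularity
    r_step = float(pred_step == gold_step)      # step-level granularity
    return w_agent * r_agent + w_step * r_step

print(attribution_reward("solver", 3, "solver", 4))  # 0.5: right agent, wrong step
```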
Carnegie Mellon University
Abstract
Large Language Model (LLM) agents are increasingly deployed for complex,
multi-step software engineering (SWE) tasks. However, their trajectories often
contain costly inefficiencies, such as redundant exploration, looping, and
failure to terminate once a solution is reached. Prior work has largely treated
these errors in a post-hoc manner, diagnosing failures only after execution. In
this paper, we introduce SWE-PRM, an inference-time Process Reward Model (PRM)
that intervenes during execution to detect and course-correct trajectory-level
errors. Our PRM design leverages a taxonomy of common inefficiencies and
delivers lightweight, interpretable feedback without modifying the underlying
policy. On SWE-bench Verified, closed-source PRMs improve resolution from 40.0%
to 50.6% (+10.6 p.p.), with the largest gains on medium and hard tasks. Among
feedback strategies, taxonomy-guided PRMs outperform unguided or explicit
action-prescriptive variants, increasing success rate while reducing trajectory
length. These benefits come at an added inference cost as low as $0.2, making
PRMs a practical and scalable mechanism for improving SWE agents' reliability
and efficiency.
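To make the intervention mechanism concrete, here is a minimal sketch of an inference-time PRM loop: a detector flags a trajectory-level inefficiency from a small taxonomy and injects interpretable feedback into the agent's next observation, leaving the policy untouched. The TAXONOMY entries, check_trajectory heuristic, and ToyEnv are assumptions for illustration, not SWE-PRM's implementation.

```python
# Hedged sketch of an inference-time PRM intervention loop (names and
# heuristics below are illustrative assumptions, not SWE-PRM's code).
TAXONOMY = {
    "looping": "You are repeating the same action; try a different approach.",
    "redundant_exploration": "You have already inspected this; move on.",
    "missed_termination": "The fix looks complete; run the tests and submit.",
}

def check_trajectory(history):
    """Toy PRM: flag an inefficiency category, or None if on track."""
    if len(history) >= 2 and history[-1] == history[-2]:
        return "looping"
    return None

class ToyEnv:
    """Stand-in environment; a real PRM would wrap a SWE-bench harness."""
    def __init__(self):
        self.done, self.t = False, 0
    def reset(self):
        return "start"
    def step(self, action):
        self.t += 1
        self.done = action == "submit" or self.t >= 10
        return f"observation after {action}"

def run_agent(policy, env, max_steps=10):
    """Agent loop: PRM feedback is appended to the observation only;
    the underlying policy is never modified."""
    history, obs = [], env.reset()
    for _ in range(max_steps):
        action = policy(obs)
        history.append(action)
        obs = env.step(action)
        label = check_trajectory(history)
        if label:                            # course-correct mid-trajectory
            obs += "\n[PRM feedback] " + TAXONOMY[label]
        if env.done:
            break
    return history

# A drifting policy that only submits once it sees PRM feedback.
policy = lambda obs: "submit" if "[PRM feedback]" in obs else "grep repo"
print(run_agent(policy, ToyEnv()))  # ['grep repo', 'grep repo', 'submit']
```

In the toy run, the policy loops on the same action until the PRM's feedback appears in its observation, mirroring how taxonomy-guided feedback is meant to break looping and redundant exploration.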
AI Insights
- Claude‑Sonnet consistently tops other models in the benchmark table, achieving higher accuracy and lower error rates across most tasks.
- Integrating Claude‑Sonnet into SWE‑PRM yields a noticeable performance boost, especially on medium‑to‑hard problems.
- The benchmark table underscores that model choice matters: each architecture shines on specific tasks, revealing that no single model fits all tasks.
- Literature reviews emphasize that pre‑trained language models remain the backbone of modern NLP, driving advances in sequence‑to‑sequence tasks.
- Claude‑Sonnet’s success in machine translation and text classification demonstrates its versatility beyond software‑engineering prompts.
- The study’s comparison is limited to a handful of models, so results may not generalize to other architectures or datasets.
- A key takeaway is that lightweight, taxonomy‑guided PRMs can correct agent drift without altering the underlying policy, keeping inference costs minimal.