LLMs for AI Agents

Where LLM Agents Fail and How They can Learn From Failures

University of Illinois at

Rate this image: 😍 👍 👎

Abstract
Large Language Model (LLM) agents, which integrate planning, memory, reflection, and tool-use modules, have shown promise in solving complex, multi-step tasks. Yet their sophisticated architectures amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions, leading to task failure. Current systems lack a framework that can comprehensively understand agent error in a modular and systemic way, and therefore fail to detect these errors accordingly. We address this gap with three contributions. First, we introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations. Second, we construct AgentErrorBench, the first dataset of systematically annotated failure trajectories from ALFWorld, GAIA, and WebShop, grounding error analysis in real-world agent rollouts. Third, we propose AgentDebug, a debugging framework that isolates root-cause failures and provides corrective feedback, enabling agents to recover and iteratively improve. Experiments on AgentErrorBench show that AgentDebug achieves 24% higher all-correct accuracy and 17% higher step accuracy compared to the strongest baseline. Beyond detection, the targeted feedback generated by AgentDebug enables LLM agents to iteratively recover from failures, yielding up to 26% relative improvements in task success across ALFWorld, GAIA, and WebShop. These results establish principled debugging as a pathway to more reliable and adaptive LLM agents. The code and data will be available at https://github.com/ulab-uiuc/AgentDebug

👍 👎 ♥ Save

LLM-Based Data Science Agents: A Survey of Capabilities, Challenges, and Future Directions

Abstract
Recent advances in large language models (LLMs) have enabled a new class of AI agents that automate multiple stages of the data science workflow by integrating planning, tool use, and multimodal reasoning across text, code, tables, and visuals. This survey presents the first comprehensive, lifecycle-aligned taxonomy of data science agents, systematically analyzing and mapping forty-five systems onto the six stages of the end-to-end data science process: business understanding and data acquisition, exploratory analysis and visualization, feature engineering, model building and selection, interpretation and explanation, and deployment and monitoring. In addition to lifecycle coverage, we annotate each agent along five cross-cutting design dimensions: reasoning and planning style, modality integration, tool orchestration depth, learning and alignment methods, and trust, safety, and governance mechanisms. Beyond classification, we provide a critical synthesis of agent capabilities, highlight strengths and limitations at each stage, and review emerging benchmarks and evaluation practices. Our analysis identifies three key trends: most systems emphasize exploratory analysis, visualization, and modeling while neglecting business understanding, deployment, and monitoring; multimodal reasoning and tool orchestration remain unresolved challenges; and over 90% lack explicit trust and safety mechanisms. We conclude by outlining open challenges in alignment stability, explainability, governance, and robust evaluation frameworks, and propose future research directions to guide the development of robust, trustworthy, low-latency, transparent, and broadly accessible data science agents.

AI Agents

👍 👎 ♥ Save

Agentic Services Computing

Zhejiang University, The

Abstract
The rise of LLM-powered agents is driving a fundamental transformation in services computing: from static, request-response functions to dynamic, goal-oriented, and autonomous multi-agent ecosystems. In response to this shift, we introduce Agentic Service Computing (ASC), a new paradigm that reimagines services as intelligent, self-adaptive, and socially embedded entities. This comprehensive survey presents a lifecycle-driven framework for ASC, structured around four core phases: Design, Deployment, Operation, and Evolution. We systematically analyze ASC through four foundational research dimensions: (1) Perception, Context, and Environment Modeling, (2) Autonomous Decision-Making and Task Execution, (3) Multi-Agent Collaboration and Organization, and (4) Evaluation, Value Alignment, and Trustworthiness. We examine how these dimensions are instantiated, integrated, and continuously adapted across the service lifecycle. Our synthesis reveals that agentic services are not merely assembled but orchestrated: contextual awareness enables robust deployment; autonomous reasoning supports real-time operation; collaborative structures emerge and evolve through interaction; and trustworthiness must be upheld as a cross-cutting, lifelong imperative. We further identify and discuss emerging trends shaping the future of ASC. By integrating classical principles of services computing with advances in LLM-based multi-agent systems, this work establishes a holistic and forward-looking foundation for ASC. It provides a unified reference for researchers and practitioners aiming to develop adaptive, accountable, and human-centered intelligent services.

AI Insights

Federated learning enables privacy‑preserving on‑device updates for agentic services.
Formal verification can guarantee safety of autonomous decision modules in multi‑agent ecosystems.
Dynamic resource schedulers adapt to workload shifts, preserving QoS in agentic clusters.
OpenAPI extensions for agentic interactions standardize cross‑domain collaboration.
Benchmarks that score explainability, latency, and trust guide agentic framework comparison.
Human‑in‑the‑loop UIs let users steer agentic goals while preserving autonomy.
Edge‑centric deployments cut latency and boost resilience for distributed agentic services.

👍 👎 ♥ Save

Beyond the Algorithm: A Field Guide to Deploying AI Agents in Clinical Practice

Mass General Brigham, MIT

Abstract
Large language models (LLMs) integrated into agent-driven workflows hold immense promise for healthcare, yet a significant gap exists between their potential and practical implementation within clinical settings. To address this, we present a practitioner-oriented field manual for deploying generative agents that use electronic health record (EHR) data. This guide is informed by our experience deploying the "irAE-Agent", an automated system to detect immune-related adverse events from clinical notes at Mass General Brigham, and by structured interviews with 20 clinicians, engineers, and informatics leaders involved in the project. Our analysis reveals a critical misalignment in clinical AI development: less than 20% of our effort was dedicated to prompt engineering and model development, while over 80% was consumed by the sociotechnical work of implementation. We distill this effort into five "heavy lifts": data integration, model validation, ensuring economic value, managing system drift, and governance. By providing actionable solutions for each of these challenges, this field manual shifts the focus from algorithmic development to the essential infrastructure and implementation work required to bridge the "valley of death" and successfully translate generative AI from pilot projects into routine clinical care.

AI and Society

👍 👎 ♥ Save

Signaling in the Age of AI: Evidence from Cover Letters

Yale Department of Econom

Rate this image: 😍 👍 👎

Abstract
We study how generative AI affects labor market signaling using the introduction of an AI-powered cover letter writing tool on Freelancer.com. Our data track both access to the tool and usage at the application level. Difference-in-differences estimates show that access to the AI tool increased textual alignment between cover letters and job posts--which we refer to as cover letter tailoring--and raised callback likelihoods. Workers with weaker pre-AI writing skills saw larger improvements in cover letters, indicating that AI substitutes for workers' own skills. Although only a minority of applications used the tool, the overall correlation between cover letter tailoring and callbacks fell by 51%, implying that cover letters became less informative signals of worker ability in the age of AI. Employers correspondingly shifted toward alternative signals, such as workers' past reviews, which became more predictive of hiring. Finally, within the treated group, greater time spent editing AI drafts was associated with higher hiring success.

AI Insights

A regression discontinuity design around the platform’s eligibility cutoff isolates the AI tool’s effect.
The screening model shows AI tailoring substitutes weaker writers’ effort, reshaping employer weighting.
Hiring probability rises 3.5 percentage points, especially for high‑skill candidates.
AI cuts editing time by 2.5 minutes per application, yet more editing boosts success.
Post‑AI, tailoring‑callback correlation drops 51%, shifting focus to past review scores.
Findings suggest policy incentives for AI adoption but warn that over‑reliance may erode narrative value.
Read “The Impact of Artificial Intelligence on Society” and the Journal of Labor Economics’ AI special issue for deeper context.

👍 👎 ♥ Save

Know Thyself? On the Incapability and Implications of AI Self-Recognition

Abstract
Self-recognition is a crucial metacognitive capability for AI systems, relevant not only for psychological analysis but also for safety, particularly in evaluative scenarios. Motivated by contradictory interpretations of whether models possess self-recognition (Panickssery et al., 2024; Davidson et al., 2024), we introduce a systematic evaluation framework that can be easily applied and updated. Specifically, we measure how well 10 contemporary larger language models (LLMs) can identify their own generated text versus text from other models through two tasks: binary self-recognition and exact model prediction. Different from prior claims, our results reveal a consistent failure in self-recognition. Only 4 out of 10 models predict themselves as generators, and the performance is rarely above random chance. Additionally, models exhibit a strong bias toward predicting GPT and Claude families. We also provide the first evaluation of model awareness of their own and others' existence, as well as the reasoning behind their choices in self-recognition. We find that the model demonstrates some knowledge of its own existence and other models, but their reasoning reveals a hierarchical bias. They appear to assume that GPT, Claude, and occasionally Gemini are the top-tier models, often associating high-quality text with them. We conclude by discussing the implications of our findings on AI safety and future directions to develop appropriate AI self-awareness.

Research Automation with AI

👍 👎 ♥ Save

Atlas of Human-AI Interaction (v1): An Interactive Meta-Science Platform for Large-Scale Research Literature Sensemaking

MIT Media Lab, USA

Rate this image: 😍 👍 👎

Abstract
Human-AI interaction researchers face an overwhelming challenge: synthesizing insights from thousands of empirical studies to understand how AI impacts people and inform effective design. Existing approach for literature reviews cluster papers by similarities, keywords or citations, missing the crucial cause-and-effect relationships that reveal how design decisions impact user outcomes. We introduce the Atlas of Human-AI Interaction, an interactive web interface that provides the first systematic mapping of empirical findings across 1,000+ HCI papers using LLM-powered knowledge extraction. Our approach identifies causal relationships, and visualizes them through an AI-enabled interactive web interface as a navigable knowledge graph. We extracted 2,037 empirical findings, revealing research topic clusters, common themes, and disconnected areas. Expert evaluation with 20 researchers revealed the system's effectiveness for discovering research gaps. This work demonstrates how AI can transform literature synthesis itself, offering a scalable framework for evidence-based design, opening new possibilities for computational meta-science across HCI and beyond.

👍 👎 ♥ Save

On-Premise AI for the Newsroom: Evaluating Small Language Models for Investigative Document Search

Northwestern University

Abstract
Investigative journalists routinely confront large document collections. Large language models (LLMs) with retrieval-augmented generation (RAG) capabilities promise to accelerate the process of document discovery, but newsroom adoption remains limited due to hallucination risks, verification burden, and data privacy concerns. We present a journalist-centered approach to LLM-powered document search that prioritizes transparency and editorial control through a five-stage pipeline -- corpus summarization, search planning, parallel thread execution, quality evaluation, and synthesis -- using small, locally-deployable language models that preserve data security and maintain complete auditability through explicit citation chains. Evaluating three quantized models (Gemma 3 12B, Qwen 3 14B, and GPT-OSS 20B) on two corpora, we find substantial variation in reliability. All models achieved high citation validity and ran effectively on standard desktop hardware (e.g., 24 GB of memory), demonstrating feasibility for resource-constrained newsrooms. However, systematic challenges emerged, including error propagation through multi-stage synthesis and dramatic performance variation based on training data overlap with corpus content. These findings suggest that effective newsroom AI deployment requires careful model selection and system design, alongside human oversight for maintaining standards of accuracy and accountability.

AGI: Artificial General Intelligence

👍 👎 ♥ Save

Improving AGI Evaluation: A Data Science Perspective

Pingla Institute, Sydney

Rate this image: 😍 👍 👎

Abstract
Evaluation of potential AGI systems and methods is difficult due to the breadth of the engineering goal. We have no methods for perfect evaluation of the end state, and instead measure performance on small tests designed to provide directional indication that we are approaching AGI. In this work we argue that AGI evaluation methods have been dominated by a design philosophy that uses our intuitions of what intelligence is to create synthetic tasks, that have performed poorly in the history of AI. Instead we argue for an alternative design philosophy focused on evaluating robust task execution that seeks to demonstrate AGI through competence. This perspective is developed from common practices in data science that are used to show that a system can be reliably deployed. We provide practical examples of what this would mean for AGI evaluation.

AI Insights

Out‑of‑time cross‑validation is proposed to guard against dataset shift in AGI benchmarks.
Leave‑one‑out distinguishability tests whether a model truly generalises beyond memorised samples.
Cluster‑data splits expose hidden structure that can inflate performance estimates.
The paper argues that robust task execution, not synthetic intuition‑driven tests, should define AGI competence.
It stresses a multidisciplinary panel—data scientists, AI theorists, philosophers—to audit evaluation protocols.
Practical examples illustrate how to deploy these data‑science checks in real‑world AGI trials.
The authors warn that over‑fitting and memorisation can masquerade as intelligence if cross‑validation is ignored.

👍 👎 ♥ Save

Scaling Generalist Data-Analytic Agents

Zhejiang University,AlibA

Abstract
Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering over proprietary models, while open-source models struggle to face diverse-format, large-scale data files and long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces DataMind, a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents, including insufficient data resources, improper training strategy, and unstable code-based multi-turn rollout. Concretely, DataMind applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective combining both SFT and RL losses; 4) a memory-frugal and stable code-based multi-turn rollout framework. Built on DataMind, we curate DataMind-12K, a high-quality trajectory set spanning diverse domains, task categories, and data file formats for data-analytic tasks. Trained on DataMind-12K, our DataMind-14B achieves state-of-the-art with an average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines DeepSeek-V3.1 and GPT-5. Our DataMind-7B also performs best among all open-source models with a score of 68.10%. We also incorporate some empirical insights gained from our exploratory trials into the analysis experiments, aiming to provide actionable insights about agentic training for the community. We will release DataMind-12K and DataMind-7B,14B for the community's future research.

AI Insights

The paper reveals a modular prompt framework that orchestrates judge and trajectory sampling models for systematic evaluation of data‑analytic agents.
It introduces a hierarchical tagging scheme that encodes output formatting, enabling automated parsing of multi‑turn code rollouts.
The evaluation protocol explicitly separates question generation, answer generation, and answer assessment, mirroring a formal peer‑review workflow.
A key insight is that the prompts embed domain‑specific constraints, allowing the agent to adapt to diverse data‑analysis tasks without manual re‑engineering.
The authors note that the formal language of the prompts may hinder accessibility, suggesting future work on user‑friendly interfaces.
The lack of contextual background in the prompts highlights a gap that could be bridged by integrating explanatory metadata into the prompt design.
The framework’s reliance on pre‑defined evaluation tags points to an opportunity for automated metric extraction and continuous learning.

Deep Learning

👍 👎 ♥ Save

Mathematical Modeling and Convergence Analysis of Deep Neural Networks with Dense Layer Connectivities in Deep Learning

Nankai University, Tianjn

Abstract
In deep learning, dense layer connectivity has become a key design principle in deep neural networks (DNNs), enabling efficient information flow and strong performance across a range of applications. In this work, we model densely connected DNNs mathematically and analyze their learning problems in the deep-layer limit. For a broad applicability, we present our analysis in a framework setting of DNNs with densely connected layers and general non-local feature transformations (with local feature transformations as special cases) within layers, which is called dense non-local (DNL) framework and includes standard DenseNets and variants as special examples. In this formulation, the densely connected networks are modeled as nonlinear integral equations, in contrast to the ordinary differential equation viewpoint commonly adopted in prior works. We study the associated training problems from an optimal control perspective and prove convergence results from the network learning problem to its continuous-time counterpart. In particular, we show the convergence of optimal values and the subsequence convergence of minimizers, using a piecewise linear extension and $\Gamma$-convergence analysis. Our results provide a mathematical foundation for understanding densely connected DNNs and further suggest that such architectures can offer stability of training deep models.

AI Insights

Forward‑backward‑splitting networks converge in the deep‑layer limit, revealing new stability insights.
Learned primal‑dual schemes are modeled as dynamical systems with a linear operator K, enabling Lyapunov‑style analysis.
A piecewise‑linear extension links discrete layers to continuous time, yielding a Γ‑convergence proof of optimal values.
The framework includes DenseNet variants and non‑local feature transforms, suggesting unexplored hybrid architectures.
Brunner’s “Volterra Integral Equations” and Braides’ “Γ‑convergence for Beginners” are key resources for the theory.
The work builds on Haber, Lu, and Ruthotto’s PDE‑inspired DNN research, situating it in physics‑informed deep learning.
Though mathematically dense, the paper encourages experimenting with forward‑backward‑splitting layers for future stability gains.

👍 👎 ♥ Save

A Generalized Information Bottleneck Theory of Deep Learning

University College London

Abstract
The Information Bottleneck (IB) principle offers a compelling theoretical framework to understand how neural networks (NNs) learn. However, its practical utility has been constrained by unresolved theoretical ambiguities and significant challenges in accurate estimation. In this paper, we present a \textit{Generalized Information Bottleneck (GIB)} framework that reformulates the original IB principle through the lens of synergy, i.e., the information obtainable only through joint processing of features. We provide theoretical and empirical evidence demonstrating that synergistic functions achieve superior generalization compared to their non-synergistic counterparts. Building on these foundations we re-formulate the IB using a computable definition of synergy based on the average interaction information (II) of each feature with those remaining. We demonstrate that the original IB objective is upper bounded by our GIB in the case of perfect estimation, ensuring compatibility with existing IB theory while addressing its limitations. Our experimental results demonstrate that GIB consistently exhibits compression phases across a wide range of architectures (including those with \textit{ReLU} activations where the standard IB fails), while yielding interpretable dynamics in both CNNs and Transformers and aligning more closely with our understanding of adversarial robustness.

Help us improve your experience!