Synkrasis Labs, Athens
Abstract
Evaluating AI agents that solve real-world tasks through function-call
sequences remains an open challenge. Existing agentic benchmarks often reduce
evaluation to a binary judgment of the final state, overlooking critical
aspects such as safety, efficiency, and intermediate correctness. We propose a
framework based on deterministic finite automata (DFAs) that encodes tasks as
sets of valid tool-use paths, enabling principled assessment of agent behavior
in diverse world models. Building on this foundation, we introduce CORE, a
suite of five metrics (Path Correctness, Path Correctness-Kendall's tau
Composite, Prefix Criticality, Harmful-Call Rate, and Efficiency) that
quantify alignment with expected execution patterns. Across diverse worlds, our
method reveals important performance differences between agents that would
otherwise appear equivalent under traditional final-state evaluation schemes.
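The minimal Python sketch below illustrates the core idea of encoding a task's valid tool-use paths as a DFA over tool calls. The ToolDFA class, the state names, and the example tools are illustrative assumptions, not the paper's implementation.

# Sketch: a task's valid tool-use paths as a DFA over tool names.
class ToolDFA:
    def __init__(self, start, accept, transitions):
        self.start = start              # initial state
        self.accept = set(accept)       # accepting (task-complete) states
        self.transitions = transitions  # maps (state, tool) -> next state

    def accepts(self, calls):
        """Return True if the tool-call sequence follows a valid path."""
        state = self.start
        for tool in calls:
            key = (state, tool)
            if key not in self.transitions:
                return False            # call not allowed from this state
            state = self.transitions[key]
        return state in self.accept

# Hypothetical task: search, then book, then confirm (search may be retried).
dfa = ToolDFA(
    start="s0",
    accept={"s3"},
    transitions={
        ("s0", "search"): "s1",
        ("s1", "search"): "s1",   # retries allowed
        ("s1", "book"): "s2",
        ("s2", "confirm"): "s3",
    },
)

print(dfa.accepts(["search", "book", "confirm"]))  # True: valid path
print(dfa.accepts(["search", "confirm"]))          # False: skipped a step

Against such an automaton, an agent's call sequence can be judged not only by its final state but also by whether every intermediate call stays on a valid path.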
AI Insights
- Path-Correctness is defined as the maximum pairwise similarity between a condensed agent path and each reference in the HLR candidate set, guaranteeing a bounded, monotonic score under edits (see the sketch after this list).
- The CORE suite detects every failure mode that can arise from a call sequence relative to the DFA, not just the final state.
- Using the metrics together yields a comprehensive view of safety, efficiency, and intermediate correctness that single-metric tests miss.
- CORE reveals performance gaps between agents that appear equivalent under traditional final-state evaluation.
- The framework is limited to deterministic environments, lacking support for stochastic dynamics, fine-grained timing, or continuous control.
- Human-facing UX quality and timing within calls are not captured by the current metrics, highlighting future research directions.
- The Path-Correctness score's properties (range, perfect match, maximal mismatch, and monotonicity) ensure robust comparison across diverse world models.
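The sketch below illustrates the Path-Correctness idea from the first insight: score a condensed agent path by its best match against a set of reference paths, yielding a value in [0, 1]. The condensation rule (collapsing consecutive duplicate calls) and the use of difflib's SequenceMatcher ratio as the similarity function are stand-in assumptions; the paper's exact definitions may differ.

from difflib import SequenceMatcher

def condense(path):
    """Collapse consecutive duplicate tool calls (assumed condensation rule)."""
    out = []
    for call in path:
        if not out or out[-1] != call:
            out.append(call)
    return out

def path_correctness(agent_path, reference_paths):
    """Max pairwise similarity in [0, 1]; 1.0 when the condensed agent
    path exactly equals some condensed reference path."""
    agent = condense(agent_path)
    return max(
        SequenceMatcher(None, agent, condense(ref)).ratio()
        for ref in reference_paths
    )

references = [["search", "book", "confirm"],
              ["search", "quote", "book", "confirm"]]
print(path_correctness(["search", "search", "book", "confirm"], references))  # 1.0
print(path_correctness(["search", "cancel", "confirm"], references))          # < 1.0

Because the score is a maximum over references and the underlying ratio is bounded, edits to the agent path move the score smoothly rather than flipping a binary pass/fail judgment.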
Abstract
Large language model (LLM) and agent techniques for data analysis (a.k.a.
LLM/Agent-as-Data-Analyst) have demonstrated substantial impact in both
academia and industry. Compared with traditional rule-based or small-model-based
approaches, (agentic) LLMs enable complex data understanding, natural
language interfaces, semantic analysis functions, and autonomous pipeline
orchestration. The technical evolution further distills five key design goals
for intelligent data analysis agents, namely semantic-aware design,
modality-hybrid integration, autonomous pipelines, tool-augmented workflows,
and support for open-world tasks. From a modality perspective, we review
LLM-based techniques for (i) structured data (e.g., table question answering
for relational data and NL2GQL for graph data), (ii) semi-structured data
(e.g., markup language understanding and semi-structured table modeling),
(iii) unstructured data (e.g., chart understanding, document understanding,
programming language vulnerability detection), and (iv) heterogeneous data (e.g.,
data retrieval and modality alignment for data lakes). Finally, we outline the
remaining challenges and propose several insights and practical directions for
advancing LLM/Agent-powered data analysis.
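As a concrete illustration of the natural-language-interface pattern surveyed for relational data, the sketch below wires a placeholder LLM call into a table question answering (NL2SQL-style) step. The llm_complete stub, the prompt format, and the sales schema are hypothetical and do not come from any specific system in the survey.

import sqlite3

def llm_complete(prompt: str) -> str:
    # Placeholder: a real agent would call an LLM here. A fixed query is
    # returned so the sketch stays runnable and self-contained.
    return "SELECT region, SUM(revenue) FROM sales GROUP BY region;"

def answer_table_question(conn, schema: str, question: str):
    # Ask the model to translate the question into SQL, then execute it.
    prompt = (
        f"Schema:\n{schema}\n"
        f"Question: {question}\n"
        "Write a single SQL query that answers the question."
    )
    sql = llm_complete(prompt)
    return conn.execute(sql).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EU", 120.0), ("US", 340.0), ("EU", 80.0)])

print(answer_table_question(conn, "sales(region TEXT, revenue REAL)",
                            "What is total revenue per region?"))

In a full agentic pipeline, steps like this would be orchestrated autonomously alongside retrieval, modality alignment, and validation of the generated query.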