Synkrasis Labs, Athens
Abstract
Evaluating AI agents that solve real-world tasks through function-call
sequences remains an open challenge. Existing agentic benchmarks often reduce
evaluation to a binary judgment of the final state, overlooking critical
aspects such as safety, efficiency, and intermediate correctness. We propose a
framework based on deterministic finite automata (DFAs) that encodes tasks as
sets of valid tool-use paths, enabling principled assessment of agent behavior
in diverse world models. Building on this foundation, we introduce CORE, a
suite of five metrics (Path Correctness, Path Correctness-Kendall's tau
Composite, Prefix Criticality, Harmful-Call Rate, and Efficiency) that
quantify alignment with expected execution patterns. Across diverse worlds, our
method reveals important performance differences between agents that would
otherwise appear equivalent under traditional final-state evaluation schemes.
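The minimal Python sketch below illustrates the core idea of encoding a task's valid tool-use paths as a DFA over tool calls. The ToolDFA class, the state names, and the example tools are illustrative assumptions, not the paper's implementation.

# Sketch: a task's valid tool-use paths as a DFA over tool names.
class ToolDFA:
    def __init__(self, start, accept, transitions):
        self.start = start              # initial state
        self.accept = set(accept)       # accepting (task-complete) states
        self.transitions = transitions  # maps (state, tool) -> next state

    def accepts(self, calls):
        """Return True if the tool-call sequence follows a valid path."""
        state = self.start
        for tool in calls:
            key = (state, tool)
            if key not in self.transitions:
                return False            # call not allowed from this state
            state = self.transitions[key]
        return state in self.accept

# Hypothetical task: search, then book, then confirm (search may be retried).
dfa = ToolDFA(
    start="s0",
    accept={"s3"},
    transitions={
        ("s0", "search"): "s1",
        ("s1", "search"): "s1",   # retries allowed
        ("s1", "book"): "s2",
        ("s2", "confirm"): "s3",
    },
)

print(dfa.accepts(["search", "book", "confirm"]))  # True: valid path
print(dfa.accepts(["search", "confirm"]))          # False: skipped a step

Against such an automaton, an agent's call sequence can be judged not only by its final state but also by whether every intermediate call stays on a valid path.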
AI Insights
- Path-Correctness is defined as the maximum pairwise similarity between a condensed agent path and each reference in the HLR candidate set, guaranteeing a bounded, monotonic score under edits (see the sketch after this list).
- The CORE suite detects every failure mode that can arise from a call sequence relative to the DFA, not just the final state.
- Using the metrics together yields a comprehensive view of safety, efficiency, and intermediate correctness that single-metric tests miss.
- CORE reveals performance gaps between agents that appear equivalent under traditional final-state evaluation.
- The framework is limited to deterministic environments, lacking support for stochastic dynamics, fine-grained timing, or continuous control.
- Human-facing UX quality and timing within calls are not captured by the current metrics, highlighting future research directions.
- The Path-Correctness score's properties (range, perfect match, maximal mismatch, and monotonicity) ensure robust comparison across diverse world models.
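The sketch below illustrates the Path-Correctness idea from the first insight: score a condensed agent path by its best match against a set of reference paths, yielding a value in [0, 1]. The condensation rule (collapsing consecutive duplicate calls) and the use of difflib's SequenceMatcher ratio as the similarity function are stand-in assumptions; the paper's exact definitions may differ.

from difflib import SequenceMatcher

def condense(path):
    """Collapse consecutive duplicate tool calls (assumed condensation rule)."""
    out = []
    for call in path:
        if not out or out[-1] != call:
            out.append(call)
    return out

def path_correctness(agent_path, reference_paths):
    """Max pairwise similarity in [0, 1]; 1.0 when the condensed agent
    path exactly equals some condensed reference path."""
    agent = condense(agent_path)
    return max(
        SequenceMatcher(None, agent, condense(ref)).ratio()
        for ref in reference_paths
    )

references = [["search", "book", "confirm"],
              ["search", "quote", "book", "confirm"]]
print(path_correctness(["search", "search", "book", "confirm"], references))  # 1.0
print(path_correctness(["search", "cancel", "confirm"], references))          # < 1.0

Because the score is a maximum over references and the underlying ratio is bounded, edits to the agent path move the score smoothly rather than flipping a binary pass/fail judgment.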
Abstract
Large language model (LLM) and agent techniques for data analysis (a.k.a.
LLM/Agent-as-Data-Analyst) have demonstrated substantial impact in both
academia and industry. Compared with traditional rule-based or small-model-based
approaches, (agentic) LLMs enable complex data understanding, natural
language interfaces, semantic analysis functions, and autonomous pipeline
orchestration. The technical evolution further distills five key design goals
for intelligent data analysis agents, namely semantic-aware design,
modality-hybrid integration, autonomous pipelines, tool-augmented workflows,
and support for open-world tasks. From a modality perspective, we review
LLM-based techniques for (i) structured data (e.g., table question answering
for relational data and NL2GQL for graph data), (ii) semi-structured data
(e.g., markup language understanding and semi-structured table modeling),
(iii) unstructured data (e.g., chart understanding, document understanding,
programming language vulnerability detection), and (iv) heterogeneous data (e.g.,
data retrieval and modality alignment for data lakes). Finally, we outline the
remaining challenges and propose several insights and practical directions for
advancing LLM/Agent-powered data analysis.
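As a concrete illustration of the natural-language-interface pattern surveyed for relational data, the sketch below wires a placeholder LLM call into a table question answering (NL2SQL-style) step. The llm_complete stub, the prompt format, and the sales schema are hypothetical and do not come from any specific system in the survey.

import sqlite3

def llm_complete(prompt: str) -> str:
    # Placeholder: a real agent would call an LLM here. A fixed query is
    # returned so the sketch stays runnable and self-contained.
    return "SELECT region, SUM(revenue) FROM sales GROUP BY region;"

def answer_table_question(conn, schema: str, question: str):
    # Ask the model to translate the question into SQL, then execute it.
    prompt = (
        f"Schema:\n{schema}\n"
        f"Question: {question}\n"
        "Write a single SQL query that answers the question."
    )
    sql = llm_complete(prompt)
    return conn.execute(sql).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EU", 120.0), ("US", 340.0), ("EU", 80.0)])

print(answer_table_question(conn, "sales(region TEXT, revenue REAL)",
                            "What is total revenue per region?"))

In a full agentic pipeline, steps like this would be orchestrated autonomously alongside retrieval, modality alignment, and validation of the generated query.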