IBM
Abstract
The rapid shift from stateless large language models (LLMs) to autonomous, goal-driven agents raises a central question: When is agentic AI truly necessary? While agents enable multi-step reasoning, persistent memory, and tool orchestration, deploying them indiscriminately leads to higher cost, complexity, and risk.
We present STRIDE (Systematic Task Reasoning Intelligence Deployment Evaluator), a framework that provides principled recommendations for selecting between three modalities: (i) direct LLM calls, (ii) guided AI assistants, and (iii) fully autonomous agentic AI. STRIDE integrates structured task decomposition, dynamism attribution, and self-reflection requirement analysis to produce an Agentic Suitability Score, ensuring that full agentic autonomy is reserved for tasks with inherent dynamism or evolving context.
Evaluated across 30 real-world tasks spanning SRE, compliance, and enterprise automation, STRIDE achieved 92% accuracy in modality selection, reduced unnecessary agent deployments by 45%, and cut resource costs by 37%. Expert validation over six months in SRE and compliance domains confirmed its practical utility, with domain specialists agreeing that STRIDE effectively distinguishes among tasks requiring simple LLM calls, guided assistants, and full agentic autonomy. This work reframes agent adoption as a necessity-driven design decision, ensuring autonomy is applied only when its benefits justify the costs.
AI Summary
- The framework can be used in conjunction with existing benchmarks to evaluate the performance of agentic AI systems. [3]
- Future extensions to STRIDE will include multimodal tasks, reinforcement learning for weight tuning, and validation at enterprise scale. [3]
- STRIDE's scoring functions are heuristic by design, striking a balance between interpretability and generality. [3]
- STRIDE (Systematic Task Reasoning Intelligence Deployment Evaluator) is a framework that determines when tasks require agentic AI, AI assistants, or simple LLM calls. [2]
- STRIDE integrates five analytical dimensions: structured task decomposition, dynamic reasoning and tool-interaction scoring, dynamism attribution analysis, self-reflection requirement assessment, and agentic suitability inference; a heuristic sketch of how these might combine follows below. [1]
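To make the pipeline concrete, here is a minimal sketch of how the four input dimensions might be combined into an Agentic Suitability Score that selects a modality. This is an illustrative reading of the summary above, not STRIDE's published implementation: the weights, thresholds, and all names (`TaskProfile`, `agentic_suitability`) are assumptions.

```python
# Illustrative STRIDE-style scorer. Weights and thresholds are hypothetical;
# the paper describes its scoring functions only as heuristic by design.
from dataclasses import dataclass

@dataclass
class TaskProfile:
    decomposition_depth: float  # structured task decomposition, in [0, 1]
    dynamic_reasoning: float    # dynamic reasoning / tool interaction, in [0, 1]
    dynamism: float             # dynamism attribution, in [0, 1]
    self_reflection: float      # self-reflection requirement, in [0, 1]

def agentic_suitability(task: TaskProfile) -> str:
    """Weighted sum of the four input dimensions -> modality recommendation."""
    score = (0.2 * task.decomposition_depth
             + 0.3 * task.dynamic_reasoning
             + 0.3 * task.dynamism
             + 0.2 * task.self_reflection)
    if score < 0.35:
        return "direct LLM call"
    if score < 0.70:
        return "guided AI assistant"
    return "autonomous agent"  # reserved for inherently dynamic tasks

# Example: an SRE incident-triage task with an evolving context
task = TaskProfile(decomposition_depth=0.8, dynamic_reasoning=0.9,
                   dynamism=0.85, self_reflection=0.6)
print(agentic_suitability(task))  # -> autonomous agent (score = 0.805)
```

The two cut-points encode the paper's thesis that autonomy must be justified: raising them biases deployments away from full agents unless the dynamism and self-reflection demands are genuinely high.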
ulamai
Abstract
We extend the moduli-theoretic framework of psychometric batteries to the domain of dynamical systems. While previous work established the AAI capability score as a static functional on the space of agent representations, this paper formalizes the agent as a flow $\nu_r$ parameterized by computational resource $r$, governed by a recursive Generator-Verifier-Updater (GVU) operator. We prove that this operator generates a vector field on the parameter manifold $\Theta$, and we identify the coefficient of self-improvement $\kappa$ as the Lie derivative of the capability functional along this flow.
The central contribution of this work is the derivation of the Variance Inequality, a spectral condition that is sufficient (under mild regularity) for the stability of self-improvement. We show that a sufficient condition for $\kappa > 0$ is that, up to curvature and step-size effects, the combined noise of generation and verification be sufficiently small.
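Schematically (our reconstruction from the abstract; the precise statement, regularity conditions, and constants are in the paper's Theorem 4.1), the two central objects can be written as

$$\kappa(\theta) \;=\; (\mathcal{L}_{\nu} F)(\theta) \;=\; dF_{\theta}\big(\nu(\theta)\big), \qquad \kappa > 0 \ \text{ whenever } \ \sigma^{2}_{\text{gen}} + \sigma^{2}_{\text{ver}} \;\lesssim\; \text{(signal)} - \text{(curvature and step-size corrections)},$$

where $F$ is the capability functional, $\nu$ the GVU-generated vector field on $\Theta$, and $\sigma^{2}_{\text{gen}}, \sigma^{2}_{\text{ver}}$ the generation and verification noise.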
We then apply this formalism to unify the recent literature on Language Self-Play (LSP), Self-Correction, and Synthetic Data bootstrapping. We demonstrate that architectures such as STaR, SPIN, Reflexion, GANs, and AlphaZero are specific topological realizations of the GVU operator that satisfy the Variance Inequality through filtration, adversarial discrimination, or grounding in formal systems.
AI Summary
- The GVU framework is used to analyze the stability of self-improvement in AI systems. [3]
- The Variance Inequality (Theorem 4.1) provides a sufficient condition for stable self-improvement, requiring a high Signal-to-Noise Ratio (SNR) for both the generator and the verifier. [3]
- The paper formalizes "AI slop" via slop events at a parameter $\theta$, a slop mass, and a slop regime, providing a framework for understanding the stability of self-improvement in AI systems and highlighting the importance of high SNR for both generators and verifiers. [3]
- The paper defines AI slop as a region where the internal Verifier ranks outputs among its top fraction, but they actually lie in the bottom fraction of the true battery score (rendered schematically below). [2]
- The paper introduces the Generator-Verifier-Updater (GVU) operator, which models the interaction between a generator and its verifier (see the loop sketch below). [1]
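The slop region described above admits a compact schematic rendering (our notation, not the paper's): for fractions $p, q \in (0,1)$,

$$\mathrm{Slop}_{p,q}(\theta) \;=\; \{\, y \;:\; \operatorname{rank}_{V_\theta}(y) \le p \ \text{ and } \ \operatorname{rank}_{B}(y) \ge 1 - q \,\},$$

the outputs the internal Verifier $V_\theta$ places in its top $p$-fraction that the true battery score $B$ places in its bottom $q$-fraction.

As a concrete illustration of the GVU pattern, here is a minimal generate-verify-update loop with verifier filtration, in the spirit of STaR/SPIN-style bootstrapping. Every name and the acceptance threshold are illustrative stand-ins, not the paper's formal operator.

```python
# Minimal GVU (Generator-Verifier-Updater) loop with filtration.
# All names and the threshold are illustrative assumptions.
import random

def generate(theta, prompt):
    """Generator: propose a candidate output (stub)."""
    return f"candidate for {prompt!r} (draw={random.random():.2f})"

def verify(candidate):
    """Verifier: noisy proxy score for true quality, in [0, 1]."""
    return random.random()

def update(theta, accepted):
    """Updater: stand-in for a fine-tuning step on accepted samples."""
    return theta + accepted

def gvu_step(theta, prompts, threshold=0.8):
    # Filtration keeps only high-verifier-score candidates; lowering the
    # effective noise this way is one route to satisfying the Variance
    # Inequality (the others named in the paper: adversarial
    # discrimination and grounding in formal systems).
    accepted = [c for c in (generate(theta, p) for p in prompts)
                if verify(c) >= threshold]
    return update(theta, accepted)

theta = []  # toy "parameters": the accepted training pool
for _ in range(3):
    theta = gvu_step(theta, ["prompt-a", "prompt-b"])
print(len(theta), "samples accepted across three GVU steps")
```

On this reading, the threshold is the filtration knob: raising it trades sample yield for verifier signal-to-noise, which is precisely the quantity the Variance Inequality constrains.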