LLMs for AI Agents

STRIDE: A Systematic Framework for Selecting AI Modalities -- Agentic AI, AI Assistants, or LLM Calls

IBM

Rate paper: 👍 👎 ♥ Save

Abstract
The rapid shift from stateless large language models (LLMs) to autonomous, goal-driven agents raises a central question: When is agentic AI truly necessary? While agents enable multi-step reasoning, persistent memory, and tool orchestration, deploying them indiscriminately leads to higher cost, complexity, and risk. We present STRIDE (Systematic Task Reasoning Intelligence Deployment Evaluator), a framework that provides principled recommendations for selecting between three modalities: (i) direct LLM calls, (ii) guided AI assistants, and (iii) fully autonomous agentic AI. STRIDE integrates structured task decomposition, dynamism attribution, and self-reflection requirement analysis to produce an Agentic Suitability Score, ensuring that full agentic autonomy is reserved for tasks with inherent dynamism or evolving context. Evaluated across 30 real-world tasks spanning SRE, compliance, and enterprise automation, STRIDE achieved 92% accuracy in modality selection, reduced unnecessary agent deployments by 45%, and cut resource costs by 37%. Expert validation over six months in SRE and compliance domains confirmed its practical utility, with domain specialists agreeing that STRIDE effectively distinguishes between tasks requiring simple LLM calls, guided assistants, or full agentic autonomy. This work reframes agent adoption as a necessity-driven design decision, ensuring autonomy is applied only when its benefits justify the costs.

AI Summary

The framework can be used in conjunction with existing benchmarks to evaluate the performance of agentic AI systems. [3]
Future extensions to STRIDE will include multimodal tasks, reinforcement learning for weight tuning, and validation at enterprise scale. [3]
STRIDE's scoring functions are heuristic by design, striking a balance between interpretability and generality. [3]
STRIDE (Systematic Task Reasoning Intelligence Deployment Evaluator) is a framework that determines when tasks require agentic AI, AI assistants, or simple LLM calls. [2]
STRIDE integrates five analytical dimensions: structured task decomposition, dynamic reasoning and tool-interaction scoring, dynamism attribution analysis, self-reflection requirement assessment, and agentic suitability inference. [1]

Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia

Google DeepMind

Rate paper: 👍 👎 ♥ Save

Abstract
Large Language Model (LLM) agents have demonstrated impressive capabilities for social interaction and are increasingly being deployed in situations where they might engage with both human and artificial agents. These interactions represent a critical frontier for LLM-based agents, yet existing evaluation methods fail to measure how well these capabilities generalize to novel social situations. In this paper, we introduce a method for evaluating the ability of LLM-based agents to cooperate in zero-shot, mixed-motive environments using Concordia, a natural language multi-agent simulation environment. Our method measures general cooperative intelligence by testing an agent's ability to identify and exploit opportunities for mutual gain across diverse partners and contexts. We present empirical results from the NeurIPS 2024 Concordia Contest, where agents were evaluated on their ability to achieve mutual gains across a suite of diverse scenarios ranging from negotiation to collective action problems. Our findings reveal significant gaps between current agent capabilities and the robust generalization required for reliable cooperation, particularly in scenarios demanding persuasion and norm enforcement.

AI Agents

Self-Improving AI Agents through Self-Play

ulamai

Rate paper: 👍 👎 ♥ Save

Abstract
We extend the moduli-theoretic framework of psychometric batteries to the domain of dynamical systems. While previous work established the AAI capability score as a static functional on the space of agent representations, this paper formalizes the agent as a flow $ν_r$ parameterized by computational resource $r$, governed by a recursive Generator-Verifier-Updater (GVU) operator. We prove that this operator generates a vector field on the parameter manifold $Θ$, and we identify the coefficient of self-improvement $κ$ as the Lie derivative of the capability functional along this flow. The central contribution of this work is the derivation of the Variance Inequality, a spectral condition that is sufficient (under mild regularity) for the stability of self-improvement. We show that a sufficient condition for $κ> 0$ is that, up to curvature and step-size effects, the combined noise of generation and verification must be small enough. We then apply this formalism to unify the recent literature on Language Self-Play (LSP), Self-Correction, and Synthetic Data bootstrapping. We demonstrate that architectures such as STaR, SPIN, Reflexion, GANs and AlphaZero are specific topological realizations of the GVU operator that satisfy the Variance Inequality through filtration, adversarial discrimination, or grounding in formal systems.

AI Summary

The GVU framework is used to analyze the stability of self-improvement in AI systems. [3]
The Variance Inequality (Theorem 4.1) provides a sufficient condition for stable self-improvement, requiring a high Signal-to-Noise Ratio (SNR) for both the generator and the verifier. [3]
AI slop event at parameter θ AI slop mass and slop regime The paper provides a framework for understanding the stability of self-improvement in AI systems, highlighting the importance of high SNR for both generators and verifiers. [3]
The paper defines AI slop as a region where the internal Verifier ranks outputs among its top fraction, but they actually lie in the bottom fraction of the true battery score. [2]
The paper introduces the Generalized Verifier-Generator Update (GVU) framework, which models the interaction between a generator and its verifier. [1]

AI and Society

Artificial Intelligence / Human Intelligence: Who Controls Whom?

Ecole normale suprieure

Rate paper: 👍 👎 ♥ Save

Abstract
Using the example of the film 2001: A Space Odyssey, this chapter illustrates the challenges posed by an AI capable of making decisions that go against human interests. But are human decisions always rational and ethical? In reality, the cognitive decision-making process is influenced by cognitive biases that affect our behavior and choices. AI not only reproduces these biases, but can also exploit them, with the potential to shape our decisions and judgments. Behind IA algorithms, there are sometimes individuals who show little concern for fundamental rights and impose their own rules. To address the ethical and societal challenges raised by AI and its governance, the regulation of digital platforms and education are keys levers. Regulation must reflect ethical, legal, and political choices, while education must strengthen digital literacy and teach people to make informed and critical choices when facing digital technologies.

The dual footprint of artificial intelligence: environmental and social impacts across the globe

Polytechnic Institute of

Rate paper: 👍 👎 ♥ Save

Abstract
This article introduces the concept of the 'dual footprint' as a heuristic device to capture the commonalities and interdependencies between the different impacts of artificial intelligence (AI) on the natural and social surroundings that supply resources for its production and use. Two in-depth case studies, each illustrating international flows of raw materials and of data work services, portray the AI industry as a value chain that spans national boundaries and perpetuates inherited global inequalities. The countries that drive AI development generate a massive demand for inputs and trigger social costs that, through the value chain, largely fall on more peripheral actors. The arrangements in place distribute the costs and benefits of AI unequally, resulting in unsustainable practices and preventing the upward mobility of more disadvantaged countries. The dual footprint grasps how the environmental and social dimensions of the dual footprint emanate from similar underlying socioeconomic processes and geographical trajectories.

AI Summary

The carbon (and water) footprints of data centre functioning, model training, and inference mainly occur in countries that lead AI development, such as the United States and France. [3]
The supply of data work for countries like the United States and France comes from areas with lower labour costs, including middle- and lower-income countries like Argentina and Madagascar. [3]
The 'dual' nature of the footprint is illuminated by the fact that the same country exports both mining products and data work services, with imports flowing towards countries leading the worldwide AI race. [3]
AI value chain: The series of activities involved in developing and deploying artificial intelligence systems, from raw materials extraction to software development and deployment. [3]
Carbon footprint: The amount of greenhouse gas emissions associated with a particular activity or product. [3]
The analysis takes a step back from stricter interpretations of the footprint concept as an accounting method and instead focuses on a bird's eye view, revealing who is impacted by pressure on resources and related effects spread along the AI value chain. [2]

Research Automation with AI

A Hierarchical Tree-based approach for creating Configurable and Static Deep Research Agent (Static-DRA)

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

Abstract
The advancement in Large Language Models has driven the creation of complex agentic systems, such as Deep Research Agents (DRAs), to overcome the limitations of static Retrieval Augmented Generation (RAG) pipelines in handling complex, multi-turn research tasks. This paper introduces the Static Deep Research Agent (Static-DRA), a novel solution built upon a configurable and hierarchical Tree-based static workflow. The core contribution is the integration of two user-tunable parameters, Depth and Breadth, which provide granular control over the research intensity. This design allows end-users to consciously balance the desired quality and comprehensiveness of the research report against the associated computational cost of Large Language Model (LLM) interactions. The agent's architecture, comprising Supervisor, Independent, and Worker agents, facilitates effective multi-hop information retrieval and parallel sub-topic investigation. We evaluate the Static-DRA against the established DeepResearch Bench using the RACE (Reference-based Adaptive Criteria-driven Evaluation) framework. Configured with a depth of 2 and a breadth of 5, and powered by the gemini-2.5-pro model, the agent achieved an overall score of 34.72. Our experiments validate that increasing the configured Depth and Breadth parameters results in a more in-depth research process and a correspondingly higher evaluation score. The Static-DRA offers a pragmatic and resource-aware solution, empowering users with transparent control over the deep research process. The entire source code, outputs and benchmark results are open-sourced at https://github.com/SauravP97/Static-Deep-Research/

The future of AI in critical mineral exploration

Stanford University

Rate paper: 👍 👎 ♥ Save

Abstract
The energy transition through increased electrification has put the worlds attention on critical mineral exploration Even with increased investments a decrease in new discoveries has taken place over the last two decades Here I propose a solution to this problem where AI is implemented as the enabler of a rigorous scientific method for mineral exploration that aims to reduce cognitive bias and false positives drive down the cost of exploration I propose a new scientific method that is based on a philosophical approach founded on the principles of Bayesianism and falsification In this approach data acquisition is in the first place seen as a means to falsify human generated hypothesis Decision of what data to acquire next is quantified with verifiable metrics and based on rational decision making A practical protocol is provided that can be used as a template in any exploration campaign However in order to make this protocol practical various form of artificial intelligence are needed I will argue that the most important form are one novel unsupervised learning methods that collaborate with domain experts to better understand data and generate multiple competing geological hypotheses and two humanintheloop AI algorithms that can optimally plan various geological geophysical geochemical and drilling data acquisition where uncertainty reduction of geological hypothesis precedes the uncertainty reduction on grade and tonnage

AI Summary

Efficacy of information (EI): a metric that quantifies how much future data will reduce uncertainty on average on some quantity of interest. [3]
The author advocates for a new scientific method for mineral exploration, focusing on decision-making rather than traditional geophysical inversion. [2]
Epistemic uncertainty: the lack of understanding we still have about the nature of orebodies. [1]

AGI: Artificial General Intelligence

The Geometry of Benchmarks: A New Path Toward AGI

ulamai

Rate paper: 👍 👎 ♥ Save

Abstract
Benchmarks are the primary tool for assessing progress in artificial intelligence (AI), yet current practice evaluates models on isolated test suites and provides little guidance for reasoning about generality or autonomous self-improvement. Here we introduce a geometric framework in which all psychometric batteries for AI agents are treated as points in a structured moduli space, and agent performance is described by capability functionals over this space. First, we define an Autonomous AI (AAI) Scale, a Kardashev-style hierarchy of autonomy grounded in measurable performance on batteries spanning families of tasks (for example reasoning, planning, tool use and long-horizon control). Second, we construct a moduli space of batteries, identifying equivalence classes of benchmarks that are indistinguishable at the level of agent orderings and capability inferences. This geometry yields determinacy results: dense families of batteries suffice to certify performance on entire regions of task space. Third, we introduce a general Generator-Verifier-Updater (GVU) operator that subsumes reinforcement learning, self-play, debate and verifier-based fine-tuning as special cases, and we define a self-improvement coefficient $κ$ as the Lie derivative of a capability functional along the induced flow. A variance inequality on the combined noise of generation and verification provides sufficient conditions for $κ> 0$. Our results suggest that progress toward artificial general intelligence (AGI) is best understood as a flow on moduli of benchmarks, driven by GVU dynamics rather than by scores on individual leaderboards.

AI Summary

GVU Dynamics: a formalism that connects static geometry to learning, showing that many contemporary training procedures are special cases of reinforcement learning on the moduli space. [3]
Self-Improvement Coefficient κ: a measure of the rate of change of an agent's capability trajectory over time. [3]
autonomous AI scale moduli space of batteries GVU dynamics self-improvement coefficient κ variance inequality Autonomous AI Scale: a framework for evaluating autonomous AI systems based on performance thresholds on families of batteries. [2]

Deep Learning

Sparse Computations in Deep Learning Inference

National Technical Univer

Rate paper: 👍 👎 ♥ Save

Abstract
The computational demands of modern Deep Neural Networks (DNNs) are immense and constantly growing. While training costs usually capture public attention, inference demands are also contributing in significant computational, energy and environmental footprints. Sparsity stands out as a critical mechanism for drastically reducing these resource demands. However, its potential remains largely untapped and is not yet fully incorporated in production AI systems. To bridge this gap, this work provides the necessary knowledge and insights for performance engineers keen to get involved in deep learning inference optimization. In particular, in this work we: a) discuss the various forms of sparsity that can be utilized in DNN inference, b) explain how the original dense computations translate to sparse kernels, c) provide an extensive bibliographic review of the state-of-the-art in the implementation of these kernels for CPUs and GPUs, d) discuss the availability of sparse datasets in support of sparsity-related research and development, e) explore the current software tools and frameworks that provide robust sparsity support, and f) present evaluation results of different implementations of the key SpMM and SDDMM kernels on CPU and GPU platforms. Ultimately, this paper aims to serve as a resource for performance engineers seeking to develop and deploy highly efficient sparse deep learning models in productions.

AI Summary

The text discusses various aspects of deep learning, including model architecture, training, optimization, and inference. [3]
Model Training: The process that makes a DNN learn to perform a specific task, much like a student learns from practice and correction. [3]
Batch Training: Instead of feeding individual data points one by one, models are trained on small groups of samples called batches. [3]
Training often requires many epochs to fully learn the data’s patterns. [3]
The text concludes that deep learning involves various steps from model architecture to inference, and optimization is crucial for efficient deployment of DNNs. [3]
The text mentions several deep learning frameworks such as PyTorch, TensorFlow, JAX, and Hugging Face Hub. [3]
Deep learning involves various steps from model architecture to inference, and optimization is crucial for efficient deployment of DNNs. [3]
But, just like how you need to practice and get better at recognizing cats, the computer needs to be trained and optimized so that it can perform well in real-world situations. [3]
Epochs: A single pass through the entire dataset is called an epoch. [2]
The text does not provide a clear explanation of the differences between various model representations such as ONNX, TorchScript, TensorFlow SavedModel / GraphDef, etc. [1]

Weight Space Representation Learning with Neural Fields

EPFL

Rate paper: 👍 👎 ♥ Save

Abstract
In this work, we investigate the potential of weights to serve as effective representations, focusing on neural fields. Our key insight is that constraining the optimization space through a pre-trained base model and low-rank adaptation (LoRA) can induce structure in weight space. Across reconstruction, generation, and analysis tasks on 2D and 3D data, we find that multiplicative LoRA weights achieve high representation quality while exhibiting distinctiveness and semantic structure. When used with latent diffusion models, multiplicative LoRA weights enable higher-quality generation than existing weight-space methods.

Help us improve your experience!