Hi!

Your personalized paper recommendations for 10 to 14 November, 2025.

🎯 Top Personalized Recommendations

Towards a Standard, Enterprise-Relevant Agentic AI Benchmark: Lessons from 5.5 billion tokens' worth of agentic AI evaluations

Kamiwaza AI

Why we think this paper is great for you:
This paper offers crucial insights into establishing reliable evaluation methods for agentic AI systems, which is essential for successful enterprise adoption and deployment. It will help you understand how to effectively benchmark AI for real-world scenarios.

Rate paper: 👍 👎 ♥ Save

Abstract
Enterprise adoption of agentic AI systems requires reliable evaluation methods that reflect real-world deployment scenarios. Traditional LLM benchmarks suffer from training data contamination and fail to measure agentic capabilities such as multi-step tool use and decision-making under uncertainty. We present the Kamiwaza Agentic Merit Index (KAMI) v0.1, an enterprise-focused benchmark that addresses both contamination resistance and agentic evaluation. Through 170,000 LLM test items processing over 5.5 billion tokens across 35 model configurations, we demonstrate that traditional benchmark rankings poorly predict practical agentic performance. Notably, newer generation models like Llama 4 or Qwen 3 do not always outperform their older generation variants on enterprise-relevant tasks, contradicting traditional benchmark trends. We also present insights on cost-performance tradeoffs, model-specific behavioral patterns, and the impact of reasoning capabilities on token efficiency -- findings critical for enterprises making deployment decisions.

AI Summary

Kamiwaza Agentic Merit Index (KAMI): An enterprise-focused benchmark designed to provide a standard, contamination-resistant measure of real-world, enterprise-relevant agentic AI capability, analogous to the SPEC CPU benchmark. [3]
Traditional LLM benchmarks, including aggregated intelligence indices, are poor predictors of real-world agentic performance in enterprise-relevant tasks due to data contamination and "agentic disconnect." Newer generation LLMs do not consistently outperform older variants on practical enterprise agentic tasks, challenging conventional benchmark-driven model selection and highlighting the need for specialized evaluation. [2]
Implementing agentic AI evaluations requires robust sandboxing and isolated execution environments (e.g., Docker containers) to prevent chaotic LLM behavior from corrupting infrastructure or causing cascading failures. [2]
Reasoning-enabled models can significantly boost accuracy for smaller LLMs but incur substantial costs in terms of token usage and wall-clock time, necessitating careful cost-performance trade-off analysis for enterprise deployment. [2]
Evaluating agentic AI for enterprise adoption should prioritize metrics beyond simple accuracy, including run-to-run reliability (standard deviation, confidence intervals) to account for LLM stochasticity and ensure consistent performance. [2]
A two-stage evaluation approach is recommended for enterprises: an initial rapid screening with fewer trials, followed by a more rigorous, higher-replication evaluation for top candidate models to optimize resource utilization. [2]
The PICARD framework effectively combats benchmark data contamination by randomizing variables and sandbox data, ensuring LLMs cannot memorize test items and enabling true agentic assessment. [2]
Agentic Disconnect: The discrepancy where standard LLM benchmarks measure capabilities different from what enterprises require, such as multi-step tool use, decision-making under uncertainty, and real-world task completion. [2]
Benchmark Data Contamination: The phenomenon where benchmark test data is inadvertently or intentionally included in an LLM's training data, leading to inflated benchmark scores that do not reflect genuine capability or real-world performance. [2]

Towards Outcome-Oriented, Task-Agnostic Evaluation of AI Agents

Writer, Inc

Why we think this paper is great for you:
You'll find this highly relevant for understanding how to evaluate AI agents based on their decision quality and operational autonomy, moving beyond mere infrastructural metrics. This perspective is vital for managing AI projects effectively.

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

Abstract
As AI agents proliferate across industries and applications, evaluating their performance based solely on infrastructural metrics such as latency, time-to-first-token, or token throughput is proving insufficient. These metrics fail to capture the quality of an agent's decisions, its operational autonomy, or its ultimate business value. This white paper proposes a novel, comprehensive framework of eleven outcome-based, task-agnostic performance metrics for AI agents that transcend domain boundaries. These metrics are designed to enable organizations to evaluate agents based on the quality of their decisions, their degree of autonomy, their adaptability to new challenges, and the tangible business value they deliver, regardless of the underlying model architecture or specific use case. We introduce metrics such as Goal Completion Rate (GCR), Autonomy Index (AIx), Multi-Step Task Resilience (MTR), and Business Impact Efficiency (BIE). Through a large-scale simulated experiment involving four distinct agent architectures (ReAct, Chain-of-Thought, Tool-Augmented, Hybrid) across five diverse domains (Healthcare, Finance, Marketing, Legal, and Customer Service), we demonstrate the framework's efficacy. Our results reveal significant performance trade-offs between different agent designs, highlighting the Hybrid Agent as the most consistently high-performing model across the majority of our proposed metrics, achieving an average Goal Completion Rate of 88.8\% and the highest Return on Investment (ROI). This work provides a robust, standardized methodology for the holistic evaluation of AI agents, paving the way for more effective development, deployment, and governance.

The Long Shadow of Superstars: Effects on Opportunities, Careers, and Team Production

Why we think this paper is great for you:
This study directly addresses the dynamics of team production and the impact of individual performance on team learning opportunities. It provides valuable insights for managing high-performing technical teams.

Rate paper: 👍 👎 ♥ Save

Abstract
Superstars often dominate key tasks because of their exceptional abilities, but this concentration of responsibility may unintentionally limit on-the-job learning opportunities for others. Using panel data from Major League Baseball (MLB), this study examines how superstar presence affects teammates' opportunities and career outcomes. To address potential endogeneity in team composition, we exploit plausibly exogenous variation in superstar availability caused by injuries. When a superstar is active in the same team-position unit, non-star teammates play significantly less. These short-term reductions in playing time extend to longer horizons: players who begin their careers alongside a superstar who remains active for a full season (i.e., not on the injured list) are about 1.7 times more likely to exit MLB earlier than comparable peers. A key mechanism is reduced skill development -- limited playing opportunities hinder subsequent growth in offensive performance. At the team level, greater dependence on superstars raises immediate productivity but magnifies performance declines after their departure, indicating a trade-off between short-term success and long-term adaptability. Overall, the findings suggest that while concentrating key roles in top performers boosts output in the short run, it can restrict others' development and retention. Similar dynamics may arise in other organizations that rely heavily on a few exceptional individuals.

Alignment Debt: The Hidden Work of Making AI Usable

YUX Design

Why we think this paper is great for you:
This paper sheds light on the often-overlooked challenges and 'hidden work' required to make AI systems truly usable and adaptable in diverse contexts. It offers a critical perspective for managing AI deployment and user experience.

Rate paper: 👍 👎 ♥ Save

Abstract
Frontier LLMs are optimised around high-resource assumptions about language, knowledge, devices, and connectivity. Whilst widely accessible, they often misfit conditions in the Global South. As a result, users must often perform additional work to make these systems usable. We term this alignment debt: the user-side burden that arises when AI systems fail to align with cultural, linguistic, infrastructural, or epistemic contexts. We develop and validate a four-part taxonomy of alignment debt through a survey of 411 AI users in Kenya and Nigeria. Among respondents measurable on this taxonomy (n = 385), prevalence is: Cultural and Linguistic (51.9%), Infrastructural (43.1%), Epistemic (33.8%), and Interaction (14.0%). Country comparisons show a divergence in Infrastructural and Interaction debt, challenging one-size-fits-Africa assumptions. Alignment debt is associated with compensatory labour, but responses vary by debt type: users facing Epistemic challenges verify outputs at significantly higher rates (91.5% vs. 80.8%; p = 0.037), and verification intensity correlates with cumulative debt burden (Spearmans rho = 0.147, p = 0.004). In contrast, Infrastructural and Interaction debts show weak or null associations with verification, indicating that some forms of misalignment cannot be resolved through verification alone. These findings show that fairness must be judged not only by model metrics but also by the burden imposed on users at the margins, compelling context-aware safeguards that alleviate alignment debt in Global South settings. The alignment debt framework provides an empirically grounded way to measure user burden, informing both design practice and emerging African AI governance efforts.

Decoding the Configuration of AI Coding Agents: Insights from Claude Code Projects

UFMG

Why we think this paper is great for you:
You will gain practical insights into configuring and optimizing AI coding agents to enhance productivity in software engineering tasks. This is highly relevant for leveraging AI within your technical teams.

Rate paper: 👍 👎 ♥ Save

Abstract
Agentic code assistants are a new generation of AI systems capable of performing end-to-end software engineering tasks. While these systems promise unprecedented productivity gains, their behavior and effectiveness depend heavily on configuration files that define architectural constraints, coding practices, and tool usage policies. However, little is known about the structure and content of these configuration artifacts. This paper presents an empirical study of the configuration ecosystem of Claude Code, one of the most widely used agentic coding systems. We collected and analyzed 328 configuration files from public Claude Code projects to identify (i) the software engineering concerns and practices they specify and (ii) how these concerns co-occur within individual files. The results highlight the importance of defining a wide range of concerns and practices in agent configuration files, with particular emphasis on specifying the architecture the agent should follow.

AI-Powered Data Visualization Platform: An Intelligent Web Application for Automated Dataset Analysis

Presidency University

Why we think this paper is great for you:
This paper presents a practical application of AI for automating data analysis and generating interactive visualizations. It offers a direct example of AI's impact on data science workflows.

Rate paper: 👍 👎 ♥ Save

Abstract
An AI-powered data visualization platform that automates the entire data analysis process, from uploading a dataset to generating an interactive visualization. Advanced machine learning algorithms are employed to clean and preprocess the data, analyse its features, and automatically select appropriate visualizations. The system establishes the process of automating AI-based analysis and visualization from the context of data-driven environments, and eliminates the challenge of time-consuming manual data analysis. The combination of a Python Flask backend to access the dataset, paired with a React frontend, provides a robust platform that automatically interacts with Firebase Cloud Storage for numerous data processing and data analysis solutions and real-time sources. Key contributions include automatic and intelligent data cleaning, with imputation for missing values, and detection of outliers, via analysis of the data set. AI solutions to intelligently select features, using four different algorithms, and intelligent title generation and visualization are determined by the attributes of the dataset. These contributions were evaluated using two separate datasets to assess the platform's performance. In the process evaluation, the initial analysis was performed in real-time on datasets as large as 100000 rows, while the cloud-based demand platform scales to meet requests from multiple users and processes them simultaneously. In conclusion, the cloud-based data visualization application allowed for a significant reduction of manual inputs to the data analysis process while maintaining a high quality, impactful visual outputs, and user experiences

Cortex AISQL: A Production SQL Engine for Unstructured Data

Snowflake Inc

Why we think this paper is great for you:
This paper introduces a production-ready SQL engine that integrates semantic reasoning for querying both structured and unstructured data. It's highly relevant for advancing your data science engineering capabilities.

Rate paper: 👍 👎 ♥ Save

Abstract
Snowflake's Cortex AISQL is a production SQL engine that integrates native semantic operations directly into SQL. This integration allows users to write declarative queries that combine relational operations with semantic reasoning, enabling them to query both structured and unstructured data effortlessly. However, making semantic operations efficient at production scale poses fundamental challenges. Semantic operations are more expensive than traditional SQL operations, possess distinct latency and throughput characteristics, and their cost and selectivity are unknown during query compilation. Furthermore, existing query engines are not designed to optimize semantic operations. The AISQL query execution engine addresses these challenges through three novel techniques informed by production deployment data from Snowflake customers. First, AI-aware query optimization treats AI inference cost as a first-class optimization objective, reasoning about large language model (LLM) cost directly during query planning to achieve 2-8$\times$ speedups. Second, adaptive model cascades reduce inference costs by routing most rows through a fast proxy model while escalating uncertain cases to a powerful oracle model, achieving 2-6$\times$ speedups while maintaining 90-95% of oracle model quality. Third, semantic join query rewriting lowers the quadratic time complexity of join operations to linear through reformulation as multi-label classification tasks, achieving 15-70$\times$ speedups with often improved prediction quality. AISQL is deployed in production at Snowflake, where it powers diverse customer workloads across analytics, search, and content understanding.

Data Science Management

HyProv: Hybrid Provenance Management for Scientific Workflows

Humboldt University

Rate paper: 👍 👎 ♥ Save

Abstract
Provenance plays a crucial role in scientific workflow execution, for instance by providing data for failure analysis, real-time monitoring, or statistics on resource utilization for right-sizing allocations. The workflows themselves, however, become increasingly complex in terms of involved components. Furthermore, they are executed on distributed cluster infrastructures, which makes the real-time collection, integration, and analysis of provenance data challenging. Existing provenance systems struggle to balance scalability, real-time processing, online provenance analytics, and integration across different components and compute resources. Moreover, most provenance solutions are not workflow-aware; by focusing on arbitrary workloads, they miss opportunities for workflow systems where optimization and analysis can exploit the availability of a workflow specification that dictates, to some degree, task execution orders and provides abstractions for physical tasks at a logical level. In this paper, we present HyProv, a hybrid provenance management system that combines centralized and federated paradigms to offer scalable, online, and workflow-aware queries over workflow provenance traces. HyProv uses a centralized component for efficient management of the small and stable workflow-specification-specific provenance, and complements this with federated querying over different scalable monitoring and provenance databases for the large-scale execution logs. This enables low-latency access to current execution data. Furthermore, the design supports complex provenance queries, which we exemplify for the workflow system Airflow in combination with the resource manager Kubernetes. Our experiments indicate that HyProv scales to large workflows, answers provenance queries with sub-second latencies, and adds only modest CPU and memory overhead to the cluster.

AI and Society

Rethinking Science in the Age of Artificial Intelligence

Los Alamos National Lab

Rate paper: 👍 👎 ♥ Save

Abstract
Artificial intelligence (AI) is reshaping how research is conceived, conducted, and communicated across fields from chemistry to biomedicine. This commentary examines how AI is transforming the research workflow. AI systems now help researchers manage the information deluge, filtering the literature, surfacing cross-disciplinary links for ideas and collaborations, generating hypotheses, and designing and executing experiments. These developments mark a shift from AI as a mere computational tool to AI as an active collaborator in science. Yet this transformation demands thoughtful integration and governance. We argue that at this time AI must augment but not replace human judgment in academic workflows such as peer review, ethical evaluation, and validation of results. This paper calls for the deliberate adoption of AI within the scientific practice through policies that promote transparency, reproducibility, and accountability.

Research Automation with AI

OR-R1: Automating Modeling and Solving of Operations Research Optimization Problem via Test-Time Reinforcement Learning

Rate paper: 👍 👎 ♥ Save

Abstract
Optimization modeling and solving are fundamental to the application of Operations Research (OR) in real-world decision making, yet the process of translating natural language problem descriptions into formal models and solver code remains highly expertise intensive. While recent advances in large language models (LLMs) have opened new opportunities for automation, the generalization ability and data efficiency of existing LLM-based methods are still limited, asmost require vast amounts of annotated or synthetic data, resulting in high costs and scalability barriers. In this work, we present OR-R1, a data-efficient training framework for automated optimization modeling and solving. OR-R1 first employs supervised fine-tuning (SFT) to help the model acquire the essential reasoning patterns for problem formulation and code generation from limited labeled data. In addition, it improves the capability and consistency through Test-Time Group Relative Policy Optimization (TGRPO). This two-stage design enables OR-R1 to leverage both scarce labeled and abundant unlabeled data for effective learning. Experiments show that OR-R1 achieves state-of-the-art performance with an average solving accuracy of $67.7\%$, using only $1/10$ the synthetic data required by prior methods such as ORLM, exceeding ORLM's solving accuracy by up to $4.2\%$. Remarkably, OR-R1 outperforms ORLM by over $2.4\%$ with just $100$ synthetic samples. Furthermore, TGRPO contributes an additional $3.1\%-6.4\%$ improvement in accuracy, significantly narrowing the gap between single-attempt (Pass@1) and multi-attempt (Pass@8) performance from $13\%$ to $7\%$. Extensive evaluations across diverse real-world benchmarks demonstrate that OR-R1 provides a robust, scalable, and cost-effective solution for automated OR optimization problem modeling and solving, lowering the expertise and data barriers for industrial OR applications.

AgenticSciML: Collaborative Multi-Agent Systems for Emergent Discovery in Scientific Machine Learning

Brown University

Rate paper: 👍 👎 ♥ Save

Abstract
Scientific Machine Learning (SciML) integrates data-driven inference with physical modeling to solve complex problems in science and engineering. However, the design of SciML architectures, loss formulations, and training strategies remains an expert-driven research process, requiring extensive experimentation and problem-specific insights. Here we introduce AgenticSciML, a collaborative multi-agent system in which over 10 specialized AI agents collaborate to propose, critique, and refine SciML solutions through structured reasoning and iterative evolution. The framework integrates structured debate, retrieval-augmented method memory, and ensemble-guided evolutionary search, enabling the agents to generate and assess new hypotheses about architectures and optimization procedures. Across physics-informed learning and operator learning tasks, the framework discovers solution methods that outperform single-agent and human-designed baselines by up to four orders of magnitude in error reduction. The agents produce novel strategies -- including adaptive mixture-of-expert architectures, decomposition-based PINNs, and physics-informed operator learning models -- that do not appear explicitly in the curated knowledge base. These results show that collaborative reasoning among AI agents can yield emergent methodological innovation, suggesting a path toward scalable, transparent, and autonomous discovery in scientific computing.

AGI: Artificial General Intelligence

Intilligence Foundation Model: A New Perspective to Approach Artificial General Intelligence

Rate paper: 👍 👎 ♥ Save

Abstract
We propose a new perspective for approaching artificial general intelligence (AGI) through an intelligence foundation model (IFM). Unlike existing foundation models (FMs), which specialize in pattern learning within specific domains such as language, vision, or time series, IFM aims to acquire the underlying mechanisms of intelligence by learning directly from diverse intelligent behaviors. Vision, language, and other cognitive abilities are manifestations of intelligent behavior; learning from this broad range of behaviors enables the system to internalize the general principles of intelligence. Based on the fact that intelligent behaviors emerge from the collective dynamics of biological neural systems, IFM consists of two core components: a novel network architecture, termed the state neural network, which captures neuron-like dynamic processes, and a new learning objective, neuron output prediction, which trains the system to predict neuronal outputs from collective dynamics. The state neural network emulates the temporal dynamics of biological neurons, allowing the system to store, integrate, and process information over time, while the neuron output prediction objective provides a unified computational principle for learning these structural dynamics from intelligent behaviors. Together, these innovations establish a biologically grounded and computationally scalable foundation for building systems capable of generalization, reasoning, and adaptive learning across domains, representing a step toward truly AGI.

Deep Learning

Torch-Uncertainty: A Deep Learning Framework for Uncertainty Quantification

ENSTA Paris

Rate paper: 👍 👎 ♥ Save

Abstract
Deep Neural Networks (DNNs) have demonstrated remarkable performance across various domains, including computer vision and natural language processing. However, they often struggle to accurately quantify the uncertainty of their predictions, limiting their broader adoption in critical real-world applications. Uncertainty Quantification (UQ) for Deep Learning seeks to address this challenge by providing methods to improve the reliability of uncertainty estimates. Although numerous techniques have been proposed, a unified tool offering a seamless workflow to evaluate and integrate these methods remains lacking. To bridge this gap, we introduce Torch-Uncertainty, a PyTorch and Lightning-based framework designed to streamline DNN training and evaluation with UQ techniques and metrics. In this paper, we outline the foundational principles of our library and present comprehensive experimental results that benchmark a diverse set of UQ methods across classification, segmentation, and regression tasks. Our library is available at https://github.com/ENSTA-U2IS-AI/Torch-Uncertainty

Deep Neural Operator Learning for Probabilistic Models

University of Michigan

Rate paper: 👍 👎 ♥ Save

Abstract
We propose a deep neural-operator framework for a general class of probability models. Under global Lipschitz conditions on the operator over the entire Euclidean space-and for a broad class of probabilistic models-we establish a universal approximation theorem with explicit network-size bounds for the proposed architecture. The underlying stochastic processes are required only to satisfy integrability and general tail-probability conditions. We verify these assumptions for both European and American option-pricing problems within the forward-backward SDE (FBSDE) framework, which in turn covers a broad class of operators arising from parabolic PDEs, with or without free boundaries. Finally, we present a numerical example for a basket of American options, demonstrating that the learned model produces optimal stopping boundaries for new strike prices without retraining.

We did not find tons of content matching your interests we've included some additional topics that are popular. Also be aware that if the topics is not present in arxiv we wont be able to recommend it.

AI Agents

Towards a Standard, Enterprise-Relevant Agentic AI Benchmark: Lessons from 5.5 billion tokens' worth of agentic AI evaluations

Kamiwaza AI

Rate paper: 👍 👎 ♥ Save

AI Summary

Kamiwaza Agentic Merit Index (KAMI): An enterprise-focused benchmark designed to provide a standard, contamination-resistant measure of real-world, enterprise-relevant agentic AI capability, analogous to the SPEC CPU benchmark. [3]
Traditional LLM benchmarks, including aggregated intelligence indices, are poor predictors of real-world agentic performance in enterprise-relevant tasks due to data contamination and "agentic disconnect." Newer generation LLMs do not consistently outperform older variants on practical enterprise agentic tasks, challenging conventional benchmark-driven model selection and highlighting the need for specialized evaluation. [2]
Implementing agentic AI evaluations requires robust sandboxing and isolated execution environments (e.g., Docker containers) to prevent chaotic LLM behavior from corrupting infrastructure or causing cascading failures. [2]
Reasoning-enabled models can significantly boost accuracy for smaller LLMs but incur substantial costs in terms of token usage and wall-clock time, necessitating careful cost-performance trade-off analysis for enterprise deployment. [2]
Evaluating agentic AI for enterprise adoption should prioritize metrics beyond simple accuracy, including run-to-run reliability (standard deviation, confidence intervals) to account for LLM stochasticity and ensure consistent performance. [2]
A two-stage evaluation approach is recommended for enterprises: an initial rapid screening with fewer trials, followed by a more rigorous, higher-replication evaluation for top candidate models to optimize resource utilization. [2]
The PICARD framework effectively combats benchmark data contamination by randomizing variables and sandbox data, ensuring LLMs cannot memorize test items and enabling true agentic assessment. [2]
Agentic Disconnect: The discrepancy where standard LLM benchmarks measure capabilities different from what enterprises require, such as multi-step tool use, decision-making under uncertainty, and real-world task completion. [2]
Benchmark Data Contamination: The phenomenon where benchmark test data is inadvertently or intentionally included in an LLM's training data, leading to inflated benchmark scores that do not reflect genuine capability or real-world performance. [2]

Decoding the Configuration of AI Coding Agents: Insights from Claude Code Projects

UFMG

Rate paper: 👍 👎 ♥ Save

AI and Society

Alignment Debt: The Hidden Work of Making AI Usable

YUX Design

Rate paper: 👍 👎 ♥ Save

Rethinking Science in the Age of Artificial Intelligence

Los Alamos National Lab

Rate paper: 👍 👎 ♥ Save

Research Automation with AI

OR-R1: Automating Modeling and Solving of Operations Research Optimization Problem via Test-Time Reinforcement Learning

Rate paper: 👍 👎 ♥ Save

AgenticSciML: Collaborative Multi-Agent Systems for Emergent Discovery in Scientific Machine Learning

Brown University

Rate paper: 👍 👎 ♥ Save

AGI: Artificial General Intelligence

Intilligence Foundation Model: A New Perspective to Approach Artificial General Intelligence

Rate paper: 👍 👎 ♥ Save

Towards Outcome-Oriented, Task-Agnostic Evaluation of AI Agents

Writer, Inc

Rate paper: 👍 👎 ♥ Save

Deep Learning

Torch-Uncertainty: A Deep Learning Framework for Uncertainty Quantification

ENSTA Paris

Rate paper: 👍 👎 ♥ Save

Deep Neural Operator Learning for Probabilistic Models

University of Michigan

Rate paper: 👍 👎 ♥ Save

Interests not found

We did not find any papers that match the below interests. Try other terms also consider if the content exists in arxiv.org.

Engineering Management
AI for Data Science Engineering
AI for Data Science Management
Data Science Engineering Management
Managing teams of data scientists

You can edit or add more interests any time.

Help us improve your experience!

This project is on its early stages your feedback can be pivotal on the future of the project. Let us know what you think about this week's papers and suggestions!

Give Feedback