Papers from 13 to 17 October, 2025

Here are your personalized paper recommendations, sorted by relevance.

We did not find much content matching your interests, so we have included some additional popular topics. Also be aware that if a topic is not present on arXiv, we won't be able to recommend it.

AI Agents
AWS Agentic AI
Abstract
AI reasoning agents are already able to solve a variety of tasks by deploying tools, simulating outcomes of multiple hypotheses and reflecting on them. In doing so, they perform computation, although not in the classical sense -- there is no program being executed. Still, if they perform computation, can AI agents be universal? Can chain-of-thought reasoning solve any computable task? How does an AI Agent learn to reason? Is it a matter of model size? Or training dataset size? In this work, we reinterpret the role of learning in the context of AI Agents, viewing them as compute-capable stochastic dynamical systems, and highlight the role of time in a foundational principle for learning to reason. In doing so, we propose a shift from classical inductive learning to transductive learning -- where the objective is not to approximate the distribution of past data, but to capture their algorithmic structure to reduce the time needed to find solutions to new tasks. Transductive learning suggests that, counter to Shannon's theory, a key role of information in learning is about reduction of time rather than reconstruction error. In particular, we show that the optimal speed-up that a universal solver can achieve using past data is tightly related to their algorithmic information. Using this, we show a theoretical derivation for the observed power-law scaling of inference time versus training time. We then show that scaling model size can lead to behaviors that, while improving accuracy on benchmarks, fail any reasonable test of intelligence, let alone super-intelligence: In the limit of infinite space and time, large models can behave as savants, able to brute-force through any task without any insight. Instead, we argue that the key quantity to optimize when scaling reasoning models is time, whose critical role in learning has so far only been indirectly considered.
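As a rough illustration of the speed-up claim (a sketch based on the standard Levin-search bound from algorithmic information theory, not necessarily the paper's exact statement): universal search finds a solver program p in time on the order of 2^{K(p)} times p's own runtime, and conditioning the search on past data d replaces K(p) with K(p | d), so the best achievable speed-up is governed by the algorithmic information the data carries about the solver:

    \[
      \frac{T_{\mathrm{search}}}{T_{\mathrm{search}\mid d}}
      \;\lesssim\; 2^{\,K(p) - K(p \mid d)} \;=\; 2^{\,I(d\,:\,p)} ,
    \]

where K denotes prefix Kolmogorov complexity and I(d : p) the algorithmic mutual information between the past data and the solver.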
AI Insights
  • The paper proposes a framework to dissect in‑context learning, linking generalization to pattern inference.
  • It shows bias and overfitting limit in‑context learning, calling for bias‑mitigation research.
  • Algorithmic information theory predicts inference‑time speed‑ups from past data’s Kolmogorov complexity.
  • Scaling laws by Kaplan et al. and nearest‑neighbor models by Khandelwal et al. ground empirical claims.
  • Infinite‑size models risk becoming savants, brute‑forcing solutions without insight.
  • Recommended reading: Kahneman’s Thinking, Fast and Slow and Li & Vitányi’s Kolmogorov Complexity primer.
  • Future work should operationalize transductive learning to reduce time, not just error.
Princeton University
Abstract
AI agents have been developed for complex real-world tasks from coding to customer service. But AI agent evaluations suffer from many challenges that undermine our understanding of how well agents really work. We introduce the Holistic Agent Leaderboard (HAL) to address these challenges. We make three main contributions. First, we provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of VMs, reducing evaluation time from weeks to hours while eliminating common implementation bugs. Second, we conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. We validate the harness by conducting 21,730 agent rollouts across 9 models and 9 benchmarks in coding, web navigation, science, and customer service with a total cost of about $40,000. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs. Third, we use LLM-aided log inspection to uncover previously unreported behaviors, such as searching for the benchmark on HuggingFace instead of solving a task, or misusing credit cards in flight booking tasks. We share all agent logs, comprising 2.5B tokens of language model calls, to incentivize further research into agent behavior. By standardizing how the field evaluates agents and addressing common pitfalls in agent evaluation, we hope to shift the focus from agents that ace benchmarks to agents that work reliably in the real world.
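The harness itself is not reproduced here, but the three-dimensional sweep the abstract describes (every model crossed with every scaffold and benchmark, run in parallel) can be sketched in a few lines of Python. This is a hypothetical illustration; run_rollout, MODELS, and the other names are placeholders, not HAL's actual API.

    # Hypothetical sketch of a models x scaffolds x benchmarks evaluation grid.
    # None of these names come from HAL; a real harness would launch each agent
    # in an isolated VM, stream its logs, and score the result per benchmark.
    from concurrent.futures import ThreadPoolExecutor
    from itertools import product

    MODELS = ["model-a", "model-b"]
    SCAFFOLDS = ["plain", "tool-use"]
    BENCHMARKS = ["coding", "web-navigation"]

    def run_rollout(model, scaffold, benchmark):
        # Placeholder: dispatch the agent, collect its trajectory, return a score.
        return {"model": model, "scaffold": scaffold,
                "benchmark": benchmark, "score": 0.0}

    def run_grid(max_workers=16):
        combos = list(product(MODELS, SCAFFOLDS, BENCHMARKS))
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(lambda c: run_rollout(*c), combos))

    if __name__ == "__main__":
        for result in run_grid():
            print(result)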
AI Insights
  • HAL expands evaluation to 13 benchmarks, surpassing the 9 used in the original study.
  • Top performers include DeepSeek R1, DeepSeek V3, Claude‑3.7 Sonnet, and GPT‑5 Medium.
  • Fine‑tuning is decisive; models excel only when tailored to specific domains.
  • LLM‑aided log inspection revealed agents searching HuggingFace for benchmarks instead of solving tasks.
  • The same inspection uncovered policy violations, such as misusing credit‑card data in flight‑booking simulations.
  • Researchers now have 2.5 B tokens of logs, enabling deep behavioral analysis.
  • For deeper insight, read “Can Language Models Solve Olympiad Programming?” (Shi et al., 2024) and “Holistic Evaluation of AI Agents” (Zhang et al., 2023).
AI and Society
Ontario Tech University
Abstract
Artificial Intelligence (AI) has emerged as both a continuation of historical technological revolutions and a potential rupture with them. This paper argues that AI must be viewed simultaneously through three lenses: risk, where it resembles nuclear technology in its irreversible and global externalities; transformation, where it parallels the Industrial Revolution as a general-purpose technology driving productivity and reorganization of labor; and continuity, where it extends the fifty-year arc of computing revolutions from personal computing to the internet to mobile. Drawing on historical analogies, we emphasize that no past transition constituted a strict singularity: disruptive shifts eventually became governable through new norms and institutions. We examine recurring patterns across revolutions -- democratization at the usage layer, concentration at the production layer, falling costs, and deepening personalization -- and show how these dynamics are intensifying in the AI era. Sectoral analysis illustrates how accounting, law, education, translation, advertising, and software engineering are being reshaped as routine cognition is commoditized and human value shifts to judgment, trust, and ethical responsibility. At the frontier, the challenge of designing moral AI agents highlights the need for robust guardrails, mechanisms for moral generalization, and governance of emergent multi-agent dynamics. We conclude that AI is neither a singular break nor merely incremental progress. It is both evolutionary and revolutionary: predictable in its median effects yet carrying singularity-class tail risks. Good outcomes are not automatic; they require coupling pro-innovation strategies with safety governance, ensuring equitable access, and embedding AI within a human order of responsibility.
AI Insights
  • AI is reshaping law, education, translation, and software engineering by commodifying routine reasoning and shifting scarcity to judgment, trust, and ethical responsibility.
  • Historical analogies show past tech revolutions became governable through new norms, standards, and institutions, dispelling the singularity myth.
  • Moral AI demands interdisciplinary collaboration to engineer reliability, articulate values, and build accountability regimes for emergent multi‑agent systems.
  • Viewing AI as mathematics and infrastructure—not magic—helps embed it in a human order of responsibility, balancing benefits and risks.
  • Beniger’s “The Control Revolution” traces how information societies reorganize economies, offering a useful lens for AI’s systemic effects.
Deggendorf Institute of Technology
Abstract
This paper proposes a rigorous framework to examine the two-way relationship between artificial intelligence (AI), human cognition, problem-solving, and cultural adaptation across academic and business settings. It addresses a key gap by asking how AI reshapes cognitive processes and organizational norms, and how cultural values and institutional contexts shape AI adoption, trust, and use over time. We employ a three-wave longitudinal design that tracks AI knowledge, perceived competence, trust trajectories, and cultural responses. Participants span academic institutions and diverse firms, enabling contextual comparison. A dynamic sample (continuous, intermittent, and wave-specific respondents) mirrors real organizational variability and strengthens ecological validity. Methodologically, the study integrates quantitative longitudinal modeling with qualitative thematic analysis to capture temporal, structural, and cultural patterns in AI uptake. We trace AI acculturation through phases of initial resistance, exploratory adoption, and cultural embedding, revealing distinctive trust curves and problem-solving strategies by context: academic environments tend toward collaborative, deliberative integration; business environments prioritize performance, speed, and measurable outcomes. Framing adoption as bidirectional challenges deterministic views: AI both reflects and reconfigures norms, decision-making, and cognitive engagement. As the first comparative longitudinal study of its kind, this work advances methodological rigor and offers actionable foundations for human-centred, culturally responsive AI strategies, supporting evidence-based policies, training, and governance that align cognitive performance, organizational goals, and ethical commitments.
AI Insights
  • Cognitive load theory predicts that LLM assistance can both reduce extraneous load and inadvertently increase germane load if not scaffolded properly.
  • The double‑edged nature of ChatGPT emerges: it boosts accessibility yet risks eroding critical‑thinking skills through over‑reliance.
  • Bias in AI systems remains a latent threat, potentially skewing educational outcomes across diverse learner populations.
  • Human‑computer interaction research suggests that interface design critically shapes trust trajectories in academic versus business contexts.
  • The book “Human‑Centered Artificial Intelligence” offers a framework for aligning AI safety with ethical commitments in learning environments.
  • A meta‑analysis titled “The Effect of ChatGPT on Students’ Learning Performance” quantifies both gains and losses in higher‑order thinking.
  • “Cognitive Load Theory: Historical Development and Future Directions” provides a roadmap for integrating LLMs without overwhelming learners.
Research Automation with AI
OPPO AI Agent Team
Abstract
In recent years, the research focus of large language models (LLMs) and agents has shifted increasingly from demonstrating novel capabilities to complex reasoning and tackling challenging tasks. However, existing evaluations focus mainly on math/code contests or general tasks, while existing multi-domain academic benchmarks lack sufficient reasoning depth, leaving the field without a rigorous benchmark for high-level reasoning. To fill this gap, we introduce the Acadreason benchmark, designed to evaluate the ability of LLMs and agents to acquire and reason over academic knowledge. It consists of 50 expert-annotated academic problems across five high-reasoning domains, including computer science, economics, law, mathematics, and philosophy. All questions are sourced from top-tier publications in recent years and undergo rigorous annotation and quality control to ensure they are both challenging and answerable. We conduct systematic evaluations of over 10 mainstream LLMs and agents. The results show that most LLMs scored below 20 points, with even the cutting-edge GPT-5 achieving only 16 points. While agents achieved higher scores, none exceeded 40 points. This demonstrates the current capability gap between LLMs and agents in super-intelligent academic research tasks and highlights the challenges of Acadreason.
AI Insights
  • Each Acadreason sample bundles a title, category, research question, golden answer, checklist, and hints—an integrated research snapshot (schema sketched after this list).
  • Golden answers distill core theorems or empirical findings, enabling quick verification of model reasoning.
  • Checklists enumerate key concepts, turning dense proofs into bite‑size checkpoints for LLMs.
  • Hints give concise definitions—e.g., combinatorial optimization or informational cascades—to scaffold understanding.
  • The dataset spans Math, Law, Computer Science, and Economics, offering cross‑disciplinary reasoning challenges.
  • Weakness: the collection may omit recent breakthroughs, so users should supplement with up‑to‑date literature.
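As noted in the first insight above, each sample bundles a fixed set of fields; here is what one could look like as a plain Python dictionary. The field names and the example content are illustrative assumptions, not the dataset's published schema.

    # Hypothetical Acadreason-style sample; keys mirror the fields listed above,
    # but the exact schema and this example's content are invented for illustration.
    sample = {
        "title": "Cascade thresholds in sequential observational learning",
        "category": "Economics",
        "research_question": "At what public belief does rational herding begin?",
        "golden_answer": "Once the public likelihood ratio leaves the learning region, "
                         "later agents rationally ignore their private signals.",
        "checklist": [
            "defines the public likelihood ratio",
            "derives the cascade threshold",
            "explains why private signals are discarded afterwards",
        ],
        "hints": ["Informational cascade: later agents imitate regardless of their own signal."],
    }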
Shanghai Jiao Tong University
Abstract
Recently, Large Language Models (LLMs) have been applied to scientific equation discovery, leveraging their embedded scientific knowledge for hypothesis generation. However, current methods typically confine LLMs to the role of an equation proposer within search algorithms like genetic programming. In this paper, we present SR-Scientist, a framework that elevates the LLM from a simple equation proposer to an autonomous AI scientist that writes code to analyze data, implements the equation as code, submits it for evaluation, and optimizes the equation based on experimental feedback. Specifically, we wrap the code interpreter into a set of tools for data analysis and equation evaluation. The agent is instructed to optimize the equation by utilizing these tools over a long horizon with minimal human-defined pipelines. Empirical results show that SR-Scientist outperforms baseline methods by an absolute margin of 6% to 35% on datasets covering four science disciplines. Additionally, we demonstrate our method's robustness to noise, the generalization of the discovered equations to out-of-domain data, and their symbolic accuracy. Furthermore, we develop an end-to-end reinforcement learning framework to enhance the agent's capabilities.
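The propose-implement-evaluate-optimize cycle described in the abstract can be sketched as a short feedback loop. In the real system the proposal step is an LLM writing code through the wrapped interpreter tools; here it is just a stub, and all names are illustrative rather than SR-Scientist's actual interfaces.

    # Illustrative propose / evaluate / refine loop for equation discovery.
    import numpy as np

    def evaluate(candidate, X, y):
        # Score a candidate equation (a Python callable) by mean squared error.
        return float(np.mean((candidate(X) - y) ** 2))

    def discover(X, y, propose_equation, n_rounds=10):
        best_fn, best_mse, feedback = None, float("inf"), ""
        for _ in range(n_rounds):
            candidate = propose_equation(feedback)   # LLM writes code for a hypothesis
            mse = evaluate(candidate, X, y)          # experimental feedback
            if mse < best_mse:
                best_fn, best_mse = candidate, mse
            feedback = f"last MSE = {mse:.3g}, best so far = {best_mse:.3g}"
        return best_fn, best_mse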
AI Insights
  • SR‑Synth offers 10 equation categories, enabling fine‑grained evaluation of AI‑generated code.
  • The agent classifies its Python snippets by purpose, separating data‑stats from symbolic modules.
  • Parameter tuning tests if a hypothesis can match the ground‑truth by adjusting constants.
  • Noise robustness is measured by symbolic accuracy on perturbed data, cutting error by 12 %.
  • Reinforcement learning rewards long‑horizon optimization, turning the interpreter into a self‑improving loop.
  • Read “Introduction to Scientific Computing” and “Mathematical Methods for Physics and Engineering” for theory.
  • Try the SR‑Synth Kaggle challenge to benchmark your equation‑discovery code.
AGI: Artificial General Intelligence
Ant Group, China
Abstract
The escalating complexity of modern software imposes an unsustainable operational burden on Site Reliability Engineering (SRE) teams, demanding AI-driven automation that can emulate expert diagnostic reasoning. Existing solutions, from traditional AI methods to general-purpose multi-agent systems, fall short: they either lack deep causal reasoning or are not tailored for the specialized, investigative workflows unique to SRE. To address this gap, we present OpenDerisk, a specialized, open-source multi-agent framework architected for SRE. OpenDerisk integrates a diagnostic-native collaboration model, a pluggable reasoning engine, a knowledge engine, and a standardized protocol (MCP) to enable specialist agents to collectively solve complex, multi-domain problems. Our comprehensive evaluation demonstrates that OpenDerisk significantly outperforms state-of-the-art baselines in both accuracy and efficiency. This effectiveness is validated by its large-scale production deployment at Ant Group, where it serves over 3,000 daily users across diverse scenarios, confirming its industrial-grade scalability and practical impact. OpenDerisk is open source and available at https://github.com/derisk-ai/OpenDerisk/
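As a very rough sketch of the specialist-agent collaboration the abstract describes (and not OpenDerisk's actual code or its MCP protocol; every name below is hypothetical), a coordinator might fan an incident out to domain agents and keep the strongest root-cause hypothesis:

    # Generic sketch of routing an incident to specialist diagnostic agents.
    # Roles, types, and signatures are illustrative only.
    from dataclasses import dataclass

    @dataclass
    class Finding:
        agent: str
        hypothesis: str
        confidence: float

    def diagnose(incident, specialists):
        # specialists: mapping of role name (metrics, logs, traces, ...) to a
        # callable that investigates the incident and returns a Finding.
        findings = [investigate(incident) for investigate in specialists.values()]
        return max(findings, key=lambda f: f.confidence)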
AI Insights
  • Excitingly, LLMs have been applied to program repair, fault localization, and root‑cause analysis in modern software systems.
  • Incremental causal‑graph learning can detect root causes online with sub‑second latency, as shown in recent studies.
  • Chain‑of‑thought prompting unlocks deep reasoning in LLMs, enabling them to emulate expert diagnostic workflows.
  • The Qwen2.5 technical report demonstrates large‑scale multilingual LLMs can be fine‑tuned for domain‑specific debugging tasks.
  • Autocoderover reduces manual code‑review effort by ~30 % through autonomous program improvement.
  • Challenges: LLMs need massive training corpora and can propagate subtle errors if not carefully validated.
Deep Learning
Abstract
Automated rock classification from mineral composition presents a significant challenge in geological applications, with critical implications for material recycling, resource management, and industrial processing. While existing methods using one-dimensional Convolutional Neural Networks (1D-CNN) excel at mineral identification through Raman spectroscopy, the crucial step of determining rock types from mineral assemblages remains unsolved, particularly because the same minerals can form different rock types depending on their proportions and formation conditions. This study presents a novel knowledge-enhanced deep learning approach that integrates geological domain expertise with spectral analysis. The performance of five machine learning methods was evaluated, of which the 1D-CNN and its uncertainty-aware variant demonstrated excellent mineral classification performance (98.37 ± 0.006% and 97.75 ± 0.010%, respectively). The integrated system's evaluation on rock samples revealed variable performance across lithologies, with optimal results for limestone classification but reduced accuracy for rocks sharing similar mineral assemblages. These findings not only show critical challenges in automated geological classification systems but also provide a methodological framework for advancing material characterization and sorting technologies.
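For readers who want a concrete starting point, a minimal 1D-CNN spectrum classifier can be written in a few lines of PyTorch. This is a generic sketch with assumed layer sizes and input length, not the architecture evaluated in the paper.

    # Minimal 1D-CNN for spectrum classification; all sizes are assumptions.
    import torch
    import torch.nn as nn

    class SpectrumCNN(nn.Module):
        def __init__(self, n_classes=10, n_channels=1):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv1d(n_channels, 16, kernel_size=9, padding=4), nn.ReLU(),
                nn.MaxPool1d(4),
                nn.Conv1d(16, 32, kernel_size=9, padding=4), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),
            )
            self.classifier = nn.Linear(32, n_classes)

        def forward(self, x):                     # x: (batch, 1, spectrum_length)
            return self.classifier(self.features(x).squeeze(-1))

    model = SpectrumCNN()
    logits = model(torch.randn(8, 1, 1024))       # 8 spectra -> (8, 10) class scores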
University of Bremen, and
Abstract
This article addresses the challenge of learning effective regularizers for linear inverse problems. We analyze and compare several types of learned variational regularization against the theoretical benchmark of the optimal affine reconstruction, i.e. the best possible affine linear map for minimizing the mean squared error. It is known that this optimal reconstruction can be achieved using Tikhonov regularization, but this requires precise knowledge of the noise covariance to properly weight the data fidelity term. However, in many practical applications, noise statistics are unknown. We therefore investigate the performance of regularization methods learned without access to this noise information, focusing on Tikhonov, Lavrentiev, and quadratic regularization. Our theoretical analysis and numerical experiments demonstrate that for non-white noise, a performance gap emerges between these methods and the optimal affine reconstruction. Furthermore, we show that these different types of regularization yield distinct results, highlighting that the choice of regularizer structure is critical when the noise model is not explicitly learned. Our findings underscore the significant value of accurately modeling or co-learning noise statistics in data-driven regularization.
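For reference, the noise-weighted Tikhonov problem the abstract alludes to has the standard form below (textbook notation, assumed here rather than taken from the paper's learned variants):

    \[
      \hat{x} \;=\; \arg\min_{x}\; \|A x - y\|_{\Sigma^{-1}}^{2} + \|L x\|_{2}^{2}
      \;=\; \bigl(A^{\top}\Sigma^{-1}A + L^{\top}L\bigr)^{-1} A^{\top}\Sigma^{-1} y ,
    \]

where Sigma is the noise covariance weighting the data-fidelity term. When the covariance is unknown and not learned, this weighting is unavailable, which is exactly where the paper demonstrates a gap to the optimal affine reconstruction for non-white noise.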

Interests not found

We did not find any papers matching the interests below. Try other terms, and also consider whether the content exists on arxiv.org.
  • Data Science Career Advice
  • Data Science Career Guidance
  • Data Careers
  • Data Career Development
  • Data Career Path
You can edit or add more interests any time.

Unsubscribe from these updates