Hi!

Your personalized paper recommendations for 08 to 12 December, 2025.

🎯 Top Personalized Recommendations

Evolving Excellence: Automated Optimization of LLM-based Agents

TurinTech AI

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Summary

The system's ability to discover non-obvious optimizations through semantic mutations demonstrates the value of evolutionary approaches for natural language components. [3]
The system's ability to discover non-obvious optimizations through semantic mutations demonstrates the value of evolutionary approaches for natural language components. [3]
The mixed results also highlight that automated optimization is not universally beneficial; practitioners should assess their agents' baseline quality and task characteristics before investing in optimization efforts. [3]
Artemis is a framework that uses evolutionary techniques to optimize the performance of agents, particularly those with clear performance metrics and room for improvement. [3]
Artemis is a practical framework for automated agent optimization. [2]
Evolutionary prompt engineering: a method of optimizing the performance of agents by modifying their input prompts. [1]

Abstract
Agentic AI systems built on large language models (LLMs) offer significant potential for automating complex workflows, from software development to customer support. However, LLM agents often underperform due to suboptimal configurations; poorly tuned prompts, tool descriptions, and parameters that typically require weeks of manual refinement. Existing optimization methods either are too complex for general use or treat components in isolation, missing critical interdependencies. We present ARTEMIS, a no-code evolutionary optimization platform that jointly optimizes agent configurations through semantically-aware genetic operators. Given only a benchmark script and natural language goals, ARTEMIS automatically discovers configurable components, extracts performance signals from execution logs, and evolves configurations without requiring architectural modifications. We evaluate ARTEMIS on four representative agent systems: the \emph{ALE Agent} for competitive programming on AtCoder Heuristic Contest, achieving a \textbf{$13.6\%$ improvement} in acceptance rate; the \emph{Mini-SWE Agent} for code optimization on SWE-Perf, with a statistically significant \textbf{10.1\% performance gain}; and the \emph{CrewAI Agent} for cost and mathematical reasoning on Math Odyssey, achieving a statistically significant \textbf{$36.9\%$ reduction} in the number of tokens required for evaluation. We also evaluate the \emph{MathTales-Teacher Agent} powered by a smaller open-source model (Qwen2.5-7B) on GSM8K primary-level mathematics problems, achieving a \textbf{22\% accuracy improvement} and demonstrating that ARTEMIS can optimize agents based on both commercial and local models.

Why we think this paper is great for you:
This paper directly addresses the core interest in LLMs for AI agents, focusing on optimization – a key area for improving their performance and capabilities. It offers insights into tuning LLM agents, aligning with the user's focus on enhancing agentic AI systems.

How Do LLMs Fail In Agentic Scenarios? A Qualitative Analysis of Success and Failure Scenarios of Various LLMs in Agentic Simulations

Kamiwaza AI

Rate paper: 👍 👎 ♥ Save

AI Summary

Granite 4 Small shows resilience and adaptability in database/SQLite tasks, but struggles with CSV tasks Model's success rates vary across different question series (400-500) Common failure patterns include constraint abandonment, schema-guessing loop, SQL semantic misinterpretation, and resignation after schema retrieval Agentic AI deployment engineers should design tools that make it obvious for agents which to choose for particular scenarios Hits behavior can be mitigated by including explicit instructions in the system prompt or tool definitions/task instructions Granite 4 Small's performance highlights the importance of designing intuitive and user-friendly interfaces for AI models Further research is needed to understand the model's strengths and weaknesses, as well as to develop strategies for improving its performance Constraint abandonment Schema-guessing loop SQL semantic misinterpretation [2]

Abstract
We investigate how large language models (LLMs) fail when operating as autonomous agents with tool-use capabilities. Using the Kamiwaza Agentic Merit Index (KAMI) v0.1 benchmark, we analyze 900 execution traces from three representative models - Granite 4 Small, Llama 4 Maverick, and DeepSeek V3.1 - across filesystem, text extraction, CSV analysis, and SQL scenarios. Rather than focusing on aggregate scores, we perform fine-grained, per-trial behavioral analysis to surface the strategies that enable successful multi-step tool execution and the recurrent failure modes that undermine reliability. Our findings show that model scale alone does not predict agentic robustness: Llama 4 Maverick (400B) performs only marginally better than Granite 4 Small (32B) in some uncertainty-driven tasks, while DeepSeek V3.1's superior reliability derives primarily from post-training reinforcement learning rather than architecture or size. Across models, we identify four recurring failure archetypes: premature action without grounding, over-helpfulness that substitutes missing entities, vulnerability to distractor-induced context pollution, and fragile execution under load. These patterns highlight the need for agentic evaluation methods that emphasize interactive grounding, recovery behavior, and environment-aware adaptation, suggesting that reliable enterprise deployment requires not just stronger models but deliberate training and design choices that reinforce verification, constraint discovery, and adherence to source-of-truth data.

Why we think this paper is great for you:
This research investigates the specific failures of LLMs within agentic contexts, providing a valuable understanding of the challenges inherent in building effective agentic AI. Understanding these failures is crucial for developing robust and reliable agentic systems.

Architectures for Building Agentic AI

Halmstad University

Rate paper: 👍 👎 ♥ Save

AI Summary

Multi-agent systems exchange a single 'do-everything' agent for a team of specialised agents that co-operate (or compete) under explicit protocols. [3]
Planning- and self-improvement agents: A class of AI systems that use search and optimization techniques to solve complex problems. [3]
Embodied and web agents: AI systems that act in the world, either physically (embodied) or through interactions with untrusted websites and enterprise systems (web). [3]
Planning- and self-improvement agents can be prone to state explosion, speculative arithmetic errors, and over-confident selection. [3]
Planning- and self-improvement agents deliver substantial reliability dividends when their power is channelled through explicit controllers, trustworthy verifiers, and disciplined governance of cost and side-effects. [2]

Abstract
This chapter argues that the reliability of agentic and generative AI is chiefly an architectural property. We define agentic systems as goal-directed, tool-using decision makers operating in closed loops, and show how reliability emerges from principled componentisation (goal manager, planner, tool-router, executor, memory, verifiers, safety monitor, telemetry), disciplined interfaces (schema-constrained, validated, least-privilege tool calls), and explicit control and assurance loops. Building on classical foundations, we propose a practical taxonomy-tool-using agents, memory-augmented agents, planning and self-improvement agents, multi-agent systems, and embodied or web agents - and analyse how each pattern reshapes the reliability envelope and failure modes. We distil design guidance on typed schemas, idempotency, permissioning, transactional semantics, memory provenance and hygiene, runtime governance (budgets, termination conditions), and simulate-before-actuate safeguards.

Why we think this paper is great for you:
The paper’s emphasis on architectural properties of agentic systems aligns directly with the user’s interest in building reliable AI agents. It highlights the importance of a structured approach to agent design, a foundational concept for effective agentic AI.

The Adoption and Usage of AI Agents: Early Evidence from Perplexity

Perplexity

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Summary

The agent is used primarily for productivity-related tasks (36% of all queries), followed by learning, media, and shopping. [3]
Research, document editing, and shopping-related tasks appear consistently across occupation clusters. [3]
Knowledge-intensive sectors like digital technology, entrepreneurship, finance, and academia tend to use the agent for research and learning-related tasks. [3]
Productivity and learning topics are the most sticky, while travel is the least sticky. [2]
Users' first queries often fall into productivity, learning, or media topics, but over time, there's a shift towards more cognitively oriented use cases. [1]

Abstract
This paper presents the first large-scale field study of the adoption, usage intensity, and use cases of general-purpose AI agents operating in open-world web environments. Our analysis centers on Comet, an AI-powered browser developed by Perplexity, and its integrated agent, Comet Assistant. Drawing on hundreds of millions of anonymized user interactions, we address three fundamental questions: Who is using AI agents? How intensively are they using them? And what are they using them for? Our findings reveal substantial heterogeneity in adoption and usage across user segments. Earlier adopters, users in countries with higher GDP per capita and educational attainment, and individuals working in digital or knowledge-intensive sectors -- such as digital technology, academia, finance, marketing, and entrepreneurship -- are more likely to adopt or actively use the agent. To systematically characterize the substance of agent usage, we introduce a hierarchical agentic taxonomy that organizes use cases across three levels: topic, subtopic, and task. The two largest topics, Productivity & Workflow and Learning & Research, account for 57% of all agentic queries, while the two largest subtopics, Courses and Shopping for Goods, make up 22%. The top 10 out of 90 tasks represent 55% of queries. Personal use constitutes 55% of queries, while professional and educational contexts comprise 30% and 16%, respectively. In the short term, use cases exhibit strong stickiness, but over time users tend to shift toward more cognitively oriented topics. The diffusion of increasingly capable AI agents carries important implications for researchers, businesses, policymakers, and educators, inviting new lines of inquiry into this rapidly emerging class of AI capabilities.

Why we think this paper is great for you:
This study examines the real-world adoption and usage of AI agents, offering valuable insights into how these systems are being deployed and utilized. Understanding current usage patterns is essential for informed development and future research.

Deconstructing the Dual Black Box:A Plug-and-Play Cognitive Framework for Human-AI Collaborative Enhancement and Its Implications for AI Governance

Northeastern University

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Summary

RAMTN系统是一种基于元交互的人机协作认知增强范式，旨在通过提取专家决策框架来实现智能辅助和知识共享。该系统的核心思想是将人类专家的认知过程与计算机系统的信息处理能力结合起来，从而实现高效的决策支持和知识推理。 RAMTN系统的应用领域包括投资、医疗和教育等多个领域，旨在通过提取专家决策框架来提高决策准确性和效率。元交互（Meta-Interaction）：一种将人类认知过程与计算机系统信息处理能力结合起来的技术，旨在实现高效的决策支持和知识推理。人机协作认知增强范式（Human-Machine Collaborative Cognition Enhancement Paradigm）：一种基于元交互的框架，旨在通过提取专家决策框架来实现智能辅助和知识共享。 RAMTN系统是一种创新性的解决方案，旨在通过提取专家决策框架来提高决策准确性和效率。该系统的应用领域包括投资、医疗和教育等多个领域，具有广泛的潜力和前景。该系统的开发和应用依赖于大量的数据和信息资源，可能存在数据质量和可靠性的问题。该系统的安全性和隐私保护需要进一步研究和解决。元交互技术在决策支持和知识推理领域有广泛的应用和研究。 [3]

Abstract
Currently, there exists a fundamental divide between the "cognitive black box" (implicit intuition) of human experts and the "computational black box" (untrustworthy decision-making) of artificial intelligence (AI). This paper proposes a new paradigm of "human-AI collaborative cognitive enhancement," aiming to transform the dual black boxes into a composable, auditable, and extensible "functional white-box" system through structured "meta-interaction." The core breakthrough lies in the "plug-and-play cognitive framework"--a computable knowledge package that can be extracted from expert dialogues and loaded into the Recursive Adversarial Meta-Thinking Network (RAMTN). This enables expert thinking, such as medical diagnostic logic and teaching intuition, to be converted into reusable and scalable public assets, realizing a paradigm shift from "AI as a tool" to "AI as a thinking partner." This work not only provides the first engineering proof for "cognitive equity" but also opens up a new path for AI governance: constructing a verifiable and intervenable governance paradigm through "transparency of interaction protocols" rather than prying into the internal mechanisms of models. The framework is open-sourced to promote technology for good and cognitive inclusion. This paper is an independent exploratory research conducted by the author. All content presented, including the theoretical framework (RAMTN), methodology (meta-interaction), system implementation, and case validation, constitutes the author's individual research achievements.

Why we think this paper is great for you:
The paper’s focus on bridging the gap between human and AI cognitive processes – a ‘dual black box’ – is highly relevant to building trustworthy and collaborative AI agents. This approach addresses a critical challenge in the field.

Advancing Mathematical Research via Human-AI Interactive Theorem Proving

Peking University

Rate paper: 👍 👎 ♥ Save

AI Summary

Previous research has shown that human-AI collaboration can improve performance in various tasks, including theorem discovery and proof verification. [3]
The collaboration between human experts and an LLM is organized into three stages, starting from an informal conjecture and ending with a precise theorem and proof. [2]
Human-AI collaboration can significantly improve mathematical proof and theorem discovery. [1]

Abstract
We investigate how large language models can be used as research tools in scientific computing while preserving mathematical rigor. We propose a human-in-the-loop workflow for interactive theorem proving and discovery with LLMs. Human experts retain control over problem formulation and admissible assumptions, while the model searches for proofs or contradictions, proposes candidate properties and theorems, and helps construct structures and parameters that satisfy explicit constraints, supported by numerical experiments and simple verification checks. Experts treat these outputs as raw material, further refine them, and organize the results into precise statements and rigorous proofs. We instantiate this workflow in a case study on the connection between manifold optimization and Grover's quantum search algorithm, where the pipeline helps identify invariant subspaces, explore Grover-compatible retractions, and obtain convergence guarantees for the retraction-based gradient method. The framework provides a practical template for integrating large language models into frontier mathematical research, enabling faster exploration of proof space and algorithm design while maintaining transparent reasoning responsibilities. Although illustrated on manifold optimization problems in quantum computing, the principles extend to other core areas of scientific computing.

Why we think this paper is great for you:
This research explores the use of LLMs as tools in scientific computing, particularly in interactive theorem proving – a sophisticated technique. It’s a promising area for leveraging LLMs to enhance mathematical research.

Towards Foundation Models with Native Multi-Agent Intelligence

Rate paper: 👍 👎 ♥ Save

Abstract
Foundation models (FMs) are increasingly assuming the role of the "brain" of AI agents. While recent efforts have begun to equip FMs with native single-agent abilities -- such as GUI interaction or integrated tool use -- we argue that the next frontier is endowing FMs with native multi-agent intelligence. We identify four core capabilities of FMs in multi-agent contexts: understanding, planning, efficient communication, and adaptation. Contrary to assumptions about the spontaneous emergence of such abilities, we provide extensive empirical evidence across 41 large language models showing that strong single-agent performance alone does not automatically yield robust multi-agent intelligence. To address this gap, we outline key research directions -- spanning dataset construction, evaluation, training paradigms, and safety considerations -- for building FMs with native multi-agent intelligence.

Why we think this paper is great for you:
The paper’s argument for equipping foundation models with native multi-agent intelligence represents a significant step forward in building more capable and adaptable AI systems. This aligns directly with the user's interest in advanced agentic AI.

Research Automation with AI

Kaapana: A Comprehensive Open-Source Platform for Integrating AI in Medical Imaging Research Environments

German Cancer Research

Rate paper: 👍 👎 ♥ Save

Abstract
Developing generalizable AI for medical imaging requires both access to large, multi-center datasets and standardized, reproducible tooling within research environments. However, leveraging real-world imaging data in clinical research environments is still hampered by strict regulatory constraints, fragmented software infrastructure, and the challenges inherent in conducting large-cohort multicentre studies. This leads to projects that rely on ad-hoc toolchains that are hard to reproduce, difficult to scale beyond single institutions and poorly suited for collaboration between clinicians and data scientists. We present Kaapana, a comprehensive open-source platform for medical imaging research that is designed to bridge this gap. Rather than building single-use, site-specific tooling, Kaapana provides a modular, extensible framework that unifies data ingestion, cohort curation, processing workflows and result inspection under a common user interface. By bringing the algorithm to the data, it enables institutions to keep control over their sensitive data while still participating in distributed experimentation and model development. By integrating flexible workflow orchestration with user-facing applications for researchers, Kaapana reduces technical overhead, improves reproducibility and enables conducting large-scale, collaborative, multi-centre imaging studies. We describe the core concepts of the platform and illustrate how they can support diverse use cases, from local prototyping to nation-wide research networks. The open-source codebase is available at https://github.com/kaapana/kaapana

AGI: Artificial General Intelligence

Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale

Meta

Rate paper: 👍 👎 ♥ Save

Abstract
Real-world AI software engineering demands coding agents that can reason over massive repositories, maintain durable memory across and within long sessions, and robustly coordinate complex toolchains at test time. Existing open-source coding agents provide transparency but frequently fall short when pushed to these industrial-scale workloads, while proprietary coding agents offer strong practical performance but limited extensibility, interpretability, and controllability. We present the Confucius Code Agent (CCA), an open-sourced AI software engineer that can operate at an industrial scale. CCA is built atop the Confucius SDK, an open-sourced agent development platform designed around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK introduces a unified orchestrator with hierarchical working memory for long-context reasoning, a persistent note-taking system for cross-session continual learning, and a modular extension module for robust tool use. Moreover, a meta-agent automates the synthesis, evaluation, and refinement of agent configurations through a build-test-improve loop, enabling rapid agent development on new tasks, environments, and tool stacks. Instantiated on Confucius SDK with these mechanisms, CCA delivers strong performance on real-world software engineering tasks. On SWE-Bench-Pro, CCA achieves a state-of-the-art Resolve@1 performance of 54.3%, substantially improving over prior coding agents. Together, the Confucius SDK and CCA provide a transparent, extensible, and reproducible foundation for AI agents, bridge gaps between research prototypes and production-grade systems, and support agent development and deployment at industrial scale.

AI Summary

SWE-Bench: a comprehensive benchmark to evaluate autonomous code-writing and code-fixing agents on realistic tasks. [3]
The combination of monorepo development and LLM-based tools like ECO underscores a trend toward holistic scale: treating an entire organization’s code as a single evolvable system, with AI agents providing the intelligence to manage global changes, dependency analysis, and performance tuning in ways humans alone could not easily scale. [2]
Large-scale software engineering has driven interest in AI assistance for code discovery, understanding, and consistent changes at scale. [1]

Deep Learning

High-Dimensional Data Processing: Benchmarking Machine Learning and Deep Learning Architectures in Local and Distributed Environments

Universidad de Guanajuato

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

Abstract
This document reports the sequence of practices and methodologies implemented during the Big Data course. It details the workflow beginning with the processing of the Epsilon dataset through group and individual strategies, followed by text analysis and classification with RestMex and movie feature analysis with IMDb. Finally, it describes the technical implementation of a distributed computing cluster with Apache Spark on Linux using Scala.

AI Summary

In the big data era, data completeness can be as important as algorithm sophistication. [3]
Big Data Analytics Distributed Computing Scalability Algorithm Sophistication Data Completeness The chronological progression demonstrates that mastering big data requires a systematic approach. [3]
The choice between local and distributed architectures is not merely about computational resources, but about the quality and completeness of the data available to the model. [2]

Beyond the Hype: Comparing Lightweight and Deep Learning Models for Air Quality Forecasting

National University of

Rate paper: 👍 👎 ♥ Save

Abstract
Accurate forecasting of urban air pollution is essential for protecting public health and guiding mitigation policies. While Deep Learning (DL) and hybrid pipelines dominate recent research, their complexity and limited interpretability hinder operational use. This study investigates whether lightweight additive models -- Facebook Prophet (FBP) and NeuralProphet (NP) -- can deliver competitive forecasts for particulate matter (PM$_{2.5}$, PM$_{10}$) in Beijing, China. Using multi-year pollutant and meteorological data, we applied systematic feature selection (correlation, mutual information, mRMR), leakage-safe scaling, and chronological data splits. Both models were trained with pollutant and precursor regressors, with NP additionally leveraging lagged dependencies. For context, two machine learning baselines (LSTM, LightGBM) and one traditional statistical model (SARIMAX) were also implemented. Performance was evaluated on a 7-day holdout using MAE, RMSE, and $R^2$. Results show that FBP consistently outperformed NP, SARIMAX, and the learning-based baselines, achieving test $R^2$ above 0.94 for both pollutants. These findings demonstrate that interpretable additive models remain competitive with both traditional and complex approaches, offering a practical balance of accuracy, transparency, and ease of deployment.

AI Summary

The study also explores the impact of different input features on the performance of the models and finds that using both air quality index and weather data improves the predictive power of the models. [3]
AQI: Air Quality Index MAE: Mean Absolute Error The study demonstrates the effectiveness of machine learning models in predicting AQIs and highlights the importance of using both air quality index and weather data for improved predictive power. [3]
The results of this study can be used to inform policy decisions related to air pollution control and mitigation strategies. [3]
The study only evaluates the performance of different models on a single dataset and does not explore the generalizability of the results to other locations or datasets. [3]
The authors do not provide any discussion on the limitations of the study, such as the potential impact of data quality issues or the lack of consideration for non-linear relationships between input features. [3]
The paper presents a comparative study of various machine learning models for predicting air quality indices (AQIs) in Beijing, China. [2]
The results show that the Prophet model outperforms other models in terms of accuracy, with a mean absolute error (MAE) of 4.35 μg/m³. [1]

📝 Consider adding more interests!
You currently have 2 interests registered. Adding more interests will help us provide better and more diverse paper recommendations.

Add More Interests

We did not find tons of content matching your interests we've included some additional topics that are popular. Also be aware that if the topics is not present in arxiv we wont be able to recommend it.

AI and Society

Deconstructing the Dual Black Box:A Plug-and-Play Cognitive Framework for Human-AI Collaborative Enhancement and Its Implications for AI Governance

Northeastern University

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Summary

RAMTN系统是一种基于元交互的人机协作认知增强范式，旨在通过提取专家决策框架来实现智能辅助和知识共享。该系统的核心思想是将人类专家的认知过程与计算机系统的信息处理能力结合起来，从而实现高效的决策支持和知识推理。 RAMTN系统的应用领域包括投资、医疗和教育等多个领域，旨在通过提取专家决策框架来提高决策准确性和效率。元交互（Meta-Interaction）：一种将人类认知过程与计算机系统信息处理能力结合起来的技术，旨在实现高效的决策支持和知识推理。人机协作认知增强范式（Human-Machine Collaborative Cognition Enhancement Paradigm）：一种基于元交互的框架，旨在通过提取专家决策框架来实现智能辅助和知识共享。 RAMTN系统是一种创新性的解决方案，旨在通过提取专家决策框架来提高决策准确性和效率。该系统的应用领域包括投资、医疗和教育等多个领域，具有广泛的潜力和前景。该系统的开发和应用依赖于大量的数据和信息资源，可能存在数据质量和可靠性的问题。该系统的安全性和隐私保护需要进一步研究和解决。元交互技术在决策支持和知识推理领域有广泛的应用和研究。 [3]

Research Automation with AI

Advancing Mathematical Research via Human-AI Interactive Theorem Proving

Peking University

Rate paper: 👍 👎 ♥ Save

AI Summary

Previous research has shown that human-AI collaboration can improve performance in various tasks, including theorem discovery and proof verification. [3]
The collaboration between human experts and an LLM is organized into three stages, starting from an informal conjecture and ending with a precise theorem and proof. [2]
Human-AI collaboration can significantly improve mathematical proof and theorem discovery. [1]

Kaapana: A Comprehensive Open-Source Platform for Integrating AI in Medical Imaging Research Environments

German Cancer Research

Rate paper: 👍 👎 ♥ Save

AGI: Artificial General Intelligence

Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale

Help us improve your experience!

This project is on its early stages your feedback can be pivotal on the future of the project. Let us know what you think about this week's papers and suggestions!

Give Feedback