Siemens AG
AI Insights - The framework enables non-experts, using simple prompts, to generate visualizations that correctly apply nuanced expert rules, bridging the expertise gap and mitigating the expert bottleneck. (ML: 0.98)
- The solution represents a significant step towards democratizing access to specialized expertise through an agent, enabling more efficient and effective data analysis across industries. (ML: 0.95)
- Evaluator assessments validate the practical impact of the framework: baseline outputs were deemed unreadable, while the proposed system's outputs were praised for showing the optimization process clearly. (ML: 0.94)
- The paper proposes a framework for capturing expert domain knowledge and leveraging it to construct LLM-based AI agents capable of autonomous expert-level performance. (ML: 0.90)
- The framework integrates a Retrieval-Augmented Generation (RAG) system, codified expert rules, and visualization design principles directly into the agent. (ML: 0.89)
- LLM: Large Language Model. RAG: Retrieval-Augmented Generation system. Physics-agnostic design pattern: a design approach that decouples visualization rules from specific physical phenomena. The research contributes a robust AI agent for visualization generation and a systematic, validated framework for engineering AI agents with human expert domain knowledge. (ML: 0.85)
- Technical validation demonstrates the framework's effectiveness, achieving a 206% improvement in output quality across five scenarios spanning three simulation domains. (ML: 0.84)
Abstract
Critical domain knowledge typically resides with a few experts, creating organizational bottlenecks in scalability and decision-making. Non-experts struggle to create effective visualizations, leading to suboptimal insights and diverting expert time. This paper investigates how to capture and embed human domain knowledge into AI agent systems through an industrial case study. We propose a software engineering framework for capturing human domain knowledge when engineering AI agents for simulation data visualization: a Large Language Model (LLM) is augmented with a request classifier, a Retrieval-Augmented Generation (RAG) system for code generation, codified expert rules, and visualization design principles, unified in an agent demonstrating autonomous, reactive, proactive, and social behavior. Evaluation across five scenarios spanning multiple engineering domains with 12 evaluators demonstrates a 206% improvement in output quality, with our agent achieving expert-level ratings in all cases versus the baseline's poor performance, while maintaining superior code quality with lower variance. Our contributions are an automated agent-based system for visualization generation and a validated framework for systematically capturing human domain knowledge and codifying tacit expert knowledge into AI agents, demonstrating that non-experts can achieve expert-level outcomes in specialized domains.
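A minimal Python sketch of the pipeline shape the abstract describes (request classification, RAG retrieval, LLM generation, codified rules). All names, the snippet-store layout, and the rule format are hypothetical illustrations, not the paper's actual interfaces.

```python
# Hypothetical sketch of the abstract's agent pipeline; names and data
# layouts are illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class VisualizationRequest:
    prompt: str        # non-expert's natural-language request
    dataset_path: str  # simulation data to visualize

def classify_request(req: VisualizationRequest) -> str:
    """Request classifier: route the prompt to an assumed task taxonomy."""
    return "field_plot" if "heatmap" in req.prompt.lower() else "time_series"

def retrieve_examples(task_type: str, store: dict[str, list[str]]) -> list[str]:
    """RAG step: fetch code snippets relevant to the task type."""
    return store.get(task_type, [])

def apply_expert_rules(code: str, rules: list[str]) -> str:
    """Codified expert rules (e.g., colormap conventions) appended as constraints."""
    return code + "\n# expert-rule constraints: " + "; ".join(rules)

def generate_visualization(req, store, rules, llm):
    task = classify_request(req)
    context = "\n".join(retrieve_examples(task, store))
    draft = llm(f"Task: {task}\nExamples:\n{context}\nRequest: {req.prompt}")
    return apply_expert_rules(draft, rules)

# Stub usage with a fake LLM callable
fake_llm = lambda p: "plt.imshow(field)  # ...generated plotting code..."
req = VisualizationRequest("plot a heatmap of temperature", "run_42.h5")
print(generate_visualization(req, {"field_plot": ["plt.imshow(...)"]},
                             ["use a perceptually uniform colormap"], fake_llm))
```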
Why are we recommending this paper?
Due to your Interest in AI for Data Science Engineering
This paper explores a critical approach to building AI agents, focusing on integrating expert knowledge, which is directly relevant to managing teams of data scientists. It addresses the challenge of scaling AI solutions within organizations, aligning with your interest in team management and AI for data science.
KAIST
AI Insights - AI agents initiated more feedback sessions than participants in Cycle 1, but participants became more active in giving feedback across cycles. (ML: 0.99)
- Participants sought to enable high-quality ideation while minimizing their own workload, expecting agents to autonomously generate and elaborate ideas with minimal human intervention. (ML: 0.97)
- Feedback: the process of providing input or suggestions on how to improve an idea or solution. (ML: 0.97)
- Idea Evaluation: the process of evaluating the quality or feasibility of generated ideas. (ML: 0.96)
- Users rarely participated in Idea Generation but frequently took Idea Evaluation roles. (ML: 0.96)
- Idea Generation: the process of generating new ideas or solutions. (ML: 0.95)
- Participants formed 36 teams, exploring diverse formations in team structure and role allocation. (ML: 0.93)
- Requests: the process of asking for specific information or clarification on an idea or solution. (ML: 0.92)
- Human-Multi-Agent Teams (HMATs): a team consisting of humans and artificial intelligence agents working together to achieve a common goal. (ML: 0.87)
- Single-tier hierarchy teams were most common, with users typically serving as the sole leader managing all agents directly. (ML: 0.86)
Abstract
Team-based collaboration is a cornerstone of modern creative work. Recent advances in generative AI open possibilities for humans to collaborate with multiple AI agents in distinct roles to address complex creative workflows. Yet, how to form Human-Multi-Agent Teams (HMATs) is underexplored, especially given that inter-agent interactions increase complexity and the risk of unexpected behaviors. In this exploratory study, we aim to understand how to form HMATs for creative work using CrafTeam, a technology probe that allows users to form and collaborate with their teams. We conducted a study with 12 design practitioners, in which participants iterated through a three-step cycle: forming HMATs, ideating with their teams, and reflecting on their teams' ideation. Our findings reveal that while participants initially attempted autonomous team operations, they ultimately adopted team formations in which they directly orchestrated agents. We discuss design considerations for HMAT formation that help humans effectively orchestrate multiple agents.
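A minimal data-model sketch, under assumed names, of the single-tier HMAT formation the study found most common: one human leader directly managing agents assigned the roles enumerated in the insights above. This is an illustration, not CrafTeam's API.

```python
# Hypothetical HMAT data model; role names follow the insights above,
# everything else is an illustrative assumption, not CrafTeam's API.
from dataclasses import dataclass, field

ROLES = {"idea_generation", "idea_evaluation", "feedback", "requests"}

@dataclass
class Agent:
    name: str
    role: str  # one of ROLES

@dataclass
class HMAT:
    """Single-tier hierarchy: one human leader directly managing all agents."""
    leader: str
    agents: list[Agent] = field(default_factory=list)

    def add_agent(self, name: str, role: str) -> None:
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.agents.append(Agent(name, role))

team = HMAT(leader="design practitioner")
team.add_agent("brainstormer", "idea_generation")
team.add_agent("critic", "idea_evaluation")
print([(a.name, a.role) for a in team.agents])
```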
Why are we recommending this paper?
Due to your Interest in Managing tech teams
This research investigates the formation of teams involving humans and AI agents, a key area for managing complex tech teams. The focus on creative workflows resonates with your interest in AI for data science engineering and the strategic use of AI in teams.
The University of Melbourne
AI Insights - The majority of developers (75.6%) reported using AI models for coding tasks, while 44.1% used them for reviewing code. (ML: 0.98)
- The study highlights the growing importance of AI models in software development, particularly for coding tasks and code review. (ML: 0.97)
- The study found that the use of AI models increased productivity by an average of 23.4%, but also introduced new challenges such as understanding model outputs and debugging issues. (ML: 0.96)
- Pre-trained language models: AI models trained on large datasets to generate human-like text or perform specific tasks. (ML: 0.95)
- The researchers identified three types of AI models used in software development: pre-trained language models, code generation tools, and chatbots. (ML: 0.95)
- The study examines the use of AI models in software development, specifically focusing on GitHub pull requests. (ML: 0.95)
- Developers are increasingly relying on AI models to improve productivity, but also face challenges in understanding model outputs and debugging issues. (ML: 0.94)
- Chatbots: computer programs that simulate conversation with humans using natural language processing (NLP) and machine learning algorithms. (ML: 0.93)
- Code generation tools: software tools that use AI to generate code based on user input or specifications. (ML: 0.91)
- GitHub pull requests: a feature in GitHub that allows developers to propose changes to a project's codebase. (ML: 0.83)
Abstract
Large Language Models (LLMs) increasingly automate software engineering tasks. While recent studies highlight the accelerated adoption of "AI as a teammate" in Open Source Software (OSS), developer interaction patterns remain under-explored. In this work, we investigated project-level guidelines and developers' interactions with AI-assisted pull requests (PRs) by expanding the AIDev dataset to include finer-grained contributor code ownership and a comparative baseline of human-created PRs. We found that over 67.5% of AI-co-authored PRs originate from contributors without prior code ownership. Despite this, the majority of repositories lack guidelines for AI-coding agent usage. Notably, we observed a distinct interaction pattern: AI-co-authored PRs are merged significantly faster with minimal feedback. In contrast to human-created PRs, where non-owner developers receive the most feedback, AI-co-authored PRs from non-owners receive the least, with approximately 80% merged without any explicit review. Finally, we discuss implications for developers and researchers.
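As a rough illustration of the ownership notion the abstract relies on, here is a hedged sketch of one way to flag a "non-owner" PR author from prior commit history; the AIDev dataset's actual ownership definition may differ, and the threshold is an assumption.

```python
# Hedged sketch: approximate "prior code ownership" for a PR author from
# the authors of earlier commits touching the same files. The threshold
# and the definition itself are assumptions, not the dataset's.
from collections import Counter

def ownership_share(prior_commit_authors: list[str], author: str) -> float:
    """Fraction of prior commits on the changed files made by `author`."""
    if not prior_commit_authors:
        return 0.0
    return Counter(prior_commit_authors)[author] / len(prior_commit_authors)

def is_non_owner(prior_commit_authors: list[str], author: str,
                 threshold: float = 0.05) -> bool:
    """Treat authors below an assumed 5% share as having no prior ownership."""
    return ownership_share(prior_commit_authors, author) < threshold

# A first-time contributor opening an AI-co-authored PR:
print(is_non_owner(["alice", "bob", "alice"], "carol"))  # True
```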
Why are we recommending this paper?
Due to your Interest in Managing teams of data scientists
This study examines the interaction patterns between developers and AI teammates in open-source projects, directly addressing the management of tech teams. Understanding how humans and AI collaborate is crucial for effective team leadership.
Aalborg University
AI Insights - The authors discuss the challenges associated with processing and analyzing such massive amounts of data in real time. (ML: 0.95)
- There is limited discussion of the scalability and performance of the system. (ML: 0.93)
- The paper does not provide a detailed evaluation of the proposed approach. (ML: 0.91)
- The authors discuss various existing approaches for managing trajectory data, including Hadoop GIS, TrajMesa, and MobilityDB. (ML: 0.85)
- The authors propose a novel approach for managing large-scale AIS (Automatic Identification System) datasets using a distributed database system. (ML: 0.85)
- The paper discusses the design and implementation of a maritime data warehouse within MobiSpaces, a project funded by the European Union under grant agreement no. 101070279. (ML: 0.84)
- AIS: Automatic Identification System. MobiSpaces: a maritime data warehouse project funded by the European Union. The proposed approach for managing AIS datasets using a distributed database system is efficient and scalable. (ML: 0.81)
Abstract
AIS data from ships is excellent for analyzing single-ship movements and monitoring all ships within a specific area. However, the AIS data needs to be cleaned, processed, and stored before being usable. This paper presents a system consisting of an efficient and modular ETL process for loading AIS data, as well as a distributed spatial data warehouse storing the trajectories of ships. To efficiently analyze a large set of ships, a raster approach to querying the AIS data is proposed. A spatially partitioned data warehouse with a granularized cell representation and heatmap presentation is designed, developed, and evaluated. Currently, the data warehouse stores approximately 312 million kilometers of ship trajectories, with more than 8 billion rows in the largest table. It is found that searching the cell representation is faster than searching the trajectory representation. Further, we show that the spatially divided shards enable consistently good scale-up for both cell and heatmap analytics in large areas, ranging from 354% to 1164% with a 5x increase in workers.
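To make the cell representation concrete, here is a minimal sketch that bins AIS positions into a fixed lat/lon grid and counts points per cell, the aggregation behind a heatmap view; the cell size and the (row, col) scheme are assumptions, as the paper's exact granularization is not reproduced here.

```python
# Minimal sketch of a granularized cell representation for AIS positions.
# CELL_DEG and the (row, col) scheme are illustrative assumptions.
from collections import defaultdict

CELL_DEG = 0.01  # assumed cell size in degrees (~1 km at mid-latitudes)

def cell_id(lat: float, lon: float) -> tuple[int, int]:
    """Map a ship position to its grid cell; area queries scan cells, not raw points."""
    return (int(lat // CELL_DEG), int(lon // CELL_DEG))

def heatmap(points: list[tuple[float, float]]) -> dict[tuple[int, int], int]:
    """Count AIS positions per cell: the basis of the heatmap presentation."""
    counts: dict[tuple[int, int], int] = defaultdict(int)
    for lat, lon in points:
        counts[cell_id(lat, lon)] += 1
    return dict(counts)

print(heatmap([(57.050, 9.920), (57.051, 9.921), (57.100, 9.950)]))
```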
Why are we recommending this paper?
Due to your Interest in AI for Data Science Management
This paper focuses on data warehousing and processing, which is a foundational element of data science engineering. It provides insights into data management strategies, a key skill for managing data science teams.
Informatics Europe & ERCIM WG on Software Research
AI Insights - Policy blindness: the tendency for funding programs to treat software as an auxiliary technology rather than a key enabler. (ML: 0.98)
- Software engineering is treated as a supporting activity rather than a scientific discipline in many policy frameworks and funding programs. (ML: 0.97)
- Fragmented incentives: internal metrics, publish-or-perish culture, quantification of research outputs, and stringent validation demands that encourage incremental innovation. (ML: 0.96)
- The erosion of software engineering's identity is not only a visibility problem but also a community problem that affects its ability to define shared challenges and attract talent and funding. (ML: 0.95)
- Software engineering research and innovation have built the digital world but now face an existential crisis. (ML: 0.94)
- To remain relevant, we must reclaim our identity as a scientific discipline that enables the digital age worldwide. (ML: 0.94)
- The software engineering research community has never been larger or more productive, but its identity is dissolving into the technologies it enables. (ML: 0.94)
- Industrial disconnect: the lack of shared tools, processes, and architectural conventions across sectors, leading to repeated solutions of the same problems in isolation. (ML: 0.93)
- A joint working group between ERCIM and Informatics Europe has been established to propose a community-driven realignment of research, education, and policy to elevate software research as a strategic priority in Europe. (ML: 0.93)
- Frontier technologies such as AI, quantum computing, photonics, and cybersecurity rely on software engineering to achieve scalability, reliability, and safety. (ML: 0.79)
Abstract
Software engineering is the invisible infrastructure of the digital age. Every breakthrough in artificial intelligence, quantum computing, photonics, and cybersecurity relies on advances in software engineering, yet the field is too often treated as a supportive digital component rather than as a strategic, enabling discipline. In policy frameworks, including major European programmes, software appears primarily as a building block within other technologies, while the scientific discipline of software engineering remains largely absent. This position paper argues that the long-term sustainability, dependability, and sovereignty of digital technologies depend on investment in software engineering research. It is a call to reclaim the identity of software engineering.
Why are we recommending this paper?
Due to your Interest in Data Science Engineering Management
This paper highlights the critical role of software engineering in enabling advancements across various fields, including AI. It's a valuable perspective on the foundational technologies that support data science initiatives.
Hong Kong Polytechnic University
AI Insights - The task granularity is flexible, and every reasoning chain must start from the raw data or a logically prior step. (ML: 0.97)
- The instructions may be too complex or detailed for some users, potentially leading to confusion. (ML: 0.95)
- The provided Jupyter Notebook content is a template for generating data science questions based on an answered notebook. (ML: 0.95)
- QRA: Question-Reasoning-Answer triplet. JSON: JavaScript Object Notation. Generating high-quality data science questions from an answered notebook requires careful analysis and adherence to specific guidelines. (ML: 0.94)
- The output format requires a valid JSON object with specific keys such as 'data_type', 'domain', 'task_type', 'language', 'question', 'reasoning', 'answer', 'best_score' (optional), and 'confidence'; see the sketch after this list. (ML: 0.89)
- The final output must be a valid JSON object with the specified structure. (ML: 0.82)
- The instructions provide detailed guidelines for generating QRA triplets, including not mentioning the notebook and ensuring diversity across task types. (ML: 0.79)
- The output format must conform to a valid JSON object with the specified keys, ensuring that the generated QRA triplets are accurate and comprehensive. (ML: 0.77)
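A small sketch of a QRA triplet in the JSON layout the insights describe; only the key names follow the schema listed above, and all field values are invented examples.

```python
# Illustrative QRA (Question-Reasoning-Answer) triplet using the key names
# from the insights above; all values are invented examples.
import json

qra = {
    "data_type": "tabular",
    "domain": "retail",
    "task_type": "data analysis",
    "language": "python",
    "question": "Which product category had the highest Q3 revenue?",
    "reasoning": "Group sales by category, sum revenue over Q3, take the max.",
    "answer": "Electronics",
    "best_score": None,  # optional key per the described schema
    "confidence": 0.9,
}

REQUIRED = {"data_type", "domain", "task_type", "language",
            "question", "reasoning", "answer", "confidence"}
assert REQUIRED <= qra.keys(), "missing required QRA keys"
print(json.dumps(qra, indent=2))
```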
Abstract
Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a significant challenge for evaluation. To address this, we introduce DSAEval, a benchmark comprising 641 real-world data science problems grounded in 285 diverse datasets, covering both structured and unstructured data (e.g., vision and text). DSAEval incorporates three distinctive features: (1) Multimodal Environment Perception, which enables agents to interpret observations from multiple modalities including text and vision; (2) Multi-Query Interactions, which mirror the iterative and cumulative nature of real-world data science projects; and (3) Multi-Dimensional Evaluation, which provides a holistic assessment across reasoning, code, and results. We systematically evaluate 11 advanced agentic LLMs using DSAEval. Our results show that Claude-Sonnet-4.5 achieves the strongest overall performance, GPT-5.2 is the most efficient, and MiMo-V2-Flash is the most cost-effective. We further demonstrate that multimodal perception consistently improves performance on vision-related tasks, with gains ranging from 2.04% to 11.30%. Overall, while current data science agents perform well on structured data and routine data analysis workflows, substantial challenges remain in unstructured domains. Finally, we offer critical insights and outline future research directions to advance the development of data science agents.
Why are we recommending this paper?
Due to your Interest in Managing teams of data scientists
Columbia University
AI Insights - They also note that the results may not generalize to other domains or tasks beyond text-to-SQL. (ML: 0.99)
- The authors acknowledge the limitations of their study, including the small size of the dataset and the potential bias in the persona modeling prompt. (ML: 0.98)
- The authors propose a framework for generating high-quality datasets that align with business intelligence (BI) settings and evaluate the performance of various LLMs using this framework. (ML: 0.97)
- Previous studies have shown that LLMs can be effective on certain NLP tasks but struggle with others, highlighting the need for more research in this area. (ML: 0.96)
- The proposed framework generates high-quality datasets that align with BI settings and evaluates the performance of various LLMs using this framework. (ML: 0.96)
- The text is a research paper on developing a benchmark for evaluating large language models (LLMs) on real-world text-to-SQL tasks. (ML: 0.96)
- The results show that some LLMs perform well on certain aspects of text-to-SQL but struggle with others, highlighting the need for more research in this area. (ML: 0.95)
- Text-to-SQL: a task where a natural language question is converted into an SQL query to retrieve relevant data from a database. (ML: 0.94)
- LLMs: Large Language Models, which are artificial intelligence models that can process and generate human-like text. (ML: 0.94)
- ReAct paradigm: a prompt-based agentic framework for evaluating LLMs on complex tasks such as text-to-SQL. (ML: 0.93)
Abstract
Evaluating Text-to-SQL agents in private business intelligence (BI) settings is challenging due to the scarcity of realistic, domain-specific data. While synthetic evaluation data offers a scalable solution, existing generation methods fail to capture business realism, i.e., whether questions reflect realistic business logic and workflows. We propose a Business Logic-Driven Data Synthesis framework that generates data grounded in business personas, work scenarios, and workflows. In addition, we improve the data quality by imposing a business reasoning complexity control strategy that diversifies the analytical reasoning steps required to answer the questions. Experiments on a production-scale Salesforce database show that our synthesized data achieves high business realism (98.44%), substantially outperforming OmniSQL (+19.5%) and SQL-Factory (+54.7%), while maintaining strong question-SQL alignment (98.59%). Our synthetic data also reveals that state-of-the-art Text-to-SQL models still have significant performance gaps, achieving only 42.86% execution accuracy on the most complex business queries.
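As an illustration of the persona-grounded synthesis the abstract describes, here is a hedged sketch of a prompt builder that grounds each question in a persona, scenario, and schema, and varies reasoning complexity by step count; the prompt wording and fields are assumptions, not the paper's templates.

```python
# Hypothetical prompt builder for business-logic-driven data synthesis;
# wording, fields, and the step-count complexity control are assumptions.
from dataclasses import dataclass

@dataclass
class Persona:
    role: str  # e.g., "sales operations analyst"
    goal: str  # the business objective driving the question

def synthesis_prompt(persona: Persona, scenario: str, schema: str, steps: int) -> str:
    """Build an LLM prompt whose reasoning complexity is steered by `steps`."""
    return (
        f"You are a {persona.role} whose goal is to {persona.goal}.\n"
        f"Scenario: {scenario}\n"
        f"Database schema:\n{schema}\n"
        f"Write one analytical question requiring about {steps} reasoning "
        f"steps, then the SQL query that answers it."
    )

p = Persona(role="sales operations analyst", goal="track pipeline conversion")
print(synthesis_prompt(p, "quarterly forecast review",
                       "opportunities(id, stage, amount, close_date)", 3))
```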
Why are we recommending this paper?
Due to your Interest in Data Science Engineering