DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

Peking University

Rate paper: 👍 👎 ♥ Save

AI Insights

It organizes operators along multiple orthogonal categorization dimensions, including modality, core vs. [3]
PyTorch: An open-source machine learning library for Python. [3]
domain-specific, and functional categories. [2]
DataFlow is a unified data preparation framework that supports end-to-end LLM data preparation workflows. [1]

Abstract
The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3\% execution accuracy in Text-to-SQL over SynSQL, +7\% average improvements on code benchmarks, and 1--3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.

Why we are recommending this paper?
Due to your Interest in: Data Science Engineering

This paper directly addresses the need for automated data pipelines, aligning with your interest in data science engineering and AI for data science. The framework's focus on LLMs and workflow automation is highly relevant to managing and optimizing data science teams.

Beyond Text-to-SQL: Autonomous Research-Driven Database Exploration with DAR

Mantis Analytics

Rate paper: 👍 👎 ♥ Save

Abstract
Large language models can already query databases, yet most existing systems remain reactive: they rely on explicit user prompts and do not actively explore data. We introduce DAR (Data Agnostic Researcher), a multi-agent system that performs end-to-end database research without human-initiated queries. DAR orchestrates specialized AI agents across three layers: initialization (intent inference and metadata extraction), execution (SQL and AI-based query synthesis with iterative validation), and synthesis (report generation with built-in quality control). All reasoning is executed directly inside BigQuery using native generative AI functions, eliminating data movement and preserving data governance. On a realistic asset-incident dataset, DAR completes the full analytical task in 16 minutes, compared to 8.5 hours for a professional analyst (approximately 32x times faster), while producing useful pattern-based insights and evidence-grounded recommendations. Although human experts continue to offer deeper contextual interpretation, DAR excels at rapid exploratory analysis. Overall, this work shifts database interaction from query-driven assistance toward autonomous, research-driven exploration within cloud data warehouses.

Why we are recommending this paper?
Due to your Interest in: Data Science Engineering

Given your interest in AI for data science management, this paper’s exploration of autonomous database research using LLMs is a strong fit. The system’s ability to proactively explore data aligns with strategies for efficient data access and utilization within a team.

Towards AI-Supported Research: a Vision of the TIB AIssistant

TIB Leibniz Information

Rate paper: 👍 👎 ♥ Save

AI Insights

ORKG (Open Research Knowledge Graph): A large-scale knowledge graph that integrates various sources of research information. [3]
The paper discusses the development of an AI-supported research platform called Tib Aissistant, which aims to facilitate research across various life cycles. [2]
Tib Aissistant's architecture is based on a modular design, with components for prompt engineering, tool integration, and knowledge graph-based search. [1]

Abstract
The rapid advancements in Generative AI and Large Language Models promise to transform the way research is conducted, potentially offering unprecedented opportunities to augment scholarly workflows. However, effectively integrating AI into research remains a challenge due to varying domain requirements, limited AI literacy, the complexity of coordinating tools and agents, and the unclear accuracy of Generative AI in research. We present the vision of the TIB AIssistant, a domain-agnostic human-machine collaborative platform designed to support researchers across disciplines in scientific discovery, with AI assistants supporting tasks across the research life cycle. The platform offers modular components - including prompt and tool libraries, a shared data store, and a flexible orchestration framework - that collectively facilitate ideation, literature analysis, methodology development, data analysis, and scholarly writing. We describe the conceptual framework, system architecture, and implementation of an early prototype that demonstrates the feasibility and potential impact of our approach.

Why we are recommending this paper?
Due to your Interest in: AI for Data Science Engineering

Coming from TIB Leibniz Information, this paper explores the broader integration of AI into research workflows, directly addressing your interest in augmenting scholarly workflows with AI. It offers insights into how AI can support research teams and processes.

A Data Annotation Requirements Representation and Specification (DARS)

University of Gothenburg

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

Abstract
With the rise of AI-enabled cyber-physical systems, data annotation has become a critical yet often overlooked process in the development of these intelligent information systems. Existing work in requirements engineering (RE) has explored how requirements for AI systems and their data can be represented. However, related interviews with industry professionals show that data annotations and their related requirements introduce distinct challenges, indicating a need for annotation-specific requirement representations. We propose the Data Annotation Requirements Representation and Specification (DARS), including an Annotation Negotiation Card to align stakeholders on objectives and constraints, and a Scenario-Based Annotation Specification to express atomic and verifiable data annotation requirements. We evaluate DARS with an automotive perception case related to an ongoing project, and a mapping against 18 real-world data annotation error types. The results suggest that DARS mitigates root causes of completeness, accuracy, and consistency annotation errors. By integrating DARS into RE, this work improves the reliability of safety-critical systems using data annotations and demonstrates how engineering frameworks must evolve for data-dependent components of today's intelligent information systems.

Why we are recommending this paper?
Due to your Interest in: Data Science Engineering Management

This paper’s focus on data annotation requirements, particularly in the context of AI-enabled systems, is highly relevant to your interest in data science engineering management. Understanding data requirements is crucial for effective team management and project success.

Implementing a Scalable, Redeployable and Multitiered Repository for FAIR and Secure Scientific Data Sharing: The BIG-MAP Archive

EPFL cole Polytechnique

Rate paper: 👍 👎 ♥ Save

AI Insights

The archive has been integrated with other services such as Materials Cloud Archive and Keycloak for authentication and authorization. [3]
It is also connected to the DECODE project, which aims to develop a European strategy for artificial intelligence in science. [3]
The archive has been used by researchers from various institutions, including the Technical University of Denmark (DTU) and the Swiss Supercomputer Center (CSCS). [3]
It has also been cited in several publications, demonstrating its utility as a research tool. [3]
The BIG-MAP Archive is a digital repository for storing and sharing research data in the field of materials science. [2]
Keycloak: An open-source authentication and authorization server. [1]

Abstract
Data sharing in large consortia, such as research collaborations or industry partnerships, requires addressing both organizational and technical challenges. A common platform is essential to promote collaboration, facilitate exchange of findings, and ensure secure access to sensitive data. Key technical challenges include creating a scalable architecture, a user-friendly interface, and robust security and access control. The BIG-MAP Archive is a cloud-based, disciplinary, private repository designed to address these challenges. Built on InvenioRDM, it leverages platform functionalities to meet consortium-specific needs, providing a tailored solution compared to general repositories. Access can be restricted to members of specific communities or open to the entire consortium, such as the BATTERY 2030+, a consortium accelerating advanced battery technologies. Uploaded data and metadata are controlled via fine grained permissions, allowing access to individual project members or the full initiative. The formalized upload process ensures data are formatted and ready for publication in open repositories when needed. This paper reviews the repository's key features, showing how the BIG-MAP Archive enables secure, controlled data sharing within large consortia. It ensures data confidentiality while supporting flexible, permissions-based access and can be easily redeployed for other consortia, including MaterialsCommons4.eu and RAISE (Resource for AI Science in Europe).

Why we are recommending this paper?
Due to your Interest in: Managing teams of data scientists

Given your interest in managing teams of data scientists and the need for secure data sharing, this paper’s exploration of a scalable data repository aligns with your focus on efficient data management and collaboration within scientific consortia.

Interests not found

Help us improve your experience!