Hi!

Your personalized paper recommendations for 01 to 05 December 2025.
🎯 Top Personalized Recommendations
UnB
Rate paper: 👍 👎 ♥ Save
AI Summary
  • Data Science: The study of extracting insights from large datasets using various techniques, including machine learning, statistics, and visualization. [3]
  • A systematic literature review was conducted to analyze the application of risk management frameworks in data science projects. [2]
Abstract
Data science initiatives frequently exhibit high failure rates, driven by technical constraints, organizational limitations and insufficient risk management practices. Challenges such as low data maturity, lack of governance, misalignment between technical and business teams, and the absence of structured mechanisms to address ethical and sociotechnical risks have been widely identified in the literature. In this context, the purpose of this study is to conduct a comparative analysis of the main risk management methodologies applied to data science projects, aiming to identify, classify, and synthesize their similarities, differences and existing gaps. An integrative literature review was performed using indexed databases and a structured protocol for selection and content analysis. The study examines widely adopted risk management standards (ISO 31000, PMBOK Risk Management, and the NIST RMF), as well as frameworks specific to data science workflows, such as CRISP-DM and the recently proposed DS-EthiCo RMF, which incorporates ethical and sociotechnical dimensions into the project life cycle. The findings reveal that traditional approaches provide limited coverage of emerging risks, whereas contemporary models propose multidimensional structures capable of integrating ethical oversight, governance and continuous monitoring. As a contribution, this work offers theoretical support for the development of hybrid frameworks that balance technical efficiency, organizational alignment and responsible data practices, while highlighting research gaps that can guide future investigations.
Why we think this paper is great for you:
This paper directly addresses critical risk management methodologies within data science projects, which is essential for ensuring project success and organizational alignment. It offers insights into overcoming common challenges like governance and team misalignment, crucial for effective leadership.
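If you want to experiment with the kind of hybrid risk register this comparison points toward, here is a minimal sketch. It is our own illustration, not taken from the paper; the categories, scoring scheme, and field names are assumptions.

```python
from dataclasses import dataclass
from enum import Enum

# Illustrative risk categories; the frameworks the paper compares (ISO 31000,
# PMBOK, NIST RMF, CRISP-DM, DS-EthiCo RMF) each slice these differently.
class RiskCategory(Enum):
    TECHNICAL = "technical"            # e.g. low data maturity, fragile pipelines
    ORGANIZATIONAL = "organizational"  # e.g. business/technical misalignment
    ETHICAL = "ethical"                # e.g. bias, privacy, sociotechnical impact

@dataclass
class Risk:
    description: str
    category: RiskCategory
    likelihood: int  # 1 (rare) .. 5 (almost certain)
    impact: int      # 1 (negligible) .. 5 (severe)
    mitigation: str = ""

    @property
    def score(self) -> int:
        # Simple likelihood-times-impact scoring, as in PMBOK-style risk registers
        return self.likelihood * self.impact

register = [
    Risk("Training data has unknown provenance", RiskCategory.TECHNICAL, 4, 3,
         "Add data lineage checks before modeling"),
    Risk("Model may encode demographic bias", RiskCategory.ETHICAL, 3, 5,
         "Schedule a fairness review at each lifecycle phase"),
]
for r in sorted(register, key=lambda r: r.score, reverse=True):
    print(f"[{r.category.value}] score={r.score}: {r.description}")
```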
Waseda University
Rate paper: 👍 👎 ♥ Save
AI Summary
  • { "title": "Enterprise Data Science Platform (EDSP)", "description": "A unified data management architecture that addresses data management challenges in multi-query engine environments." } { "term": "Write-Once, Read-Anywhere", "definition": "A principle that enables data to be written once and read from multiple query engines without replication or duplication." } { "title": "EDSP Demonstrates Practical Solution to Data Silos in Multi-Query Engine Enterprises" , "description": "The Enterprise Data Science Platform (EDSP) demonstrates that the Write-Once, Read-Anywhere principle can be realized in production environments, offering a practical solution to the long-standing problem of data silos in multi-query engine enterprises." } { "title": "Limited Performance Validation" , "description": "Future work includes performance validation on TB-scale datasets." } { "title": "Data Lake Architectures and Metadata Management" , "description": "The paper references a study on data lake architectures and metadata management, highlighting the importance of metadata in data sharing across heterogeneous query engines." } The paper proposes the Enterprise Data Science Platform (EDSP), a unified data management architecture grounded in the Write-Once, Read-Anywhere principle, to address data management challenges in multi-query engine environments. [2]
Abstract
Organizations struggle to share data across departments that have adopted different data analytics platforms. If n datasets must serve m environments, up to n*m replicas can emerge, increasing inconsistency and cost. Traditional warehouses copy data into vendor-specific stores; cross-platform access is hard. This study proposes the Enterprise Data Science Platform (EDSP), which builds on data lakehouse architecture and follows a Write-Once, Read-Anywhere principle. EDSP enables federated data access for multi-query engine environments, targeting data science workloads with periodic data updates and query response times ranging from seconds to minutes. By providing centralized data management with federated access from multiple query engines to the same data sources, EDSP eliminates data duplication and vendor lock-in inherent in traditional data warehouses. The platform employs a four-layer architecture: Data Preparation, Data Store, Access Interface, and Query Engines. This design enforces separation of concerns and reduces the need for data migration when integrating additional analytical environments. Experimental results demonstrate that major cloud data warehouses and programming environments can directly query EDSP-managed datasets. We implemented and deployed EDSP in production, confirming interoperability across multiple query engines. For data sharing across different analytical environments, EDSP achieves a 33-44% reduction in operational steps compared with conventional approaches requiring data migration. Although query latency may increase by up to a factor of 2.6 compared with native tables, end-to-end completion times remain on the order of seconds, maintaining practical performance for analytical use cases. Based on our production experience, EDSP provides practical design guidelines for addressing the data-silo problem in multi-query engine environments.
Why we think this paper is great for you:
This paper presents a unified architecture for an Enterprise Data Science Platform, vital for managing federated data access across diverse organizational departments. You will find its discussion on reducing inconsistency and cost in data sharing highly valuable for optimizing data science operations.
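To get a feel for the Write-Once, Read-Anywhere principle described in the abstract, here is a minimal sketch. It is our own illustration, not the paper's EDSP implementation: a single Parquet file is written once and then queried in place by two different engines, DuckDB and pandas, without any copy.

```python
import duckdb
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Write once: persist a small dataset as Parquet in a shared location.
# In a real lakehouse this would be object storage (e.g. S3/GCS), not a local path.
table = pa.table({"region": ["EMEA", "APAC", "AMER"], "revenue": [120, 95, 210]})
pq.write_table(table, "sales.parquet")

# Read anywhere, engine 1: DuckDB queries the file in place, no copy made.
top = duckdb.sql(
    "SELECT region, revenue FROM read_parquet('sales.parquet') ORDER BY revenue DESC"
).fetchall()
print("DuckDB:", top)

# Read anywhere, engine 2: pandas reads the same file directly.
df = pd.read_parquet("sales.parquet")
print("pandas total revenue:", df["revenue"].sum())
```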
Georgia Institute of Technology
Rate paper: 👍 👎 ♥ Save
AI Summary
  • They also discuss the use of Observational Health Data Sciences and Informatics (OHDSI) and the OMOP Common Data Model (CDM) for data harmonization and analysis. [3]
  • Data Quality: refers to the accuracy, completeness, and consistency of data used in analysis or decision-making. [3]
  • Harmonization: refers to the process of converting data from different sources into a common format for analysis or comparison. [3]
  • The article discusses the challenges of implementing artificial intelligence (AI) in healthcare, particularly in terms of data quality and harmonization. [2]
Abstract
The rapid growth of Artificial Intelligence (AI) in healthcare has sparked interest in Trustworthy AI and AI Implementation Science, both of which are essential for accelerating clinical adoption. However, strict regulations, gaps between research and clinical settings, and challenges in evaluating AI systems continue to hinder real-world implementation. This study presents an AI implementation case study within Shriners Children's (SC), a large multisite pediatric system, showcasing the modernization of SC's Research Data Warehouse (RDW) to OMOP CDM v5.4 within a secure Microsoft Fabric environment. We introduce a Python-based data quality assessment tool compatible with SC's infrastructure, extending OHDSI's R/Java-based Data Quality Dashboard (DQD) and integrating Trustworthy AI principles using the METRIC framework. This extension enhances data quality evaluation by addressing informative missingness, redundancy, timeliness, and distributional consistency. We also compare systematic and case-specific AI implementation strategies for Craniofacial Microsomia (CFM) using the FHIR standard. Our contributions include a real-world evaluation of AI implementations, integration of Trustworthy AI principles into data quality assessment, and insights into hybrid implementation strategies that blend systematic infrastructure with use-case-driven approaches to advance AI in healthcare.
Why we think this paper is great for you:
Focusing on AI implementation science and trustworthy data, this paper provides crucial insights into accelerating the adoption of AI within large systems. It helps you understand the challenges and strategies for deploying reliable AI solutions, a key aspect of managing AI initiatives.
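For a flavor of the data quality dimensions the abstract lists (informative missingness, redundancy, timeliness, distributional consistency), here is a small pandas-based sketch. It is our own illustration, not the paper's DQD extension; the function name and the choice of statistics are assumptions.

```python
import pandas as pd
from scipy.stats import ks_2samp

def quality_report(df: pd.DataFrame, reference: pd.DataFrame, date_col: str) -> dict:
    """Toy data-quality checks in the spirit of the dimensions named in the abstract.
    Illustrative only; not the paper's Data Quality Dashboard extension."""
    report = {
        # Missingness: fraction of null cells per column
        "missingness": df.isna().mean().to_dict(),
        # Redundancy: share of fully duplicated rows
        "duplicate_rows": float(df.duplicated().mean()),
        # Timeliness: days since the most recent record
        "days_since_last_record": (pd.Timestamp.now() - pd.to_datetime(df[date_col]).max()).days,
    }
    # Distributional consistency: two-sample KS test of each numeric column
    # against a reference extract (e.g. a previous refresh of the warehouse).
    drift = {}
    for col in df.select_dtypes("number").columns:
        if col in reference.columns:
            stat, p = ks_2samp(df[col].dropna(), reference[col].dropna())
            drift[col] = {"ks_stat": float(stat), "p_value": float(p)}
    report["distribution_drift"] = drift
    return report
```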
CAS
Rate paper: 👍 👎 ♥ Save
AI Summary
  • DAComp is a comprehensive benchmark designed to evaluate data agents across the full data intelligence lifecycle. [3]
  • The benchmark aims to steer the community beyond mere technical accuracy, driving the evolution of truly autonomous and capable data agents for the enterprise. [3]
  • Data Agent (DA): An LLM-driven autonomous system that plans and executes end-to-end workflows, acquiring, transforming, and analyzing data via tool use and code execution to achieve user-defined objectives. [3]
  • LLM: Large Language Model.
  • DAComp: Data Agent Comprehensive Benchmark.
  • DAComp is a rigorous standard for evaluating data agents, bridging the gap between isolated code generation and real-world enterprise demands. [3]
  • The benchmark includes two testbeds: DAComp-DE for repository-level pipeline orchestration and DAComp-DA for open-ended analytical reasoning. [2]
Abstract
Real-world enterprise data intelligence workflows encompass data engineering that turns raw sources into analysis-ready tables and data analysis that converts those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex workflows. Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and evolving existing systems under changing requirements. Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations. Engineering tasks are scored through execution-based, multi-metric evaluation. Open-ended tasks are assessed by a reliable, experimentally validated LLM-judge, which is guided by hierarchical, meticulously crafted rubrics. Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 20%, exposing a critical bottleneck in holistic pipeline orchestration, not merely code generation. Scores on DA tasks also average below 40%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities. By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed to drive the development of truly capable autonomous data agents for enterprise settings. Our data and code are available at https://da-comp.github.io
Why we think this paper is great for you:
This paper introduces a benchmark for data agents across the entire data intelligence lifecycle, from engineering to analysis. You will appreciate its comprehensive approach to evaluating and optimizing complex enterprise data workflows.
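The abstract contrasts execution-based scoring for engineering tasks with rubric-guided LLM judging for analysis tasks. The sketch below (our own illustration, not DAComp's harness) shows the execution-based half in miniature: a candidate SQL query is run on a fresh SQLite database and its output compared with a reference result.

```python
import sqlite3

def evaluate_sql(setup_sql: str, candidate_sql: str, expected_rows: list[tuple]) -> bool:
    """Execute a candidate query on a fresh in-memory database and compare its
    output with the expected result set. Illustrative only; DAComp's actual
    evaluation is multi-metric and repository-level."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)          # build the source tables
        rows = conn.execute(candidate_sql).fetchall()
        return sorted(rows) == sorted(expected_rows)
    except sqlite3.Error:
        return False                           # non-executable SQL fails the task
    finally:
        conn.close()

setup = "CREATE TABLE orders(id INTEGER, amount REAL); INSERT INTO orders VALUES (1, 10.0), (2, 32.5);"
print(evaluate_sql(setup, "SELECT SUM(amount) FROM orders", [(42.5,)]))  # True
```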
Meta
Rate paper: 👍 👎 ♥ Save
AI Summary
  • Factsheets are designed to provide a comprehensive overview of the model's capabilities, limitations, and potential biases. [3]
  • The authors argue that Factsheets can improve transparency, reproducibility, and informed decision-making in AI research. [3]
  • Model Description: A detailed description of the model's architecture, training data, and hyperparameters. [3]
  • Data Statement: A statement that describes the data used to train the model, including its source, quality, and any potential biases. [3]
  • Evaluation Metrics: The metrics used to evaluate the model's performance, such as accuracy, precision, and recall. [3]
  • Use Cases: Examples of how the model can be used in real-world applications. [3]
  • The paper proposes Eval Factsheets, a structured framework for documenting AI system evaluations, including those of large language models (LLMs). [2]
Abstract
The rapid proliferation of benchmarks has created significant challenges in reproducibility, transparency, and informed decision-making. However, unlike datasets and models -- which benefit from structured documentation frameworks like Datasheets and Model Cards -- evaluation methodologies lack systematic documentation standards. We introduce Eval Factsheets, a structured, descriptive framework for documenting AI system evaluations through a comprehensive taxonomy and questionnaire-based approach. Our framework organizes evaluation characteristics across five fundamental dimensions: Context (Who made the evaluation and when?), Scope (What does it evaluate?), Structure (With what the evaluation is built?), Method (How does it work?) and Alignment (In what ways is it reliable/valid/robust?). We implement this taxonomy as a practical questionnaire spanning five sections with mandatory and recommended documentation elements. Through case studies on multiple benchmarks, we demonstrate that Eval Factsheets effectively captures diverse evaluation paradigms -- from traditional benchmarks to LLM-as-judge methodologies -- while maintaining consistency and comparability. We hope Eval Factsheets are incorporated into both existing and newly released evaluation frameworks and lead to more transparency and reproducibility.
Why we think this paper is great for you:
This paper proposes a structured framework for documenting AI evaluations, enhancing reproducibility and transparency in AI projects. It offers practical guidance for making informed decisions and managing the quality of AI systems.
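The abstract names five dimensions (Context, Scope, Structure, Method, Alignment) captured through a questionnaire. A factsheet along those lines could be serialized as a simple structured record; the sketch below is our own illustration, and every field beyond the five section names is an assumption rather than the framework's actual questionnaire items.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class EvalFactsheet:
    # Section names follow the abstract; the individual entries are illustrative.
    context: dict    # Who made the evaluation and when?
    scope: dict      # What does it evaluate?
    structure: dict  # With what is the evaluation built?
    method: dict     # How does it work?
    alignment: dict  # In what ways is it reliable/valid/robust?

sheet = EvalFactsheet(
    context={"authors": "Example Lab", "year": 2025},
    scope={"capability": "code generation", "languages": ["Python", "SQL"]},
    structure={"items": 210, "data_sources": ["industrial schemas"]},
    method={"scoring": "execution-based + LLM-as-judge", "rubrics": "hierarchical"},
    alignment={"judge_validation": "agreement with human raters reported"},
)
print(json.dumps(asdict(sheet), indent=2))
```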
Ashoka University
Rate paper: 👍 👎 ♥ Save
AI Summary
  • The Datalake is a cloud-based platform that enables users to perform simple and complex analytical tasks on multiple datasets. [3]
  • It provides an end-to-end solution for data management, including provenance and version tracking. [3]
  • The ease of use of the Datalake is key in democratizing access to data and good data science practices. [3]
  • Datalake: A cloud-based platform that enables users to perform simple and complex analytical tasks on multiple datasets. [3]
  • Provenance: The origin or history of a dataset or its components. [3]
  • Version tracking: The process of keeping track of changes made to a dataset over time. [3]
  • Funding support by Mphasis AI Lab at Ashoka University. [3]
  • Access to data and analysis tools are the most important factors in lowering barriers for NGOs, grassroots organizations, and students who may not be well-versed in using computer science tools for data processing. [2]
  • Limited user engagement with social scientists. [1]
Abstract
Social science research increasingly demands data-driven insights, yet researchers often face barriers such as lack of technical expertise, inconsistent data formats, and limited access to reliable datasets. In this paper, we present a Datalake infrastructure tailored to the needs of interdisciplinary social science research. Our system supports ingestion and integration of diverse data types, automatic provenance and version tracking, role-based access control, and built-in tools for visualization and analysis. We demonstrate the utility of our Datalake using real-world use cases spanning governance, health, and education. A detailed walkthrough of one such use case -- analyzing the relationship between income, education, and infant mortality -- shows how our platform streamlines the research process while maintaining transparency and reproducibility. We argue that such infrastructure can democratize access to advanced data science practices, especially for NGOs, students, and grassroots organizations. The Datalake continues to evolve with plans to support ML pipelines, mobile access, and citizen data feedback mechanisms.
Why we think this paper is great for you:
This paper explores the development of a datalake for data-driven research, addressing common barriers like inconsistent data formats and access limitations. Understanding these architectural solutions can inform your strategies for data engineering and management.
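Provenance and version tracking, two of the Datalake features the abstract highlights, boil down to recording where a dataset came from and hashing each version's content. The sketch below is our own minimal illustration of that idea, not the platform's implementation; the record fields and example file name are assumptions.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

@dataclass
class DatasetVersion:
    source: str                  # provenance: where the data came from
    content_hash: str            # identifies this exact version of the bytes
    ingested_at: str
    derived_from: List[str] = field(default_factory=list)  # lineage to parent versions

def ingest(raw_bytes: bytes, source: str, parents: Optional[List[str]] = None) -> DatasetVersion:
    """Register one version of a dataset together with its provenance metadata."""
    return DatasetVersion(
        source=source,
        content_hash=hashlib.sha256(raw_bytes).hexdigest(),
        ingested_at=datetime.now(timezone.utc).isoformat(),
        derived_from=parents or [],
    )

v1 = ingest(b"district,infant_mortality\nA,12\n", source="census_portal_2021.csv")
print(json.dumps(v1.__dict__, indent=2))
```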
Polytechnique Montreal
Rate paper: 👍 👎 ♥ Save
AI Summary
  • The study found that students rely heavily on AI tools for incremental learning and advanced implementation, but have mixed results when using them for initial learning and first implementations. [3]
  • Students emphasized several benefits of GenAI, including brainstorming support, access to diverse examples, and accelerated comprehension of SE concepts. [3]
  • Despite the benefits, students reported substantial challenges with GenAI, particularly in line with categories C3–C5 of the prior framework: misalignment with learning preferences, lack of rationale, and difficulty adapting AI responses. [3]
  • The study suggests that while communication difficulties were less dominant, response quality and reliability remain major concerns for students when using GenAI. [3]
  • GenAI: Generative artificial intelligence tools that can perform a wide range of tasks, including coding, debugging, and providing explanations. [3]
  • Curricula must explicitly teach verification, testing, and critical assessment of GenAI outputs to mitigate the risks associated with their use. [2]
  • SE education: Software engineering education, which focuses on teaching students how to design, develop, test, and maintain software systems. [1]
Abstract
Context. The rise of generative AI (GenAI) tools like ChatGPT and GitHub Copilot has transformed how software is learned and written. In software engineering (SE) education, these tools offer new opportunities for support, but also raise concerns about over-reliance, ethical use, and impacts on learning. Objective. This study investigates how undergraduate SE students use GenAI tools, focusing on the benefits, challenges, ethical concerns, and instructional expectations that shape their experiences. Method. We conducted a survey with 130 undergraduate students from two universities. The survey combined structured Likert-scale items and open-ended questions to investigate five dimensions: usage context, perceived benefits, challenges, ethical and instructional perceptions. Results. Students most often use GenAI for incremental learning and advanced implementation, reporting benefits such as brainstorming support and confidence-building. At the same time, they face challenges including unclear rationales and difficulty adapting outputs. Students highlight ethical concerns around fairness and misconduct, and call for clearer instructional guidance. Conclusion. GenAI is reshaping SE education in nuanced ways. Our findings underscore the need for scaffolding, ethical policies, and adaptive instructional strategies to ensure that GenAI supports equitable and effective learning.
Why we think this paper is great for you:
This paper examines the impact of GenAI tools in software engineering, offering insights into their opportunities and challenges. It provides valuable context for managing tech teams and integrating new AI technologies responsibly.

Interests not found

We did not find any papers that match the interests below. Try other terms, and also consider whether the content exists on arxiv.org.
  • Data Science Engineering Management
  • Managing tech teams
  • Data Science Management
  • Managing teams of data scientists
You can edit or add more interests any time.