Papers from 15–19 September 2025

Here are your personalized paper recommendations, sorted by relevance.
Data Science Engineering
šŸ‘ šŸ‘Ž ♄ Save
Rice University, Virginia
Abstract
Rapid computational developments - particularly the proliferation of artificial intelligence (AI) - increasingly shape social scientific research while raising new questions about in-depth qualitative methods such as ethnography and interviewing. Building on classic debates about using computers to analyze qualitative data, we revisit longstanding concerns and assess possibilities and dangers in an era of automation, AI chatbots, and 'big data.' We first historicize developments by revisiting classical and emergent concerns about qualitative analysis with computers. We then introduce a typology of contemporary modes of engagement - streamlining workflows, scaling up projects, hybrid analytical approaches, and the sociology of computation - alongside rejection of computational analyses. We illustrate these approaches with detailed workflow examples from a large-scale ethnographic study and guidance for solo researchers. We argue for a pragmatic sociological approach that moves beyond dualisms of technological optimism versus rejection to show how computational tools - simultaneously dangerous and generative - can be adapted to support longstanding qualitative aims when used carefully in ways aligned with core methodological commitments.
AI Insights
  • The study maps four AI engagement modes—workflow streamlining, scaling, hybrid analysis, and the sociology of computation—moving past the simple optimism-versus-rejection dualism.
  • A large‑scale ethnographic workflow example shows AI accelerating coding while preserving nuance.
  • Hybrid coding blends human insight with LLM prompts, cutting effort yet boosting reliability.
  • Warnings note that substituting LLMs for human participants can flatten identity groups, raising the ethical stakes.
  • The authors call for empirical tests of AI’s effectiveness, benchmarked against the rigor of traditional coding.
  • Solo researchers receive step‑by‑step guidance on integrating AI without compromising methodological integrity.
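For readers who want to see the hybrid mode concretely, below is a minimal sketch of LLM-assisted qualitative coding with a human review step; the codebook entries, model name, and prompt wording are illustrative assumptions, not the authors' actual workflow.

```python
# Hypothetical sketch: apply a human-defined codebook to interview
# excerpts with an LLM, queueing every suggestion for human review.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CODEBOOK = {
    "housing_insecurity": "eviction, rent burden, or unstable housing",
    "institutional_distrust": "skepticism toward agencies or officials",
}

def suggest_codes(excerpt: str) -> str:
    """Ask the model to tag one excerpt with codes from the codebook."""
    codebook_text = "\n".join(f"- {k}: {v}" for k, v in CODEBOOK.items())
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You assist a qualitative coding team. Apply only "
                        "the codes below and quote the supporting text.\n"
                        + codebook_text},
            {"role": "user", "content": excerpt},
        ],
    )
    return response.choices[0].message.content

# Suggestions are drafts: a human coder confirms or rejects each one,
# keeping interpretive control with the researcher, as the paper urges.
```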
šŸ‘ šŸ‘Ž ♄ Save
University of Michigan
Abstract
Unstructured data, such as text, images, audio, and video, comprises the vast majority of the world's information, yet it remains poorly supported by traditional data systems that rely on structured formats for computation. We argue for a new paradigm, which we call computing on unstructured data, built around three stages: extraction of latent structure, transformation of this structure through data processing techniques, and projection back into unstructured formats. This bi-directional pipeline allows unstructured data to benefit from the analytical power of structured computation, while preserving the richness and accessibility of unstructured representations for human and AI consumption. We illustrate this paradigm through two use cases and present the research components that need to be developed in a new data system called MXFlow.
AI Insights
  • MXFlow’s dynamic dataflow engine orchestrates neural and symbolic operators for seamless cross‑modal transformations.
  • Built‑in cost model predicts query time, guiding optimal operator placement across text, image, and table streams.
  • Unlike ETL, MXFlow supports full read‑write pipelines, enabling in‑place updates to extracted structures before projection.
  • Treating LLMs as first‑class storage, MXFlow merges declarative SQL semantics with generative reasoning over unstructured inputs.
  • Multimodal output layer can generate structured tables, annotated images, and natural‑language summaries simultaneously.
  • See Anderson et al.’s “LLM‑powered unstructured analytics system” paper for practical implementation insights.
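The extract–transform–project loop is concrete enough to sketch. The toy pipeline below uses assumed function names and pandas as a stand-in for the structured stage; it illustrates the bi-directional flow, not MXFlow's actual API.

```python
# Toy version of the paper's three-stage paradigm:
# unstructured text -> latent structure -> computation -> prose again.
import pandas as pd

def extract(reviews: list[str]) -> pd.DataFrame:
    """Stage 1: pull latent structure out of raw text. A real system
    would use an LLM or IE model here; we stub it with a keyword rule."""
    rows = [{"text": r, "sentiment": "pos" if "love" in r.lower() else "neg"}
            for r in reviews]
    return pd.DataFrame(rows)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Stage 2: ordinary structured computation over the extracted table."""
    return df.groupby("sentiment").size().rename("count").reset_index()

def project(df: pd.DataFrame) -> str:
    """Stage 3: render the structured result back into unstructured prose."""
    parts = [f"{row['count']} {row['sentiment']} reviews"
             for _, row in df.iterrows()]
    return "Summary: " + ", ".join(parts) + "."

reviews = ["I love this phone", "Battery died in a day"]
print(project(transform(extract(reviews))))  # Summary: 1 neg reviews, 1 pos reviews.
```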
Managing tech teams
šŸ‘ šŸ‘Ž ♄ Save
Abstract
Military and economic strategic competitiveness between nation-states will increasingly be defined by the capability and cost of their frontier artificial intelligence models. Among the first areas of geopolitical advantage granted by such systems will be in automating military intelligence. Much discussion has been devoted to AI systems enabling new military modalities, such as lethal autonomous weapons, or making strategic decisions. However, the ability of a country of "CIA analysts in a data-center" to synthesize diverse data at scale, and its implications, have been underexplored. Multimodal foundation models appear on track to automate strategic analysis previously done by humans. They will be able to fuse today's abundant satellite imagery, phone-location traces, social media records, and written documents into a single queryable system. We conduct a preliminary uplift study to empirically evaluate these capabilities, then propose a taxonomy of the kinds of ground truth questions these systems will answer, present a high-level model of the determinants of this system's AI capabilities, and provide recommendations for nation-states to remain strategically competitive within the new paradigm of automated intelligence.
šŸ‘ šŸ‘Ž ♄ Save
University of Glasgow
Abstract
Sustainable Software Engineering (SSE) is slowly becoming an industry need for reasons including reputation enhancement, improved profits and more efficient practices. However, SSE has many definitions, and this is a challenge for organisations trying to build a common and broadly agreed understanding of the term. Although much research effort has gone into identifying general SSE practices, there is a gap in understanding the sustainability needs of specific organisational contexts, such as financial services, which are highly data-driven, operate under strict regulatory requirements, and handle millions of transactions day to day. To address this gap, our research focuses on a financial services company (FinServCo) that invited us to investigate perceptions of sustainability in their IT function: how it could be put into practice, who is responsible for it, and what the challenges are. We conducted an exploratory qualitative case study using interviews and a focus group with six higher management employees and 16 software engineers comprising various experience levels from junior developers to team leaders. Our study found a clear divergence in how sustainability is perceived between organisational levels. Higher management emphasised technical and economic sustainability, focusing on cloud migration and business continuity through data availability. In contrast, developers highlighted human-centric concerns such as workload management and stress reduction. Scepticism toward organisational initiatives was also evident, with some developers viewing them as a PR strategy. Many participants expressed a preference for a dedicated sustainability team, drawing analogies to internal structures for security governance. The disconnect between organisational goals and individual developer needs highlights the importance of context-sensitive, co-designed interventions.
Engineering Management
šŸ‘ šŸ‘Ž ♄ Save
Paper visualization
Rate this image: šŸ˜ šŸ‘ šŸ‘Ž
Abstract
Large Language Models are transforming software engineering, yet prompt management in practice remains ad hoc, hindering reliability, reuse, and integration into industrial workflows. We present Prompt-with-Me, a practical solution for structured prompt management embedded directly in the development environment. The system automatically classifies prompts using a four-dimensional taxonomy encompassing intent, author role, software development lifecycle stage, and prompt type. To enhance prompt reuse and quality, Prompt-with-Me suggests language refinements, masks sensitive information, and extracts reusable templates from a developer's prompt library. Our taxonomy study of 1108 real-world prompts demonstrates that modern LLMs can accurately classify software engineering prompts. Furthermore, our user study with 11 participants shows strong developer acceptance, with high usability (Mean SUS=73), low cognitive load (Mean NASA-TLX=21), and reported gains in prompt quality and efficiency through reduced repetitive effort. Lastly, we offer actionable insights for building the next generation of prompt management and maintenance tools for software engineering workflows.
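As a rough illustration of the four-dimensional taxonomy, a prompt-management tool might model classified prompts as below; the enum values are placeholders, since the paper's exact labels are not reproduced here.

```python
# Hypothetical data model for the four classification dimensions:
# intent, author role, SDLC stage, and prompt type.
from dataclasses import dataclass
from enum import Enum

class Intent(Enum):
    GENERATE = "generate"
    EXPLAIN = "explain"
    DEBUG = "debug"

class AuthorRole(Enum):
    DEVELOPER = "developer"
    TESTER = "tester"
    ARCHITECT = "architect"

class SDLCStage(Enum):
    DESIGN = "design"
    IMPLEMENTATION = "implementation"
    TESTING = "testing"
    MAINTENANCE = "maintenance"

class PromptType(Enum):
    ZERO_SHOT = "zero-shot"
    FEW_SHOT = "few-shot"
    CHAIN_OF_THOUGHT = "chain-of-thought"

@dataclass
class ClassifiedPrompt:
    text: str
    intent: Intent
    author_role: AuthorRole
    stage: SDLCStage
    prompt_type: PromptType

# An LLM-backed classifier would fill in the four dimensions before the
# prompt is masked, deduplicated, or turned into a reusable template.
example = ClassifiedPrompt(
    text="Write a unit test for the parse_date helper",
    intent=Intent.GENERATE,
    author_role=AuthorRole.TESTER,
    stage=SDLCStage.TESTING,
    prompt_type=PromptType.ZERO_SHOT,
)
```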
Managing teams of data scientists
šŸ‘ šŸ‘Ž ♄ Save
Abstract
The paper bridges two vast areas of research: stochastic team decision problems and convex stochastic programming. New methods developed in the latter are applied to the study of fundamental problems in the former. The main results are concerned with the Lagrangian relaxation of informational and material constraints in convex stochastic team problems.
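Schematically, and with notation assumed for illustration rather than taken from the paper, the relaxation looks like this:

```latex
% A team of agents chooses decision rules x_i, each measurable with
% respect to its own information field F_i (informational constraints),
% subject to an expectation ("material") constraint:
\begin{align*}
  \min_{x_1,\dots,x_n} \ & \mathbb{E}\, f\bigl(x_1(\xi),\dots,x_n(\xi),\xi\bigr) \\
  \text{s.t.}\ & x_i \ \text{is } \mathcal{F}_i\text{-measurable}, \quad i = 1,\dots,n, \\
               & \mathbb{E}\, g\bigl(x_1(\xi),\dots,x_n(\xi),\xi\bigr) \le 0.
\end{align*}
% Lagrangian relaxation prices the coupling constraint with a
% multiplier \lambda \ge 0 (measurability constraints are dualized
% analogously with suitable dual variables):
\[
  L(x,\lambda) = \mathbb{E}\Bigl[\, f\bigl(x(\xi),\xi\bigr)
    + \lambda^{\top} g\bigl(x(\xi),\xi\bigr) \Bigr].
\]
```

The usual payoff is decomposition: once the coupling constraints are priced out, each agent's subproblem can be studied separately and coordinated through the multipliers.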
Data Science Engineering Management
šŸ‘ šŸ‘Ž ♄ Save
University of Milano-Bicocca
Abstract
Developing reliable data enrichment pipelines demands significant engineering expertise. We present Prompt2DAG, a methodology that transforms natural language descriptions into executable Apache Airflow DAGs. We evaluate four generation approaches -- Direct, LLM-only, Hybrid, and Template-based -- across 260 experiments using thirteen LLMs and five case studies to identify optimal strategies for production-grade automation. Performance is measured using a penalized scoring framework that combines reliability with code quality (SAT), structural integrity (DST), and executability (PCT). The Hybrid approach emerges as the optimal generative method, achieving a 78.5% success rate with robust quality scores (SAT: 6.79, DST: 7.67, PCT: 7.76). This significantly outperforms the LLM-only (66.2% success) and Direct (29.2% success) methods. Our findings show that reliability, not intrinsic code quality, is the primary differentiator. Cost-effectiveness analysis reveals the Hybrid method is over twice as efficient as Direct prompting per successful DAG. We conclude that a structured, hybrid approach is essential for balancing flexibility and reliability in automated workflow generation, offering a viable path to democratize data pipeline development.
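To make the target concrete, here is the style of executable Airflow DAG such a generator aims to emit from a plain-English description; the task names and enrichment logic are assumptions for this sketch (Airflow 2.4+ API).

```python
# Illustrative output of a Prompt2DAG-style generator: a two-step
# enrichment pipeline expressed as an Apache Airflow DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_records(**context):
    # Placeholder for the extraction step named in the prompt.
    return [{"id": 1, "city": "Milan"}]

def enrich_records(**context):
    # Placeholder enrichment, e.g., adding a country code per record.
    records = context["ti"].xcom_pull(task_ids="fetch_records")
    return [{**r, "country": "IT"} for r in records]

with DAG(
    dag_id="enrichment_pipeline",
    start_date=datetime(2025, 9, 15),
    schedule="@daily",  # `schedule` replaces `schedule_interval` in 2.4+
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_records",
                           python_callable=fetch_records)
    enrich = PythonOperator(task_id="enrich_records",
                            python_callable=enrich_records)
    fetch >> enrich  # hybrid generation fixes this skeleton; LLMs fill bodies
```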
šŸ‘ šŸ‘Ž ♄ Save
Brookhaven National Lab
Abstract
The collaborative efforts of large communities in science experiments, often comprising thousands of global members, reflect a monumental commitment to exploration and discovery. Recently, advanced and complex data processing has gained increasing importance in science experiments. Data processing workflows typically consist of multiple intricate steps, and the precise specification of resource requirements is crucial for each step to allocate optimal resources for effective processing. Estimating resource requirements in advance is challenging due to a wide range of analysis scenarios, varying skill levels among community members, and the continuously increasing spectrum of computing options. One practical approach to mitigate these challenges involves initially processing a subset of each step to measure precise resource utilization from actual processing profiles before completing the entire step. While this two-staged approach enables processing on optimal resources for most of the workflow, it has drawbacks such as initial inaccuracies leading to potential failures and suboptimal resource usage, along with overhead from waiting for initial processing completion, which is critical for fast-turnaround analyses. In this context, our study introduces a novel pipeline of machine learning models within a comprehensive workflow management system, the Production and Distributed Analysis (PanDA) system. These models employ advanced machine learning techniques to predict key resource requirements, overcoming challenges posed by limited upfront knowledge of characteristics at each step. Accurate forecasts of resource requirements enable informed and proactive decision-making in workflow management, enhancing the efficiency of handling diverse, complex workflows across heterogeneous resources.
AI Insights
  • PanDA now runs a full ML pipeline that predicts memory, CPU, I/O, and walltime with sub‑second latency.
  • 70% of tasks are predicted within 5% of actual usage, cutting idle time dramatically.
  • Future work includes clustering task attributes, adding domain knowledge, and a feedback loop for continuous model refinement.
  • Transfer learning across diverse scientific workflows is proposed to generalize the models beyond the current dataset.
  • The authors cite “pipecomp, a General Framework for the Evaluation of Computational Pipelines” and recommend “Robust Performance Metrics for Imbalanced Classification Problems” for deeper evaluation.
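A minimal sketch of such a resource predictor, with synthetic task features and a scikit-learn regressor standing in for PanDA's production models:

```python
# Toy memory predictor: learn peak memory from coarse task attributes,
# then check what fraction of held-out tasks land within 5% of actual.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features: input size (GB), number of events, cores requested.
X = rng.uniform([1, 1e5, 1], [100, 1e7, 64], size=(500, 3))
# Synthetic target: peak memory (MB), loosely tied to input size and cores.
y = 2000 + 15 * X[:, 0] + 30 * X[:, 2] + rng.normal(0, 200, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor().fit(X_train, y_train)

pred = model.predict(X_test)
within_5pct = np.mean(np.abs(pred - y_test) / y_test < 0.05)
print(f"Held-out tasks predicted within 5% of actual: {within_5pct:.0%}")
```

Per the insights above, the real pipeline predicts memory, CPU, I/O, and walltime jointly and feeds the forecasts into PanDA's workflow-management decisions.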
Data Science Management
šŸ‘ šŸ‘Ž ♄ Save
arXiv:2509.13436v1 [cs.SE]
Abstract
As research increasingly relies on computational methods, the reliability of scientific results depends on the quality, reproducibility, and transparency of research software. Ensuring these qualities is critical for scientific integrity and discovery. This paper asks whether Research Software Science (RSS)--the empirical study of how research software is developed and used--should be considered a form of metascience, the science of science. Classification matters because it could affect recognition, funding, and integration of RSS into research improvement. We define metascience and RSS, compare their principles and objectives, and examine their overlaps. Arguments for classification highlight shared commitments to reproducibility, transparency, and empirical study of research processes. Arguments against portray RSS as a specialized domain focused on a tool rather than on the broader scientific enterprise. Our analysis finds RSS advances core goals of metascience, especially in computational reproducibility, and bridges technical, social, and cognitive aspects of research. Its classification depends on whether one adopts a broad definition of metascience--any empirical effort to improve science--or a narrow one focused on systemic and epistemological structures. We argue RSS is best understood as a distinct interdisciplinary domain that aligns with, and in some definitions fits within, metascience. Recognizing it as such can strengthen its role in improving reliability, justify funding, and elevate software development in research institutions. Regardless of classification, applying scientific rigor to research software ensures the tools of discovery meet the standards of the discoveries themselves.
AI Insights
  • RSS adopts empirical methods akin to Empirical Software Engineering to quantify software quality metrics.
  • The field’s core contribution is a reproducibility framework that maps software artifacts to experimental protocols.
  • Literature such as Bennett’s An Introduction to Metascience and Mausfeld’s Epsilon‑Metascience contextualizes RSS within broader meta‑research debates.
  • Ziemann et al.’s Five Pillars of Computational Reproducibility provides a practical checklist that RSS researchers routinely apply.
  • The FORRT framework offers a training curriculum that integrates open‑source practices with rigorous reproducibility standards.
  • RSS is positioned as an interdisciplinary bridge, linking cognitive science, sociology of science, and software engineering.
  • Recognizing RSS as a distinct domain can unlock targeted funding streams and institutional support for research software development.
AI for Data Science Management
šŸ‘ šŸ‘Ž ♄ Save
Aalto University, Espoo
Abstract
Since its launch in late 2022, ChatGPT has ignited widespread interest in Large Language Models (LLMs) and broader Artificial Intelligence (AI) solutions. As this new wave of AI permeates various sectors of society, we are continually uncovering both the potential and the limitations of existing AI tools. The need for adjustment is particularly significant in Computer Science Education (CSEd), as LLMs have evolved into core coding tools themselves, blurring the line between programming aids and intelligent systems, and reinforcing CSEd's role as a nexus of technology and pedagogy. The findings of our survey indicate that while AI technologies hold potential for enhancing learning experiences, such as through personalized learning paths, intelligent tutoring systems, and automated assessments, there are also emerging concerns. These include the risk of over-reliance on technology, the potential erosion of fundamental cognitive skills, and the challenge of maintaining equitable access to such innovations. Recent advancements represent a paradigm shift, transforming not only the content we teach but also the methods by which teaching and learning take place. Rather than placing the burden of adapting to AI technologies on students, educational institutions must take a proactive role in verifying, integrating, and applying new pedagogical approaches. Such efforts can help ensure that both educators and learners are equipped with the skills needed to navigate the evolving educational landscape shaped by these technological innovations.
AI Insights
  • A meta‑analysis by Wang & Fan shows ChatGPT boosts higher‑order thinking, yet sample sizes remain small.
  • Tianjia Wang et al.’s study on AI assistants reveals mixed instructor perceptions of code‑generation reliability.
  • Bias in LLM outputs is a documented weakness; recent audits report up to 30% demographic skew in code suggestions.
  • Accuracy concerns are highlighted by a 2023 audit that found 15% of ChatGPT‑generated solutions contained logical errors.
  • Educators must acquire “AI fluency” to design equitable prompts, a skill gap identified in the survey’s open‑ended responses.
  • “Artificial Intelligence in Higher Education” by Zeide offers a framework for ethical deployment, recommended for curriculum designers.
  • Interconnected.org and Educause Review provide up‑to‑date case studies on AI‑driven grading pilots across universities.

Interests not found

We did not find any papers matching the interests below. Try different search terms, and consider whether such content exists on arxiv.org.
  • AI for Data Science Engineering
