Papers from 13 to 17 October, 2025

Here are your personalized paper recommendations, sorted by relevance.
Data Science Engineering
Beijing Institute of Technology
Abstract
A growing trend in modern data analysis is the integration of data management with learning, guided by accuracy, latency, and cost requirements. In practice, applications draw data of different formats from many sources; meanwhile, objectives and budgets change over time. Existing systems handle these applications across databases, analysis libraries, and tuning services. Such fragmentation leads to complex user interaction, limited adaptability, suboptimal performance, and poor extensibility across components. To address these challenges, we present Aixel, a unified, adaptive, and extensible system for AI-powered data analysis. The system organizes work across four layers: application, task, model, and data. The task layer provides a declarative interface to capture user intent, which is parsed into an executable operator plan. An optimizer compiles and schedules this plan to meet specified goals in accuracy, latency, and cost. The task layer coordinates the execution of data and model operators, with built-in support for reuse and caching to improve efficiency. The model layer offers versioned storage for indexes, metadata, tensors, and model artifacts. It supports adaptive construction, task-aligned drift detection, and safe updates that reuse shared components. The data layer provides unified data management capabilities, including indexing, constraint-aware discovery, task-aligned selection, and comprehensive feature management. With these layers, Aixel delivers a user-friendly, adaptive, efficient, and extensible system.
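To make the task layer concrete, here is a minimal sketch of how a declarative task with accuracy, latency, and cost goals might compile into an operator plan. All names here (TaskSpec, Operator, plan_task) and the greedy budget check are illustrative assumptions, not Aixel's actual interface.

```python
# Hypothetical sketch of a declarative task compiled into an operator plan,
# in the spirit of Aixel's task layer; none of these names are Aixel's API.
from dataclasses import dataclass

@dataclass
class TaskSpec:
    intent: str                      # e.g. "score leads from sales + support data"
    max_latency_ms: float = 500.0    # latency goal
    min_accuracy: float = 0.90       # accuracy goal (unused by this toy planner)
    max_cost_usd: float = 1.00       # cost budget

@dataclass
class Operator:
    name: str
    est_latency_ms: float
    est_cost_usd: float

def plan_task(spec: TaskSpec, candidates: dict[str, list[Operator]]) -> list[Operator]:
    """Toy optimizer: per plan step, pick the cheapest operator that keeps
    the cumulative plan within the latency and cost budgets."""
    plan, latency, cost = [], 0.0, 0.0
    for step, options in candidates.items():
        feasible = [op for op in sorted(options, key=lambda o: o.est_cost_usd)
                    if latency + op.est_latency_ms <= spec.max_latency_ms
                    and cost + op.est_cost_usd <= spec.max_cost_usd]
        if not feasible:
            raise RuntimeError(f"no operator for step {step!r} fits the budget")
        plan.append(feasible[0])
        latency += feasible[0].est_latency_ms
        cost += feasible[0].est_cost_usd
    return plan

# Two steps, each with a fast-but-costly and a slow-but-cheap variant.
candidates = {
    "select_data": [Operator("index_scan", 40, 0.01), Operator("full_scan", 400, 0.001)],
    "run_model":   [Operator("distilled", 80, 0.02), Operator("full_llm", 900, 0.40)],
}
print([op.name for op in plan_task(TaskSpec(intent="score leads"), candidates)])
# -> ['full_scan', 'distilled']: the cheap scan still leaves room for a model
#    under the 500 ms latency goal.
```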
Humboldt-Universität zu Berlin
Abstract
Over the past decade, the proliferation of public and enterprise data lakes has fueled intensive research into data discovery, aiming to identify the most relevant data from vast and complex corpora to support diverse user tasks. Significant progress has been made through the development of innovative index structures, similarity measures, and querying infrastructures. Despite these advances, a critical aspect remains overlooked: relevance is time-varying. Existing discovery methods largely ignore this temporal dimension, especially when explicit date/time metadata is missing. To fill this gap, we outline a vision for a data discovery system that incorporates the temporal dimension of data. Specifically, we define the problem of temporally-valid data discovery and argue that addressing it requires techniques for version discovery, temporal lineage inference, change log synthesis, and time-aware data discovery. We then present a system architecture to deliver these techniques, before we summarize research challenges and opportunities. As such, we lay the foundation for a new class of data discovery systems, transforming how we interact with evolving data lakes.
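As one concrete ingredient of the techniques above, the sketch below shows time-aware version lookup over a dataset's version history. It assumes each version's valid-from timestamp has already been recovered (the harder inference steps, e.g. from schema and metadata cues, are the paper's subject); the VersionTimeline class is hypothetical.

```python
# Hedged sketch: answer "which version of this dataset was valid at time t?"
# assuming version timestamps have already been inferred.
from bisect import bisect_right
from datetime import datetime

class VersionTimeline:
    """Versions of one logical dataset, ordered by their valid-from time;
    each version is considered superseded at the next version's valid-from."""
    def __init__(self, versions):
        self.versions = sorted(versions)          # list of (valid_from, version_id)
        self.starts = [ts for ts, _ in self.versions]

    def valid_at(self, t: datetime):
        """Return the version id current at time t, or None if t predates all."""
        i = bisect_right(self.starts, t) - 1
        return self.versions[i][1] if i >= 0 else None

timeline = VersionTimeline([
    (datetime(2024, 1, 1), "sales_v1"),
    (datetime(2024, 6, 1), "sales_v2"),
    (datetime(2025, 2, 1), "sales_v3"),
])
print(timeline.valid_at(datetime(2024, 7, 15)))   # -> sales_v2
```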
AI Insights
  • Blend’s hybrid cache mixes in‑memory storage with on‑demand time travel, cutting latency for large‑scale discovery.
  • A classifier separates unrelated datasets from temporally linked versions, enabling precise lineage construction.
  • Heuristics extract change logs from version histories, giving users a navigable timeline of data evolution.
  • Content‑based and version‑specific queries are decoupled, letting the system infer query intent from minimal input.
  • The architecture infers temporal context even without explicit timestamps, using schema and metadata cues.
  • Recommended: “Delta Lake: High‑performance ACID table storage over cloud object stores” for robust versioning foundations.
Engineering Management
Karlsruhe Institute of Technology
Abstract
The Dominating H-Pattern problem generalizes the classical k-Dominating Set problem: for a fixed pattern H and a given graph G, the goal is to find an induced subgraph S of G such that (1) S is isomorphic to H, and (2) S forms a dominating set in G. Fine-grained complexity results show that on worst-case inputs, any significant improvement over the naive brute-force algorithm is unlikely, as this would refute the Strong Exponential Time Hypothesis. Nevertheless, recent work by Dransfeld et al. (ESA 2025) reveals significant improvement potential, particularly in sparse graphs. We ask: can algorithms with conditionally almost-optimal worst-case performance solve the Dominating H-Pattern, for selected patterns H, efficiently on practical inputs? We develop and experimentally evaluate several approaches on a large benchmark of diverse datasets, including baseline approaches using the Glasgow Subgraph Solver (GSS), the SAT solver Kissat, and the ILP solver Gurobi. Notably, while a straightforward implementation of the algorithms with conditionally close-to-optimal worst-case guarantees performs comparably to existing solvers, we propose a tailored Branch-and-Bound approach, supplemented with careful pruning techniques, that achieves improvements of up to two orders of magnitude on our test instances.
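For intuition about the search space, here is a naive generate-and-prune baseline, not the paper's tailored Branch-and-Bound: enumerate k-subsets of vertices, discard those failing the cheap domination test, and only then run the costlier induced-isomorphism check against H. The graph encoding and function names are illustrative.

```python
# Naive baseline for Dominating H-Pattern with a small fixed pattern H.
# Graphs are dicts mapping each vertex to its set of neighbors.
from itertools import combinations, permutations

def closed_neighborhood(G, S):
    dom = set(S)
    for v in S:
        dom |= G[v]
    return dom

def induced_iso(G, S, H):
    """Check whether G[S] is isomorphic to H by brute force over vertex
    orderings; acceptable only because |H| is a small constant."""
    hs = list(H)
    for perm in permutations(S):
        phi = dict(zip(hs, perm))
        if all((v in H[u]) == (phi[v] in G[phi[u]])
               for u in hs for v in hs if u != v):
            return True
    return False

def dominating_pattern(G, H):
    k = len(H)
    for S in combinations(G, k):
        # Prune: the cheap domination test runs before the expensive iso test.
        if closed_neighborhood(G, S) != set(G):
            continue
        if induced_iso(G, S, H):
            return S
    return None

# Example: find a dominating induced 3-vertex path (P3) in a 5-vertex path.
P3 = {0: {1}, 1: {0, 2}, 2: {1}}
G = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}
print(dominating_pattern(G, P3))   # -> ('b', 'c', 'd')
```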
Monash University, Prince
Abstract
Software practitioners are discussing GenAI transformations in software project management openly and widely. To understand the state of affairs, we performed a grey literature review of 47 publicly available practitioner sources, including blogs, articles, and industry reports. We found that software project managers primarily perceive GenAI as an "assistant", "copilot", or "friend" rather than as a "PM replacement", with GenAI supporting the automation of routine tasks, predictive analytics, communication and collaboration, and agile practices that lead to project success. Practitioners emphasize responsible GenAI usage given concerns such as hallucinations, ethics and privacy, and the lack of emotional intelligence and human judgment. We present upskilling requirements for software project managers in the GenAI era, mapped to the Project Management Institute's talent triangle, and share key recommendations for both practitioners and researchers.
AI Insights
  • GenAI can automate routine PM tasks, freeing managers to focus on strategic decisions.
  • Predictive analytics from GenAI surface risk hotspots before they materialize.
  • Bias in language models can skew backlog prioritization, demanding bias‑mitigation protocols.
  • Data quality gaps in source artifacts undermine GenAI’s planning accuracy, necessitating rigorous data hygiene.
  • Continuous model retraining is essential to keep GenAI outputs aligned with evolving project contexts.
  • Key literature: “Augmented Agile: Human‑Centered AI‑Assisted Software Management” and “From Backlogs to Bots: GenAI’s Impact on Agile Role Evolution.”
  • Definition: GenAI generates novel content from inputs, while Software PM orchestrates resources to deliver software.
AI for Data Science Management
Argonne National Lab, The
Abstract
We present the Federated Inference Resource Scheduling Toolkit (FIRST), a framework enabling Inference-as-a-Service across distributed High-Performance Computing (HPC) clusters. FIRST provides cloud-like access to diverse AI models, such as Large Language Models (LLMs), on existing HPC infrastructure. Leveraging Globus Auth and Globus Compute, the system allows researchers to run parallel inference workloads via an OpenAI-compliant API in private, secure environments. This cluster-agnostic API allows requests to be distributed across federated clusters, targeting numerous hosted models. FIRST supports multiple inference backends (e.g., vLLM), auto-scales resources, maintains "hot" nodes for low-latency execution, and offers both high-throughput batch and interactive modes. The framework addresses the growing demand for private, secure, and scalable AI inference in scientific workflows, allowing researchers to generate billions of tokens daily on-premises without relying on commercial cloud infrastructure.
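Since the API is OpenAI-compliant, a client interaction might look like the sketch below. The gateway URL, the use of a Globus Auth token as the API key, and the model name are illustrative assumptions; a real deployment defines its own endpoint, auth flow, and hosted models.

```python
# Hypothetical client call against a FIRST gateway via the standard openai
# Python client; endpoint, token handling, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://first.example-hpc.org/v1",  # assumed federated gateway URL
    api_key="GLOBUS_AUTH_ACCESS_TOKEN",           # placeholder Globus Auth token
)

# The cluster-agnostic API routes the request to whichever federated cluster
# hosts the requested model (served by a backend such as vLLM).
resp = client.chat.completions.create(
    model="meta-llama/Llama-3-70B-Instruct",      # example hosted model name
    messages=[{"role": "user", "content": "Summarize today's simulation logs."}],
)
print(resp.choices[0].message.content)
```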
AI Insights
  • Sophia’s hybrid caching‑pruning‑quantization pipeline cuts memory use by 70% while keeping latency under 10 ms on a 32‑GPU node.
  • Globus Auth lets researchers spin up private LLM instances with a single click, preserving data sovereignty.
  • Benchmarks show Sophia outperforms vLLM by 1.8× throughput on GPT‑4‑like workloads, thanks to PagedAttention.
  • FIRST’s “hot” node strategy keeps GPUs pre‑loaded, enabling sub‑second interactive inference for simulations.
  • Must‑read: “Large Language Models: A Survey” and “Efficient Memory Management for Large Language Model Serving with PagedAttention”.
  • LLM – a neural network trained on billions of tokens to generate coherent, context‑aware text.
  • Globus Auth – a research identity platform that grants fine‑grained access control for distributed AI services.

Interests not found

We did not find any papers matching the interests below. Try other terms, and consider whether such content exists on arxiv.org.
  • Managing tech teams
  • Managing teams of data scientists
  • Data Science Engineering Management
  • Data Science Management
  • AI for Data Science Engineering
You can edit or add more interests any time.

Unsubscribe from these updates