Papers from 06 to 10 October 2025

Here are the personalized paper recommendations, sorted by relevance
Data Science Development Environment and Productivity
Abstract
Generative AI solutions like GitHub Copilot have been shown to increase the productivity of software developers. Yet prior work remains unclear on the quality of code produced and the challenges of maintaining it in software projects. If quality declines as volume grows, experienced developers face increased workloads reviewing and reworking code from less-experienced contributors. We analyze developer activity in Open Source Software (OSS) projects following the introduction of GitHub Copilot. We find that productivity indeed increases. However, the increase in productivity is primarily driven by less-experienced (peripheral) developers. We also find that code written after the adoption of AI requires more rework. Importantly, the added rework burden falls on the more experienced (core) developers, who review 6.5% more code after Copilot's introduction, but show a 19% drop in their original code productivity. More broadly, this finding raises caution that productivity gains of AI may mask the growing burden of maintenance on a shrinking pool of experts.
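A rough sketch of the before/after comparison the study performs, using pandas on a toy commit log; the column names (author_type, period, loc_added, is_rework) and all numbers are invented for illustration, not the paper's actual data or schema.

import pandas as pd

# Toy commit log: developer group, pre/post-Copilot period, lines added,
# and whether the commit reworks earlier code.
commits = pd.DataFrame({
    "author_type": ["core", "core", "peripheral", "peripheral"] * 2,
    "period":      ["pre"] * 4 + ["post"] * 4,
    "loc_added":   [120, 90, 30, 40, 100, 80, 90, 110],
    "is_rework":   [0, 0, 0, 1, 1, 0, 1, 1],
})

# Productivity (lines added) and rework share per group and period.
summary = commits.groupby(["author_type", "period"]).agg(
    total_loc=("loc_added", "sum"),
    rework_share=("is_rework", "mean"),
)
print(summary)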
Machine Learning Operations
Dhirubhai Ambani University
Abstract
Accurate query runtime prediction is a critical component of effective query optimization in modern database systems. Traditional cost models, such as those used in PostgreSQL, rely on static heuristics that often fail to reflect actual query performance under complex and evolving workloads. This remains an active area of research, with recent work exploring machine learning techniques to replace or augment traditional cost estimators. In this paper, we present a machine learning-based framework for predicting SQL query runtimes using execution plan features extracted from PostgreSQL. Our approach integrates scalar and structural features from execution plans and semantic representations of SQL queries to train predictive models. We construct an automated pipeline for data collection and feature extraction using parameterized TPC-H queries, enabling systematic evaluation of multiple modeling techniques. Unlike prior efforts that focus either on cardinality estimation or on synthetic cost metrics, we model the actual runtimes using fine-grained plan statistics and query embeddings derived from execution traces, to improve the model accuracy. We compare baseline regressors, a refined XGBoost model, and a sequential LSTM-based model to assess their effectiveness in runtime prediction. Our dataset includes over 1000 queries generated from TPC-H query templates executed in PostgreSQL with EXPLAIN ANALYZE. Experimental results show that the XGBoost model significantly outperforms others, achieving a mean squared error of 0.3002 and prediction accuracy within 10% of the true runtime in over 65% of cases. The findings highlight the potential of tree-based learning combined with execution plan features for improving cost estimation in query optimizers.
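A minimal sketch of the core pipeline, assuming plans were captured with EXPLAIN (ANALYZE, FORMAT JSON) in PostgreSQL; the scalar feature set and the toy plans below are illustrative, not the paper's exact features or data.

import json
import numpy as np
from xgboost import XGBRegressor

def plan_features(plan_json):
    root = json.loads(plan_json)[0]["Plan"]
    return [root.get("Total Cost", 0.0),       # planner cost estimate
            root.get("Plan Rows", 0.0),        # estimated cardinality
            root.get("Plan Width", 0.0),       # estimated row width
            float(len(root.get("Plans", [])))] # direct child nodes

# Toy stand-ins for captured plans and their measured runtimes (seconds).
plans = [
    '[{"Plan": {"Total Cost": 120.5, "Plan Rows": 1000, "Plan Width": 32}}]',
    '[{"Plan": {"Total Cost": 980.0, "Plan Rows": 50000, "Plan Width": 64}}]',
]
runtimes = [0.012, 0.310]

X = np.array([plan_features(p) for p in plans])
y = np.array(runtimes)
model = XGBRegressor(n_estimators=50).fit(X, y)
print(model.predict(X))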
AI Insights
  • The reproducible pipeline lets researchers replicate experiments across DBMSs, boosting open‑source query‑optimizer research.
  • The framework supports multi‑system integration, enabling quick porting to engines like Oracle or MySQL.
  • Adaptive learning is flagged as future work, hinting at online model updates as workloads shift.
  • A noted weakness is the absence of scalability tests, urging larger‑scale real‑world benchmarking.
  • ML is defined as statistical models that learn from data without explicit programming, matching AI taxonomy.
  • The authors also trialed a sequential LSTM, showing deep learning’s potential despite tree‑based models winning.
Eindhoven University of Technology
Abstract
In this work, we present LOTUS (Learning to Learn with Optimal Transport for Unsupervised Scenarios), a simple yet effective method to perform model selection for multiple unsupervised machine learning (ML) tasks such as outlier detection and clustering. Our intuition behind this work is that a machine learning pipeline will perform well on a new dataset if it previously worked well on datasets with a similar underlying data distribution. We use Optimal Transport distances to find this similarity between unlabeled tabular datasets and recommend machine learning pipelines with one unified single method on two downstream unsupervised tasks: outlier detection and clustering. We demonstrate the effectiveness of our approach with experiments against strong baselines and show that LOTUS is a promising first step toward model selection for multiple unsupervised ML tasks.
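A small sketch of the LOTUS idea under stated assumptions: measure similarity between unlabeled tabular datasets with an Optimal Transport distance (via the POT library) and recommend the pipeline that worked best on the nearest previously seen dataset. The meta-train datasets and pipeline labels are invented.

import numpy as np
import ot  # POT: pip install pot

def ot_distance(X_a, X_b):
    a = np.full(len(X_a), 1.0 / len(X_a))  # uniform weights over rows
    b = np.full(len(X_b), 1.0 / len(X_b))
    M = ot.dist(X_a, X_b)                  # pairwise squared-Euclidean costs
    return ot.emd2(a, b, M)                # exact OT cost

rng = np.random.default_rng(0)
meta_train = {  # dataset -> pipeline that previously worked well on it
    "gaussian_blobs": (rng.normal(0, 1, (50, 4)), "KMeans(k=3)"),
    "heavy_tailed":   (rng.standard_t(2, (50, 4)), "IsolationForest"),
}
X_new = rng.normal(0, 1, (40, 4))  # new unlabeled dataset

best = min(meta_train, key=lambda k: ot_distance(X_new, meta_train[k][0]))
print("recommended pipeline:", meta_train[best][1])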
Machine Learning Lifecycle
Data Science problems
Abstract
Analytics play an important role in modern business. Companies adapt data science lifecycles to their culture to seek productivity and improve their competitiveness. Data science lifecycles are an important contributing factor in starting and ending projects that are data dependent. Data science and machine learning life cycles comprise a series of steps involved in a project. A typical life cycle is depicted as a linear or cyclical model, and in a traditional data science life cycle it is usually possible to start the process again after reaching the end of the cycle. This paper suggests a new technique for incorporating the data science life cycle into business problems that have a clear end goal. The new technique, called the spiral technique, is introduced to emphasize versatility, agility and an iterative approach to business processes.
AI Insights
  • The Spiral Model embeds exit checkpoints after each revolution, letting teams stop when business‑defined thresholds are met.
  • By turning the ML lifecycle into a goal‑driven spiral, teams gain accountability and avoid unbounded iteration costs.
  • In a turnover‑prediction case study, the spiral halted once ROC‑AUC ≥ 0.85, proving exit‑criteria efficacy; see the sketch after this list.
  • The technique shines when projects have clear exit criteria, such as designing retention policies or forecasting churn.
  • Weaknesses arise when objectives shift mid‑cycle; careful checkpoint design is essential to maintain alignment.
  • Recommended reading: “Hidden Technical Debt in Machine Learning Systems” and “Software Engineering for Machine Learning” for deeper process insights.
  • Exit checkpoints are business‑defined criteria that signal when a project can be terminated, ensuring resource efficiency.
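A minimal sketch of the exit-checkpoint loop, using the turnover case study's ROC-AUC >= 0.85 criterion; the dataset and model below are toy stand-ins for a real business problem.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

TARGET_AUC = 0.85                      # business-defined exit criterion
for revolution in range(1, 6):         # each pass is one spiral revolution
    model = RandomForestClassifier(n_estimators=50 * revolution, random_state=0)
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"revolution {revolution}: AUC = {auc:.3f}")
    if auc >= TARGET_AUC:              # exit checkpoint reached
        print("exit criterion met; stopping the spiral")
        break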
Abstract
Recent Machine Learning (ML) approaches have shown increased performance on benchmarks but at the cost of escalating computational demands. Hardware, algorithmic and carbon optimizations have been proposed to curb energy consumption and environmental impacts. Can these strategies lead to sustainable ML model training? Here, we estimate the environmental impacts associated with training notable AI systems over the last decade, including Large Language Models, with a focus on the life cycle of graphics cards. Our analysis reveals two critical trends: First, the impacts of graphics cards production have increased steadily over this period; Second, energy consumption and environmental impacts associated with training ML models have increased exponentially, even when considering reduction strategies such as location shifting to places with less carbon intensive electricity mixes. Optimization strategies do not mitigate the impacts induced by model training, evidencing rebound effect. We show that the impacts of hardware must be considered over the entire life cycle rather than the sole use phase in order to avoid impact shifting. Our study demonstrates that increasing efficiency alone cannot ensure sustainability in ML. Mitigating the environmental impact of AI also requires reducing AI activities and questioning the scale and frequency of resource-intensive training.
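A back-of-envelope sketch of the accounting the paper argues for: operational emissions from training energy (GPU power draw, hours, datacenter PUE, grid carbon intensity) plus an embodied term for graphics-card production. Every number below is an illustrative placeholder, not a figure from the study.

def training_footprint(gpus, tdp_watts, hours, pue,
                       grid_kgco2_per_kwh, embodied_kgco2_per_gpu):
    energy_kwh = gpus * tdp_watts / 1000 * hours * pue
    operational = energy_kwh * grid_kgco2_per_kwh
    embodied = gpus * embodied_kgco2_per_gpu  # amortization across trainings ignored
    return energy_kwh, operational, embodied

kwh, op, emb = training_footprint(
    gpus=1000, tdp_watts=400, hours=720, pue=1.2,
    grid_kgco2_per_kwh=0.4, embodied_kgco2_per_gpu=150.0)
print(f"{kwh:,.0f} kWh, {op / 1000:.1f} t CO2 operational, {emb / 1000:.1f} t CO2 embodied")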
Model Monitoring
Allianz Versicherungs-AG
Abstract
In a dynamic landscape where portfolios and environments evolve, maintaining the accuracy of pricing models is critical. To the best of our knowledge, this is the first study to systematically examine concept drift in non-life insurance pricing. We (i) provide an overview of the relevant literature and commonly used methodologies, clarify the distinction between virtual drift and concept drift, and explain their implications for long-run model performance; (ii) review and formalize common performance measures, including the Gini index and deviance loss, and articulate their interpretation; (iii) derive the asymptotic distribution of the Gini index, enabling valid inference and hypothesis testing; and (iv) present a standardized monitoring procedure that indicates when refitting is warranted. We illustrate the framework using a modified real-world portfolio with induced concept drift and discuss practical considerations and pitfalls.
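A sketch of one monitoring step under simplifying assumptions: compute a Lorenz-curve Gini index of actual losses ordered by model score in two time windows and watch for a drop. This is a simple Gini variant for illustration; the paper formalizes the measure and derives its asymptotic distribution for proper hypothesis testing.

import numpy as np

def gini(score, loss):
    order = np.argsort(score)                     # rank policies by predicted risk
    lorenz = np.cumsum(loss[order]) / loss.sum()  # cumulative loss share
    return 1.0 - 2.0 * lorenz.mean()              # 1 - 2 * area under Lorenz curve

rng = np.random.default_rng(1)
score = rng.gamma(2.0, 1.0, 5000)                    # model's risk scores
loss_t0 = score * rng.lognormal(0, 0.5, 5000)        # window 0: model aligned
loss_t1 = score[::-1] * rng.lognormal(0, 0.5, 5000)  # window 1: drifted relationship
print("Gini window 0:", round(gini(score, loss_t0), 3))
print("Gini window 1:", round(gini(score, loss_t1), 3))  # a drop flags drift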
Tennessee Tech University
Abstract
In real-world applications, computational constraints often require transforming large models into smaller, more efficient versions through model compression. While these techniques aim to reduce size and computational cost without sacrificing performance, their evaluations have traditionally focused on the trade-off between size and accuracy, overlooking the aspect of model faithfulness. This limited view is insufficient for high-stakes domains like healthcare, finance, and criminal justice, where compressed models must remain faithful to the behavior of their original counterparts. This paper presents a novel approach to evaluating faithfulness in compressed models, moving beyond standard metrics. We introduce and demonstrate a set of faithfulness metrics that capture how model behavior changes post-compression. Our contributions include introducing techniques to assess predictive consistency between the original and compressed models using model agreement, and applying chi-squared tests to detect statistically significant changes in predictive patterns across both the overall dataset and demographic subgroups, thereby exposing shifts that aggregate fairness metrics may obscure. We demonstrate our approaches by applying quantization and pruning to artificial neural networks (ANNs) trained on three diverse and socially meaningful datasets. Our findings show that high accuracy does not guarantee faithfulness, and our statistical tests detect subtle yet significant shifts that are missed by standard metrics, such as Accuracy and Equalized Odds. The proposed metrics provide a practical and more direct method for ensuring that efficiency gains through compression do not compromise the fairness or faithfulness essential for trustworthy AI.
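A compact sketch of the two checks described above, on synthetic predictions: agreement between the original and compressed model, and a chi-squared test for shifts in the predicted-label distribution, overall and within a demographic subgroup. The labels, flips, and group column are placeholders.

import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
y_orig = rng.integers(0, 2, 2000)                  # original model's predictions
flip = (rng.random(2000) < 0.08) & (y_orig == 1)   # compression flips some 1s to 0
y_comp = np.where(flip, 0, y_orig)                 # compressed model's predictions
group = rng.integers(0, 2, 2000)                   # e.g., a protected attribute

print("agreement:", (y_orig == y_comp).mean())

def label_shift_pvalue(a, b):
    table = np.array([[np.sum(a == 0), np.sum(a == 1)],
                      [np.sum(b == 0), np.sum(b == 1)]])
    return chi2_contingency(table).pvalue

print("overall p-value:", label_shift_pvalue(y_orig, y_comp))
print("subgroup p-value:", label_shift_pvalue(y_orig[group == 1], y_comp[group == 1]))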
Machine Learning Deployment
InstaDeep
Abstract
Open-access multispectral imagery from missions like Landsat 8-9 and Sentinel-2 has fueled the development of geospatial foundation models (GFMs) for humanitarian and environmental applications. Yet, their deployment remains limited by (i) the absence of automated geospatial data pipelines and (ii) the large size of fine-tuned models. Existing GFMs lack workflows for processing raw satellite imagery, and downstream adaptations often retain the full complexity of the original encoder. We present InstaGeo, an open-source, end-to-end framework that addresses these challenges by integrating: (1) automated data curation to transform raw imagery into model-ready datasets; (2) task-specific model distillation to derive compact, compute-efficient models; and (3) seamless deployment as interactive web-map applications. Using InstaGeo, we reproduced datasets from three published studies and trained models with marginal mIoU differences of -0.73 pp for flood mapping, -0.20 pp for crop segmentation, and +1.79 pp for desert locust prediction. The distilled models are up to 8x smaller than standard fine-tuned counterparts, reducing FLOPs and CO2 emissions with minimal accuracy loss. Leveraging InstaGeo's streamlined data pipeline, we also curated a larger crop segmentation dataset, achieving a state-of-the-art mIoU of 60.65%, a 12 pp improvement over prior baselines. Moreover, InstaGeo enables users to progress from raw data to model deployment within a single working day. By unifying data preparation, model compression, and deployment, InstaGeo transforms research-grade GFMs into practical, low-carbon tools for real-time, large-scale Earth observation. This approach shifts geospatial AI toward data quality and application-driven innovation. Source code, datasets, and model checkpoints are available at: https://github.com/instadeepai/InstaGeo-E2E-Geospatial-ML.git
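A sketch of the task-specific distillation step in PyTorch, assuming the usual recipe of a temperature-softened KL term plus segmentation cross-entropy; shapes, temperature, and weighting are illustrative, and the actual InstaGeo training code lives in the linked repository.

import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # student/teacher logits: (B, C, H, W); labels: (B, H, W)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

student = torch.randn(2, 4, 16, 16, requires_grad=True)  # compact student output
teacher = torch.randn(2, 4, 16, 16)                      # frozen fine-tuned teacher
labels = torch.randint(0, 4, (2, 16, 16))
print(distill_loss(student, teacher, labels))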
AI Insights
  • InstaGeo supports multi‑temporal foundation models like Prithvi‑Eo‑2.0, enabling land‑cover, crop, and climate analysis from diverse satellite sources.
  • Its pipeline ingests raw Landsat, Sentinel‑2, and MODIS imagery, auto‑converting it into model‑ready datasets.
  • Task‑specific knowledge distillation compresses GFMs up to eight times smaller, cutting FLOPs and CO₂ with minimal accuracy loss.
  • All code, datasets, and distilled checkpoints are open‑source on GitHub, boosting reproducibility.
  • Researchers can deploy distilled models as interactive web‑maps within a single working day.
  • InstaGeo unifies curation, compression, and deployment, steering geospatial AI toward low‑carbon, data‑quality solutions.
Machine Learning Resilience
Linköping University, SE
Abstract
For the signed graph associated to a deep neural network, one can compute the frustration level, i.e., test how close or distant the graph is to structural balance. For all the pretrained deep convolutional neural networks we consider, we find that the frustration is always less than expected from null models. From a statistical physics point of view, and in particular in reference to an Ising spin glass model, the reduced frustration indicates that the amount of disorder encoded in the network is less than in the null models. From a functional point of view, low frustration (i.e., proximity to structural balance) means that the function representing the network behaves near-monotonically, i.e., more similarly to a monotone function than in the null models. Evidence of near-monotonic behavior along the partial order determined by frustration is observed for all networks we consider. This confirms that the class of deep convolutional neural networks tends to have a more ordered behavior than expected from null models, and suggests a novel form of implicit regularization.
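A toy illustration of the underlying notion: a signed cycle is balanced when the product of its edge signs is positive, and frustration measures how far the graph is from having no unbalanced cycles. The snippet counts frustrated triangles on a hand-made graph; the paper computes the frustration level of signed graphs built from pretrained CNN weights (the sign of each weight giving the edge sign).

import itertools
import networkx as nx

G = nx.Graph()
G.add_edge("a", "b", sign=+1)
G.add_edge("b", "c", sign=-1)
G.add_edge("a", "c", sign=+1)  # (+)(-)(+) < 0: frustrated triangle
G.add_edge("b", "d", sign=-1)
G.add_edge("c", "d", sign=-1)  # (-)(-)(-) < 0: frustrated triangle

def frustrated_triangles(G):
    count = 0
    for u, v, w in itertools.combinations(G.nodes, 3):
        if G.has_edge(u, v) and G.has_edge(v, w) and G.has_edge(u, w):
            count += G[u][v]["sign"] * G[v][w]["sign"] * G[u][w]["sign"] < 0
    return count

print("frustrated triangles:", frustrated_triangles(G))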
Data Science Development Tools
George Mason University
Abstract
Large Language Models (LLMs) equipped with external tools have demonstrated enhanced performance on complex reasoning tasks. The widespread adoption of this tool-augmented reasoning is hindered by the scarcity of domain-specific tools. For instance, in domains such as physics question answering, suitable and specialized tools are often missing. Recent work has explored automating tool creation by extracting reusable functions from Chain-of-Thought (CoT) reasoning traces; however, these approaches face a critical scalability bottleneck. As the number of generated tools grows, storing them in an unstructured collection leads to significant retrieval challenges, including an expanding search space and ambiguity between function-related tools. To address this, we propose a systematic approach to automatically refactor an unstructured collection of tools into a structured tool library. Our system first generates discrete, task-specific tools and clusters them into semantically coherent topics. Within each cluster, we introduce a multi-agent framework to consolidate scattered functionalities: a code agent refactors code to extract shared logic and creates versatile, aggregated tools, while a reviewing agent ensures that these aggregated tools maintain the complete functional capabilities of the original set. This process transforms numerous question-specific tools into a smaller set of powerful, aggregated tools without loss of functionality. Experimental results demonstrate that our approach significantly improves tool retrieval accuracy and overall reasoning performance across multiple reasoning tasks. Furthermore, our method shows enhanced scalability compared with baselines as the number of question-specific tools increases.
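A sketch of the first stage only, under stated assumptions: embed tool descriptions with TF-IDF, cluster them into topics, and retrieve by cosine similarity. The LLM-driven consolidation stage (code agent plus reviewing agent) is not shown, and the tool descriptions are invented.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

tools = [
    "compute projectile range given initial speed and launch angle",
    "compute time of flight for projectile motion",
    "solve a quadratic equation for its real roots",
    "find roots of a polynomial with numpy",
]
vec = TfidfVectorizer()
X = vec.fit_transform(tools)
topics = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("topic per tool:", topics.tolist())

query = "compute the range of a projectile launched at a given angle"
sims = cosine_similarity(vec.transform([query]), X)[0]
print("best tool:", tools[sims.argmax()])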
AI Insights
  • The tool library boosts weaker LLMs, enabling them to solve complex problems that would otherwise be out of reach.
  • Current tools deliver sufficient functionality, yet their lack of robustness leads to occasional errors.
  • Enhancing user‑friendliness—e.g., clearer interfaces and error handling—could reduce misuse by less‑experienced LLMs.
  • Robustness is defined as the library’s ability to handle unexpected inputs without failure.
  • A curated set of foundational papers (BERT, RoBERTa, Attention Is All You Need) offers a solid backdrop for extending tool capabilities.
  • Future work should focus on automated robustness testing to preempt edge‑case failures.
MLOps
Abstract
Organizational efforts to utilize and operationalize artificial intelligence (AI) are often accompanied by substantial challenges, including scalability, maintenance, and coordination across teams. In response, the concept of Machine Learning Operations (MLOps) has emerged as a set of best practices that integrate software engineering principles with the unique demands of managing the ML lifecycle. Yet, empirical evidence on whether and how these practices support users in developing and operationalizing AI applications remains limited. To address this gap, this study analyzes over 8,000 user reviews of AI development platforms from G2.com. Using zero-shot classification, we measure review sentiment toward nine established MLOps practices, including continuous integration and delivery (CI/CD), workflow orchestration, reproducibility, versioning, collaboration, and monitoring. Seven of the nine practices show a significant positive relationship with user satisfaction, suggesting that effective MLOps implementation contributes tangible value to AI development. However, organizational context also matters: reviewers from small firms discuss certain MLOps practices less frequently, suggesting that organizational context influences the prevalence and salience of MLOps, though firm size does not moderate the MLOps-satisfaction link. This indicates that once applied, MLOps practices are perceived as universally beneficial across organizational settings.
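A minimal sketch of the measurement step, assuming the Hugging Face zero-shot classification pipeline; the model choice and the example review are ours, not necessarily the study's exact setup.

from transformers import pipeline

clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
practices = ["CI/CD", "workflow orchestration", "reproducibility",
             "versioning", "collaboration", "monitoring"]
review = ("Setting up pipelines was painless and rollbacks are versioned, "
          "but monitoring dashboards feel like an afterthought.")
result = clf(review, candidate_labels=practices, multi_label=True)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.2f}")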
Fault tolerance
University of New Mexico
Abstract
We propose a scheme for the fault-tolerant implementation of arbitrary Clifford circuits. To achieve this, we extend previous work on flag gadgets for syndrome extraction to a general framework that flags any Clifford circuit. This framework opens new pathways toward universal fault tolerance by allowing transversal implementation of $T$ gates alongside fault-tolerant realization of selected non-transversal Clifford gates using flags. The construction we present allows a Clifford circuit consisting of $n$ two-qubit gates and $O(n)$ single-qubit gates acting upon physical qubits in a code of distance $d$ to be made fault tolerant to distance $d$ using $O(d^2 \log(nd^2\log n))$ ancilla qubits and $O(nd^2 \log(nd^2 \log n))$ extra CNOTs. Beyond asymptotic analysis, we demonstrate our construction by implementing the non-transversal logical Hadamard gate for the [[15,1,3]] code, which has transversal T, and compare to alternative approaches for universality using this code. We also apply our construction to magic-state preparation, general state preparation using Clifford circuits, and data-syndrome codes.
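As a rough feel for the stated overhead, the snippet below plugs numbers into the abstract's asymptotic bounds with all constants omitted, so the outputs illustrate scaling only, not literal qubit counts.

import math

def flag_overhead(n, d):
    # O(d^2 log(n d^2 log n)) ancillas, O(n d^2 log(n d^2 log n)) extra CNOTs
    ancillas = d**2 * math.log(n * d**2 * math.log(n))
    return ancillas, n * ancillas

for n, d in [(100, 3), (100, 5), (1000, 5)]:
    a, c = flag_overhead(n, d)
    print(f"n={n}, d={d}: ~{a:.0f} ancillas, ~{c:.0f} extra CNOTs (up to constants)")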
AI Insights
  • Guarantees fault‑tolerance for any Clifford circuit, beyond code‑specific gadgets.
  • Uses binary matrices A, A′, B, B′ to encode X‑type flags, meta‑flags, Z‑type flags, and meta‑flags.
  • Random search over these matrices found flag sets that beat unflagged Hadamard in logical error rate.
  • Adding algebraic structure to the matrices further suppresses errors, hinting at a design‑robustness link.
  • Only space‑time stabilizer products are measured, preventing logical leakage during syndrome extraction.
  • Provides an upper‑bound resource estimate; any tailored flag gadget will use no more ancilla or CNOTs than this scheme.
  • The framework also covers magic‑state prep, arbitrary state synthesis, and data‑syndrome codes, unifying fault‑tolerant primitives.
Abstract
In this paper we present a new framework for integrated, distributed, and reliable systems. The proposed framework uses three parts to increase the satisfaction and performance it delivers. First we analyze previous frameworks related to integrated systems, then we present the new proposed framework as an improvement over them and discuss its different phases. Finally we compare simulation results of the new framework with those of previous ones. In the FIDRS framework, a heterogeneous distributed database technique is used to improve performance and speed in responding to users, which simultaneously improves the dependability and reliability of the framework. In the extraction phase of the new framework we use the RMSD algorithm, which decreases response time on large databases. Using the FIDRS framework, we succeeded in increasing the efficiency, performance, and reliability of integrated systems and in removing some of the problems of previous frameworks.
Machine Learning Validation
Stanford University
Abstract
Cross-validation is a standard tool for obtaining an honest assessment of the performance of a prediction model. The commonly used version repeatedly splits data, trains the prediction model on the training set, evaluates the model performance on the test set, and averages the model performance across different data splits. A well-known criticism is that such a cross-validation procedure does not directly estimate the performance of the particular model recommended for future use. In this paper, we propose a new method to estimate the performance of a model trained on a specific (random) training set. A naive estimator can be obtained by applying the model to a disjoint testing set. Surprisingly, cross-validation estimators computed from other random splits can be used to improve this naive estimator within a random-effects model framework. We develop two estimators -- a hierarchical Bayesian estimator and an empirical Bayes estimator -- that perform similarly to or better than both the conventional cross-validation estimator and the naive single-split estimator. Simulations and a real-data example demonstrate the superior performance of the proposed method.
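A simplified sketch of the random-effects intuition: treat each split's test error as a noisy draw around a split-specific true performance, then shrink the naive single-split estimate toward the cross-validation mean. The moment-based shrinkage weight below is our simplification, not the paper's full hierarchical Bayesian or empirical Bayes machinery.

import numpy as np

errors = np.array([0.21, 0.18, 0.25, 0.20, 0.23])  # test error per random split
naive = errors[0]   # split 0 trained the model actually recommended for use
m = 200             # test-set size per split

sigma2 = naive * (1 - naive) / m              # within-split (binomial) noise
tau2 = max(errors.var(ddof=1) - sigma2, 0.0)  # between-split variance estimate
w = tau2 / (tau2 + sigma2)                    # how much to trust split 0 alone
eb = w * naive + (1 - w) * errors.mean()
print(f"naive = {naive:.3f}, CV mean = {errors.mean():.3f}, shrunken = {eb:.3f}")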
Abstract
AI-powered attacks on Learning with Errors (LWE), an important hard math problem in post-quantum cryptography, rival or outperform "classical" attacks on LWE under certain parameter settings. Despite the promise of this approach, a dearth of accessible data limits AI practitioners' ability to study and improve these attacks. Creating LWE data for AI model training is time- and compute-intensive and requires significant domain expertise. To fill this gap and accelerate AI research on LWE attacks, we propose the TAPAS datasets, a Toolkit for Analysis of Post-quantum cryptography using AI Systems. These datasets cover several LWE settings and can be used off-the-shelf by AI practitioners to prototype new approaches to cracking LWE. This work documents TAPAS dataset creation, establishes attack performance baselines, and lays out directions for future work.
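A minimal generator for the kind of raw data TAPAS packages, following the standard LWE construction b = A s + e (mod q); the parameters are toy-sized, whereas real post-quantum settings use far larger dimensions and carefully chosen error distributions.

import numpy as np

rng = np.random.default_rng(0)
n, m, q = 32, 64, 3329        # secret dimension, sample count, modulus (toy)
s = rng.integers(0, q, n)                        # secret vector
A = rng.integers(0, q, (m, n))                   # public random matrix
e = rng.normal(0, 2.0, m).round().astype(int)    # small Gaussian-ish error
b = (A @ s + e) % q                              # noisy inner products
print(A.shape, b.shape)                          # (64, 32) (64,)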
Online inference
Scale AI, University of A
Abstract
Rubrics provide a flexible way to train LLMs on open-ended long-form answers where verifiable rewards are not applicable and human preferences provide coarse signals. Prior work shows that reinforcement learning with rubric-based rewards leads to consistent gains in LLM post-training. Most existing approaches rely on rubrics that remain static over the course of training. Such static rubrics, however, are vulnerable to reward-hacking type behaviors and fail to capture emergent desiderata that arise during training. We introduce Online Rubrics Elicitation (OnlineRubrics), a method that dynamically curates evaluation criteria in an online manner through pairwise comparisons of responses from current and reference policies. This online process enables continuous identification and mitigation of errors as training proceeds. Empirically, this approach yields consistent improvements of up to 8% over training exclusively with static rubrics across AlpacaEval, GPQA, ArenaHard as well as the validation sets of expert questions and rubrics. We qualitatively analyze the elicited criteria and identify prominent themes such as transparency, practicality, organization, and reasoning.
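A sketch of one elicitation round, assuming a generic judge callable that wraps an LLM API; the prompt wording and the elicit_criteria helper are our stand-ins, not the paper's exact implementation. Criteria surfaced by the pairwise comparison are merged into the rubric used for subsequent reward scoring.

def elicit_criteria(judge, question, current_resp, reference_resp, rubric):
    comparison = judge(
        f"Question: {question}\n"
        f"Response A (current policy): {current_resp}\n"
        f"Response B (reference policy): {reference_resp}\n"
        "List evaluation criteria, one per line, that distinguish the better "
        "response and are missing from this rubric:\n" + "\n".join(rubric)
    )
    new_items = [ln.strip("-• ") for ln in comparison.splitlines() if ln.strip()]
    return rubric + [c for c in new_items if c not in rubric]

# Toy judge standing in for a real LLM call.
fake_judge = lambda _: "- States assumptions explicitly\n- Gives units for all quantities"
rubric = ["Answers the question directly", "Contains no factual errors"]
print(elicit_criteria(fake_judge, "Explain drag force.", "...", "...", rubric))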
AI Insights
  • OnlineRubrics refines evaluation criteria on the fly by comparing current and reference policy responses, turning static rubrics into a living, adaptive system.
  • Each rubric item is a binary, mutually exclusive, and collectively exhaustive check—yes or no—ensuring objective, unambiguous scoring.
  • The elicited rubric themes—transparency, practicality, organization, reasoning—mirror emergent priorities that surface during training.
  • Dynamic rubric refinement cuts reward‑hacking risks and boosts performance by up to 8% on benchmarks like AlpacaEval and GPQA.
  • Key literature—attention mechanisms, BERT pre‑training, deep learning surveys—grounds the method in state‑of‑the‑art NLP research.
Abstract
The performance gains of LLMs have historically been driven by scaling up model size and training data. However, the rapidly diminishing availability of high-quality training data is introducing a fundamental bottleneck, shifting the focus of research toward inference-time scaling. This paradigm uses additional computation at the time of deployment to substantially improve LLM performance on downstream tasks without costly model re-training. This review systematically surveys the diverse techniques contributing to this new era of inference-time scaling, organizing the rapidly evolving field into two comprehensive perspectives: Output-focused and Input-focused methods. Output-focused techniques encompass complex, multi-step generation strategies, including reasoning (e.g., CoT, ToT, ReAct), various search and decoding methods (e.g., MCTS, beam search), training for long CoT (e.g., RLVR, GRPO), and model ensemble methods. Input-focused techniques are primarily categorized by few-shot and RAG, with RAG as the central focus. The RAG section is further detailed through a structured examination of query expansion, data, retrieval and reranker, LLM generation methods, and multi-modal RAG.
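As a tiny taste of the output-focused family, here is self-consistency: sample N candidate answers and majority-vote. The sample_answer function is a stand-in for a temperature > 0 LLM call, not a real API.

from collections import Counter
import random

def sample_answer(question):
    # Stand-in for a stochastic LLM: right ~70% of the time.
    return "42" if random.random() < 0.7 else random.choice(["41", "43"])

def self_consistency(question, n=15):
    votes = Counter(sample_answer(question) for _ in range(n))
    return votes.most_common(1)[0][0]

random.seed(0)
print(self_consistency("What is 6 * 7?"))  # majority vote stabilizes the answer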
Machine Learning Testing
Abstract
Search-based test-generation algorithms have countless configuration options. Users rarely adjust these options and usually stick to the default values, which may not lead to the best possible results. Tuning an algorithm's hyperparameters is a method to find better hyperparameter values, but it typically comes with high resource demands. Meta-heuristic search algorithms, which effectively solve the test-generation problem, have also been proposed as an efficient means of tuning parameters. In this work we explore the use of differential evolution as a means for tuning the hyperparameters of the DynaMOSA and MIO many-objective search algorithms as implemented in the Pynguin framework. Our results show that significant improvement of the resulting test suite's coverage is possible with the tuned DynaMOSA algorithm and that differential evolution is more efficient than basic grid search.
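A sketch of the tuning setup with SciPy's differential evolution over two DynaMOSA-style hyperparameters; the coverage surface below is a synthetic stand-in, whereas each real evaluation would run Pynguin and measure actual branch coverage.

from scipy.optimize import differential_evolution

def neg_coverage(params):
    crossover_rate, mutation_rate = params
    # Synthetic smooth surface with an optimum near (0.75, 0.07).
    cov = 0.8 - (crossover_rate - 0.75) ** 2 - 20 * (mutation_rate - 0.07) ** 2
    return -cov  # minimize negative coverage

bounds = [(0.0, 1.0),   # crossover rate
          (0.0, 0.2)]   # mutation rate
result = differential_evolution(neg_coverage, bounds, seed=0, maxiter=50)
print("best params:", result.x, "coverage:", -result.fun)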
Chinese Academy of Sciences
Abstract
MLIR (Multi-Level Intermediate Representation) has rapidly become a foundational technology for modern compiler frameworks, enabling extensibility across diverse domains. However, ensuring the correctness and robustness of MLIR itself remains challenging. Existing fuzzing approaches-based on manually crafted templates or rule-based mutations-struggle to generate sufficiently diverse and semantically valid test cases, making it difficult to expose subtle or deep-seated bugs within MLIR's complex and evolving code space. In this paper, we present FLEX, a novel self-adaptive fuzzing framework for MLIR. FLEX leverages neural networks for program generation, a perturbed sampling strategy to encourage diversity, and a feedback-driven augmentation loop that iteratively improves its model using both crashing and non-crashing test cases. Starting from a limited seed corpus, FLEX progressively learns valid syntax and semantics and autonomously produces high-quality test inputs. We evaluate FLEX on the upstream MLIR compiler against four state-of-the-art fuzzers. In a 30-day campaign, FLEX discovers 80 previously unknown bugs-including multiple new root causes and parser bugs-while in 24-hour fixed-revision comparisons, it detects 53 bugs (over 3.5x as many as the best baseline) and achieves 28.2% code coverage, outperforming the next-best tool by 42%. Ablation studies further confirm the critical role of both perturbed generation and diversity augmentation in FLEX's effectiveness.
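A toy rendition of the perturbed sampling idea: add noise to a generator's next-token logits before softmax sampling, trading a little validity for extra diversity. The vocabulary and logits below are placeholders for FLEX's neural program generator.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["func", "arith.addi", "arith.muli", "return", "%0", "%1"]
logits = np.array([2.0, 1.5, 1.4, 0.5, 1.0, 1.0])  # model's next-token scores

def sample_token(logits, noise_scale=0.8):
    perturbed = logits + rng.normal(0, noise_scale, logits.shape)
    p = np.exp(perturbed - perturbed.max())
    return vocab[rng.choice(len(vocab), p=p / p.sum())]

print([sample_token(logits) for _ in range(8)])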
AI Insights
  • Large language models synthesize syntactically valid MLIR programs, cutting seed‑creation effort.
  • Perturbed sampling injects controlled noise, boosting semantic diversity beyond classic mutation.
  • A feedback‑driven loop retrains the model on crashing and non‑crashing inputs, enabling autonomous self‑improvement.
  • Ablation shows removing perturbed generation drops bug yield by ~40%, proving its necessity.
  • The authors suggest extending the method to deep‑learning compilers and JavaScript engines, hinting at wide applicability.
  • Core terms: MLIR – a multi‑level intermediate representation; Compiler Fuzzing – automated random‑program testing to expose compiler faults.

Interests not found

We did not find any papers that match the interests below. Try other terms, and also consider whether the content exists on arxiv.org.
  • Machine Learning Infrastructure
You can edit or add more interests any time.
