Papers from 22 to 26 September 2025

Here are your personalized paper recommendations, sorted by relevance.
Data Science Development Environment and Productivity
1mg.com
Abstract
We present a comprehensive real-world evaluation of AI-assisted software development tools deployed at enterprise scale. Over one year, 300 engineers across multiple teams integrated an in-house AI platform (DeputyDev) that combines code generation and automated review capabilities into their daily workflows. Through rigorous cohort analysis, our study demonstrates statistically significant productivity improvements, including an overall 31.8% reduction in PR review cycle time. Developer adoption was strong, with 85% satisfaction for code review features and 93% expressing a desire to continue using the platform. Adoption patterns showed systematic scaling from 4% engagement in month 1 to 83% peak usage by month 6, stabilizing at 60% active engagement. Top adopters achieved a 61% increase in code volume pushed to production, contributing to approximately 30 to 40% of code shipped to production through this tool, accounting for an overall 28% increase in code shipment volume. Unlike controlled benchmark evaluations, our longitudinal analysis provides empirical evidence from production environments, revealing both the transformative potential and practical deployment challenges of integrating AI into enterprise software development workflows.
AI Insights
  • Propensity score matching balanced adopter and non-adopter cohorts across teams, revealing nuanced adoption effects (a hypothetical sketch of this matching step follows this list).
  • Multilevel modeling controlled for team‑level variance, isolating true productivity gains.
  • Data quality, bias, and transparency surfaced as the top challenges for AI code review.
  • Fine‑tuned transformer and LLM reviewers improved accuracy but risked overfitting.
  • Code‑generation usage lagged behind review, hinting at a trust gap developers must bridge.
  • Cohen‑style power analysis confirmed the 31.8% cycle‑time reduction was statistically robust.
  • Long Code Arena and Qiu et al.’s benchmarks set a rigorous baseline for long‑context code model evaluation.
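For readers who want to see what the matching step in such a cohort analysis can look like in code, here is a hypothetical propensity-score-matching sketch. The column names, the "adopter" flag, and the cycle-time outcome are illustrative, not the study's actual schema.

```python
# Hypothetical sketch of a propensity-score-matched cohort comparison, loosely
# mirroring the kind of analysis described above. Columns are illustrative.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def matched_effect(df: pd.DataFrame, covariates: list[str]) -> float:
    """Adopter vs. non-adopter gap in review cycle time after 1:1 matching."""
    # 1. Propensity score: P(adopter | covariates)
    ps_model = LogisticRegression(max_iter=1000).fit(df[covariates], df["adopter"])
    df = df.assign(ps=ps_model.predict_proba(df[covariates])[:, 1])

    treated = df[df["adopter"] == 1]
    control = df[df["adopter"] == 0]

    # 2. Match each adopter to the nearest non-adopter by propensity score
    nn = NearestNeighbors(n_neighbors=1).fit(control[["ps"]])
    _, idx = nn.kneighbors(treated[["ps"]])
    matched_control = control.iloc[idx.ravel()]

    # 3. Average outcome difference over matched pairs
    return float(treated["cycle_time_hours"].mean()
                 - matched_control["cycle_time_hours"].mean())
```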
Sage Bionetworks, Oregon Health & Science University
Abstract
Continuous and reliable access to curated biological data repositories is indispensable for accelerating rigorous scientific inquiry and fostering reproducible research. Centralized repositories, though widely used, are vulnerable to single points of failure arising from cyberattacks, technical faults, natural disasters, or funding and political uncertainties. This can lead to widespread data unavailability, data loss, integrity compromises, and substantial delays in critical research, ultimately impeding scientific progress. Centralizing essential scientific resources in a single geopolitical or institutional hub is inherently dangerous, as any disruption can paralyze diverse ongoing research. The rapid acceleration of data generation, combined with an increasingly volatile global landscape, necessitates a critical re-evaluation of the sustainability of centralized models. Implementing federated and decentralized architectures presents a compelling and future-oriented pathway to substantially strengthen the resilience of scientific data infrastructures, thereby mitigating vulnerabilities and ensuring the long-term integrity of data. Here, we examine the structural limitations of centralized repositories, evaluate federated and decentralized models, and propose a hybrid framework for resilient, FAIR, and sustainable scientific data stewardship. Such an approach offers a significant reduction in exposure to governance instability, infrastructural fragility, and funding volatility, and also fosters fairness and global accessibility. The future of open science depends on integrating these complementary approaches to establish a globally distributed, economically sustainable, and institutionally robust infrastructure that safeguards scientific data as a public good, further ensuring continued accessibility, interoperability, and preservation for generations to come.
AI Insights
  • EOSC’s federated nodes already host 1 million genomes, a living model of distributed stewardship.
  • ELIXIR’s COVID‑19 response proved community pipelines can scale to pandemic‑grade data volumes.
  • The Global Biodata Coalition’s roadmap envisions a cross‑border mesh that outpaces single‑point failure risks.
  • DeSci employs blockchain provenance to give researchers immutable audit trails for every dataset.
  • NIH’s Final Data Policy now mandates FAIR compliance, nudging institutions toward hybrid decentralized architectures.
  • DeSci still struggles with interoperability, as heterogeneous metadata schemas block seamless cross‑platform queries.
  • Privacy‑by‑design in distributed repositories remains a top research gap, inviting novel cryptographic solutions.
Machine Learning Operations
Abstract
Learning, whether natural or artificial, is a process of selection. It starts with a set of candidate options and selects the more successful ones. In the case of machine learning the selection is done based on empirical estimates of prediction accuracy of candidate prediction rules on some data. Due to randomness of data sampling the empirical estimates are inherently noisy, leading to selection under uncertainty. The book provides statistical tools to obtain theoretical guarantees on the outcome of selection under uncertainty. We start with concentration of measure inequalities, which are the main statistical instrument for controlling how much an empirical estimate of expectation of a function deviates from the true expectation. The book covers a broad range of inequalities, including Markov's, Chebyshev's, Hoeffding's, Bernstein's, Empirical Bernstein's, Unexpected Bernstein's, kl, and split-kl. We then study the classical (offline) supervised learning and provide a range of tools for deriving generalization bounds, including Occam's razor, Vapnik-Chervonenkis analysis, and PAC-Bayesian analysis. The latter is further applied to derive generalization guarantees for weighted majority votes. After covering the offline setting, we turn our attention to online learning. We present the space of online learning problems characterized by environmental feedback, environmental resistance, and structural complexity. A common performance measure in online learning is regret, which compares performance of an algorithm to performance of the best prediction rule in hindsight, out of a restricted set of prediction rules. We present tools for deriving regret bounds in stochastic and adversarial environments, and under full information and bandit feedback.
Abstract
Gravity inversion is the problem of estimating subsurface density distributions from observed gravitational field data. We consider the two-dimensional (2D) case, in which recovering density models from one-dimensional (1D) measurements leads to an underdetermined system with substantially more model parameters than measurements, making the inversion ill-posed and non-unique. Recent advances in machine learning have motivated data-driven approaches for gravity inversion. We first design a convolutional neural network (CNN) trained to directly map gravity anomalies to density fields, where a customized data structure is introduced to enhance the inversion performance. To further investigate generative modeling, we employ Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), reformulating inversion as a latent-space optimization constrained by the forward operator. In addition, we assess whether classical iterative solvers such as Gradient Descent (GD), GMRES, LGMRES, and a recently proposed Improved Conjugate Gradient (ICG) method can refine CNN-based initial guesses and improve inversion accuracy. Our results demonstrate that CNN inversion not only provides the most reliable reconstructions but also significantly outperforms previously reported methods. Generative models remain promising but unstable, and iterative solvers offer only marginal improvements, underscoring the persistent ill-posedness of gravity inversion.
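A minimal, hypothetical sketch of the CNN mapping described above (1D gravity anomaly profile in, 2D density grid out). The architecture, station count, and grid size are illustrative assumptions, not the paper's network.

```python
# Hypothetical "1D gravity anomaly -> 2D density grid" network; sizes are toy choices.
import torch
import torch.nn as nn

class GravityInversionNet(nn.Module):
    def __init__(self, n_stations: int = 64, grid_h: int = 32, grid_w: int = 64):
        super().__init__()
        self.grid_h, self.grid_w = grid_h, grid_w
        # Encode the 1D anomaly profile with 1D convolutions ...
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Flatten(),
        )
        # ... then decode into a dense 2D density image.
        self.decoder = nn.Sequential(
            nn.Linear(32 * n_stations, 256), nn.ReLU(),
            nn.Linear(256, grid_h * grid_w),
        )

    def forward(self, anomaly: torch.Tensor) -> torch.Tensor:
        # anomaly: (batch, n_stations) gravity measurements along the surface
        z = self.encoder(anomaly.unsqueeze(1))          # (batch, 32 * n_stations)
        rho = self.decoder(z)                           # (batch, grid_h * grid_w)
        return rho.view(-1, self.grid_h, self.grid_w)   # (batch, H, W) density model

model = GravityInversionNet()
print(model(torch.randn(8, 64)).shape)  # torch.Size([8, 32, 64])
```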
Machine Learning Lifecycle
Abstract
Prognostic information is essential for decision-making in breast cancer management. Recently trials have predominantly focused on genomic prognostication tools, even though clinicopathological prognostication is less costly and more widely accessible. Machine learning (ML), transfer learning and ensemble integration offer opportunities to build robust prognostication frameworks. We evaluate this potential to improve survival prognostication in breast cancer by comparing de-novo ML, transfer learning from a pre-trained prognostic tool and ensemble integration. Data from the MA.27 trial was used for model training, with external validation on the TEAM trial and a SEER cohort. Transfer learning was applied by fine-tuning the pre-trained prognostic tool PREDICT v3, de-novo ML included Random Survival Forests and Extreme Gradient Boosting, and ensemble integration was realized through a weighted sum of model predictions. Transfer learning, de-novo RSF, and ensemble integration improved calibration in MA.27 over the pre-trained model (ICI reduced from 0.042 in PREDICT v3 to <=0.007) while discrimination remained comparable (AUC increased from 0.738 in PREDICT v3 to 0.744-0.799). Invalid PREDICT v3 predictions were observed in 23.8-25.8% of MA.27 individuals due to missing information. In contrast, ML models and ensemble integration could predict survival regardless of missing information. Across all models, patient age, nodal status, pathological grading and tumor size had the highest SHAP values, indicating their importance for survival prognostication. External validation in SEER, but not in TEAM, confirmed the benefits of transfer learning, RSF and ensemble integration. This study demonstrates that transfer learning, de-novo RSF, and ensemble integration can improve prognostication in situations where relevant information for PREDICT v3 is lacking or where a dataset shift is likely.
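To make the ensemble-integration step concrete, here is a hedged sketch of a weighted sum of model predictions that, as in the study, can still produce an output when one component (for example PREDICT v3 with missing inputs) cannot. The weights, model names, and NaN handling are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of weighted-sum ensemble integration over survival predictions.
import numpy as np

def ensemble_survival(pred_predictv3, pred_rsf, pred_xgb, weights=(0.4, 0.3, 0.3)):
    """Combine per-patient survival probabilities from three models.

    Entries may be NaN when a model (e.g. PREDICT v3 with missing inputs)
    cannot predict; weights are renormalised per patient over available models.
    """
    preds = np.vstack([pred_predictv3, pred_rsf, pred_xgb])     # (3, n_patients)
    w = np.asarray(weights, dtype=float)[:, None]                # (3, 1)
    mask = ~np.isnan(preds)                                      # models that produced output
    w_eff = np.where(mask, w, 0.0)
    w_eff = w_eff / w_eff.sum(axis=0, keepdims=True)             # renormalise column-wise
    return np.nansum(preds * w_eff, axis=0)

# Example: the second patient has no valid PREDICT v3 prediction.
print(ensemble_survival(np.array([0.82, np.nan]),
                        np.array([0.78, 0.70]),
                        np.array([0.80, 0.74])))
```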
Abstract
Reliable detection of bearing faults is essential for maintaining the safety and operational efficiency of rotating machinery. While recent advances in machine learning (ML), particularly deep learning, have shown strong performance in controlled settings, many studies fail to generalize to real-world applications due to methodological flaws, most notably data leakage. This paper investigates the issue of data leakage in vibration-based bearing fault diagnosis and its impact on model evaluation. We demonstrate that common dataset partitioning strategies, such as segment-wise and condition-wise splits, introduce spurious correlations that inflate performance metrics. To address this, we propose a rigorous, leakage-free evaluation methodology centered on bearing-wise data partitioning, ensuring no overlap between the physical components used for training and testing. Additionally, we reformulate the classification task as a multi-label problem, enabling the detection of co-occurring fault types and the use of prevalence-independent metrics such as Macro AUROC. Beyond preventing leakage, we also examine the effect of dataset diversity on generalization, showing that the number of unique training bearings is a decisive factor for achieving robust performance. We evaluate our methodology on three widely adopted datasets: CWRU, Paderborn University (PU), and University of Ottawa (UORED-VAFCLS). This study highlights the importance of leakage-aware evaluation protocols and provides practical guidelines for dataset partitioning, model selection, and validation, fostering the development of more trustworthy ML systems for industrial fault diagnosis applications.
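The bearing-wise partitioning and prevalence-independent scoring described above can be sketched in a few lines; the feature extraction, label layout, and classifier choice here are placeholders rather than the paper's pipeline.

```python
# Leakage-free, bearing-wise evaluation: all segments from one physical bearing
# go to either train or test, never both; scoring uses macro-averaged AUROC.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupShuffleSplit
from sklearn.multioutput import MultiOutputClassifier

X = np.random.randn(600, 32)                    # vibration features per segment (placeholder)
Y = np.random.randint(0, 2, size=(600, 3))      # multi-label faults: inner race, outer race, ball
bearing_id = np.repeat(np.arange(20), 30)       # 20 physical bearings, 30 segments each

# Segment-wise splits leak bearing identity; grouping by bearing prevents that.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, Y, groups=bearing_id))

clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=200, random_state=0))
clf.fit(X[train_idx], Y[train_idx])

# Prevalence-independent evaluation: macro AUROC over fault types.
proba = np.column_stack([p[:, 1] for p in clf.predict_proba(X[test_idx])])
print("Macro AUROC:", roc_auc_score(Y[test_idx], proba, average="macro"))
```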
Model Monitoring
National Taiwan University
Abstract
Model editing, the process of efficiently modifying factual knowledge in pre-trained language models, is critical for maintaining their accuracy and relevance. However, existing editing methods often introduce unintended side effects, degrading model performance in unpredictable ways. While much research has focused on improving editing algorithms, the role of the target knowledge's intrinsic properties remains a significant, underexplored factor. This paper addresses this gap by first proposing the "Knowledge Spectrum," a systematic framework for categorizing knowledge based on its real-world popularity, the model's pre-edit familiarity, and the linguistic structure of the eliciting question. Our empirical analysis reveals that these characteristics are strong predictors of editing success and stability. Informed by these findings, we introduce the "Knowledge-Diagnostic Framework," an adaptive strategy that tailors editing intensity to the diagnosed difficulty of a knowledge item. We demonstrate that this framework significantly improves success rates for challenging edits while optimizing computational resources. Our work provides a more comprehensive understanding of the factors governing model editing.
AI Insights
  • The 32% reduction in editing time and cost on a 2k‑item LLaMA 3.1 8B benchmark demonstrates the efficiency of the Knowledge‑Diagnostic Framework.
  • Adaptive editing narrows the performance gap between hard and easy knowledge edits while preserving the model’s general reasoning abilities.
  • Predictors of popularity, familiarity, and question type hint that future work should add more linguistic and contextual dimensions.
  • Limitations include a narrow set of knowledge characteristics and uncertain generalizability beyond the LLaMA 3.1 8B benchmark.
  • See Roi Cohen et al. (2023) on ripple effects and Junfeng Fang et al. (2025) on null‑space editing for deeper context.
  • The cost‑benefit analysis highlights the need for resource‑aware editing in large‑scale language models.
Graduate Center, CUNY and
Abstract
Model discovery aims to uncover governing differential equations of dynamical systems directly from experimental data. Benchmarking such methods is essential for tracking progress and understanding trade-offs in the field. While prior efforts have focused mostly on identifying single equations, typically framed as symbolic regression, there remains a lack of comprehensive benchmarks for discovering dynamical models. To address this, we introduce MDBench, an open-source benchmarking framework for evaluating model discovery methods on dynamical systems. MDBench assesses 12 algorithms on 14 partial differential equations (PDEs) and 63 ordinary differential equations (ODEs) under varying levels of noise. Evaluation metrics include derivative prediction accuracy, model complexity, and equation fidelity. We also introduce seven challenging PDE systems from fluid dynamics and thermodynamics, revealing key limitations in current methods. Our findings illustrate that linear methods and genetic programming methods achieve the lowest prediction error for PDEs and ODEs, respectively. Moreover, linear models are in general more robust against noise. MDBench accelerates the advancement of model discovery methods by offering a rigorous, extensible benchmarking framework and a rich, diverse collection of dynamical system datasets, enabling systematic evaluation, comparison, and improvement of equation accuracy and robustness.
AI Insights
  • LLMs can learn and generate mathematical expressions from data, expanding symbolic regression’s reach.
  • Physics‑informed neural networks solve forward and inverse nonlinear PDEs, marrying deep learning with analytical physics.
  • Benchmark suites like PDEBench and MDBench standardize datasets and metrics, enabling reproducible comparison across noise levels.
  • Linear regression methods excel at PDE discovery, while genetic programming outperforms others on ODEs, revealing algorithmic specialization.
  • Linear models’ robustness to noise makes them preferable when experimental data are contaminated.
  • Symbolic regression’s impact spans fluid dynamics, cosmology, and materials science, showcasing its interdisciplinary power.
Machine Learning Deployment
University of Westminster
Abstract
This research investigates how Machine Learning (ML) algorithms can assist in workload allocation strategies by detecting tasks with node affinity operators (referred to as constraint operators), which constrain their execution to a limited number of nodes. Using real-world Google Cluster Data (GCD) workload traces and the AGOCS framework, the study extracts node attributes and task constraints, then analyses them to identify suitable node-task pairings. It focuses on tasks that can be executed on either a single node or fewer than a thousand out of 12.5k nodes in the analysed GCD cluster. Task constraint operators are compacted, pre-processed with one-hot encoding, and used as features in a training dataset. Various ML classifiers, including Artificial Neural Networks, K-Nearest Neighbours, Decision Trees, Naive Bayes, Ridge Regression, Adaptive Boosting, and Bagging, are fine-tuned and assessed for accuracy and F1-scores. The final ensemble voting classifier model achieved 98% accuracy and a 1.5-1.8% misclassification rate for tasks with a single suitable node.
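A hedged sketch of the pipeline the abstract outlines, in which one-hot encoded constraint operators feed an ensemble voting classifier; the column names, toy data, and the subset of base classifiers shown are illustrative assumptions, not the paper's exact configuration.

```python
# Toy stand-in for the compacted constraint-operator table and voting ensemble.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

tasks = pd.DataFrame({
    "op_arch":     ["==", "==", "!=", "==", "!="],   # illustrative constraint operators
    "op_kernel":   [">=", "==", ">=", "==", "=="],
    "single_node": [1, 0, 1, 0, 0],                  # label: exactly one suitable node?
})

features = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["op_arch", "op_kernel"])]
)
ensemble = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=3)),
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("bag", BaggingClassifier(n_estimators=25)),
    ],
    voting="hard",
)
model = Pipeline([("features", features), ("clf", ensemble)])
model.fit(tasks[["op_arch", "op_kernel"]], tasks["single_node"])
print(model.predict(tasks[["op_arch", "op_kernel"]]))
```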
AI Insights
  • Deep reinforcement learning is used to adapt task placement to real‑time cluster state changes.
  • Fuzzy machine learning models uncertainty in node affinity constraints for graceful task‑node mapping.
  • Leszek Sliwko’s AI‑driven load balancer balances resource use across 12.5k nodes via reinforcement signals.
  • The paper stresses balancing performance and interpretability, proposing hybrid explainable AI for scheduling.
  • "The Elements of Statistical Learning" guides selection and tuning of the ensemble classifiers.
  • "Neural Network Design" offers practical steps for building the ANN that reached 98% accuracy.
  • Progressive Neural Networks and incremental deep CNN learning are cited as future directions for continual policy adaptation.
Oak Ridge National Laboratory
Abstract
Federated Learning (FL) is critical for edge and High Performance Computing (HPC) where data is not centralized and privacy is crucial. We present OmniFed, a modular framework designed around decoupling and clear separation of concerns for configuration, orchestration, communication, and training logic. Its architecture supports configuration-driven prototyping and code-level override-what-you-need customization. We also support different topologies, mixed communication protocols within a single deployment, and popular training algorithms. It also offers optional privacy mechanisms including Differential Privacy (DP), Homomorphic Encryption (HE), and Secure Aggregation (SA), as well as compression strategies. These capabilities are exposed through well-defined extension points, allowing users to customize topology and orchestration, learning logic, and privacy/compression plugins, all while preserving the integrity of the core system. We evaluate multiple models and algorithms to measure various performance metrics. By unifying topology configuration, mixed-protocol communication, and pluggable modules in one stack, OmniFed streamlines FL deployment across heterogeneous environments. Github repository is available at https://github.com/at-aaims/OmniFed.
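As a concrete example of the kind of training algorithm such a framework orchestrates, here is a generic FedAvg-style aggregation sketch. It is not OmniFed's API (see the linked repository for that); the weighting by local dataset size is the standard textbook choice.

```python
# Generic FedAvg aggregation: weighted average of client state_dicts.
from collections import OrderedDict
import torch

def fedavg(client_states: list[OrderedDict], client_sizes: list[int]) -> OrderedDict:
    """Average client model parameters, weighted by local dataset size."""
    total = float(sum(client_sizes))
    global_state = OrderedDict()
    for key in client_states[0]:
        global_state[key] = sum(
            state[key].float() * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return global_state

# Usage: each client trains locally, ships its state_dict, the server averages.
model = torch.nn.Linear(4, 2)
clients = [torch.nn.Linear(4, 2).state_dict() for _ in range(3)]
model.load_state_dict(fedavg(clients, client_sizes=[100, 50, 150]))
```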
Machine Learning Resilience
Abstract
Resilience broadly describes a quality of withstanding perturbations. Measures of system resilience have gathered increasing attention across applied disciplines, yet existing metrics often lack computational accessibility and generalizability. In this work, we review the literature on resilience measures through the lens of dynamical systems theory and numerical methods. In this context, we reformulate pertinent measures into a general form and introduce a resource-efficient algorithm designed for their parallel numerical estimation. By coupling these measures with a global continuation of attractors, we enable their consistent evaluation along system parameter changes. The resulting framework is modular and easily extendable, allowing for the incorporation of new resilience measures as they arise. We demonstrate the framework on a range of illustrative dynamical systems, revealing key differences in how resilience changes across systems. This approach provides a more global perspective compared to traditional linear stability metrics used in local bifurcation analysis, which can overlook inconspicuous but significant shifts in system resilience. This work opens the door to genuinely novel lines of inquiry, such as the development of new early warning signals for critical transitions or the discovery of universal scaling behaviours. All code and computational tools are provided as an open-source contribution to the DynamicalSystems.jl software library.
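As one concrete member of the family of measures discussed, a toy Monte Carlo estimate of basin stability (the fraction of perturbed initial states that return to a given attractor) can be sketched as follows. The bistable example system and sampling box are illustrative choices, not the paper's case studies, and the paper's own implementation lives in Julia's DynamicalSystems.jl rather than Python.

```python
# Toy basin-stability estimate for the bistable system dx/dt = x - x^3,
# whose stable equilibria sit at x = +1 and x = -1.
import numpy as np
from scipy.integrate import solve_ivp

def basin_stability(attractor: float, n_samples: int = 500, box=(-2.0, 2.0), seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    x0 = rng.uniform(*box, size=n_samples)           # perturbed initial conditions
    hits = 0
    for x in x0:
        sol = solve_ivp(lambda t, y: y - y**3, (0.0, 50.0), [x])
        hits += np.isclose(sol.y[0, -1], attractor, atol=1e-2)
    return hits / n_samples                          # fraction returning to this attractor

# With a symmetric sampling box, each equilibrium should hold roughly half the basin volume.
print(basin_stability(+1.0), basin_stability(-1.0))
```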
Abstract
Risk assessments for advanced AI systems require evaluating both the models themselves and their deployment contexts. We introduce the Societal Capacity Assessment Framework (SCAF), an indicators-based approach to measuring a society's vulnerability, coping capacity, and adaptive capacity in response to AI-related risks. SCAF adapts established resilience analysis methodologies to AI, enabling organisations to ground risk management in insights about country-level deployment conditions. It can also support stakeholders in identifying opportunities to strengthen societal preparedness for emerging AI capabilities. By bridging disparate literatures and the "context gap" in AI evaluation, SCAF promotes more holistic risk assessment and governance as advanced AI systems proliferate globally.
Data Science Development Tools
Warsaw University of the
Abstract
Computer games, as fully controlled simulated environments, have been utilized in significant scientific studies demonstrating the application of Reinforcement Learning (RL). Gaming and esports are key areas influenced by the application of Artificial Intelligence (AI) and Machine Learning (ML) solutions at scale. Tooling simplifies scientific workloads and is essential for developing the gaming and esports research area. In this work, we present "SC2Tools", a toolset containing multiple submodules responsible for working with, and producing larger datasets. We provide a modular structure of the implemented tooling, leaving room for future extensions where needed. Additionally, some of the tools are not StarCraft 2 exclusive and can be used with other types of data for dataset creation. The tools we present were leveraged in creating one of the largest StarCraft 2 tournament datasets to date with a separate PyTorch and PyTorch Lightning application programming interface (API) for easy access to the data. We conclude that alleviating the burden of data collection, preprocessing, and custom code development is essential for less technically proficient researchers to engage in the growing gaming and esports research area. Finally, our solution provides some foundational work toward normalizing experiment workflow in StarCraft 2.
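For readers curious what a PyTorch-style access layer over such a replay dataset might look like, here is a purely hypothetical sketch; the class name, file layout, and record fields are illustrative, not SC2Tools' actual API.

```python
# Hypothetical PyTorch Dataset over pre-processed per-replay feature files.
import json
from pathlib import Path

import torch
from torch.utils.data import Dataset

class ReplayDataset(Dataset):
    """Loads pre-processed per-replay feature files (one JSON per game)."""

    def __init__(self, root: str):
        self.files = sorted(Path(root).glob("*.json"))

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, idx: int):
        record = json.loads(self.files[idx].read_text())
        features = torch.tensor(record["features"], dtype=torch.float32)
        label = torch.tensor(record["winner"], dtype=torch.long)   # e.g. 0/1 for player 1/2
        return features, label

# Usage (illustrative path):
# loader = torch.utils.data.DataLoader(ReplayDataset("preprocessed_replays/"),
#                                      batch_size=32, shuffle=True)
```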
AI Insights
  • SC2Tools bundles data extraction, state analysis, and visualization into a modular pipeline, enabling rapid replay parsing.
  • The dataset preparator normalizes raw SC2 logs into PyTorch tensors, yielding one of the largest tournament corpora yet released.
  • PyTorch Lightning wrappers expose data loaders, letting researchers plug any neural architecture with minimal boilerplate.
  • ggtracker/sc2reader integration makes the toolset reusable for other RTS titles.
  • The paper omits a detailed neural‑model description, leaving implementation choices to the reader.
  • Scalability limits and failure modes are not discussed, a gap for future work.
  • Authors hint that SC2Tools could help quantify mental‑health metrics in esports players, opening interdisciplinary research.
Fault tolerance
Abstract
Safety-critical systems use redundant input units to improve their reliability and fault tolerance. A voting logic is then used to select a reliable input from the redundant sources. Fault detection and isolation rules help in selecting input units that can participate in voting. This work deals with the formal requirement formulation, design, verification and synthesis of a generic voting unit for an N-modular redundant measurement system used for control applications in avionics systems. The work follows a correct-by-construction approach, using the Rocq theorem prover.
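To ground the idea informally, here is an executable Python sketch of median voting with a deviation-based fault isolation rule. The threshold and isolation policy are illustrative assumptions and carry none of the formal guarantees of the Rocq-verified design.

```python
# Informal N-modular-redundancy voter: median vote plus deviation-based isolation.
from statistics import median

def vote(readings: dict[str, float], faulty: set[str], tolerance: float = 0.5) -> float:
    """Select a consolidated value from N redundant channels.

    readings: channel name -> latest measurement
    faulty:   channels already isolated; updated in place
    """
    healthy = {ch: v for ch, v in readings.items() if ch not in faulty}
    if not healthy:
        raise RuntimeError("no healthy channels left to vote on")
    voted = median(healthy.values())
    # Fault detection and isolation: a channel straying too far from the
    # voted value is excluded from future votes.
    for ch, v in healthy.items():
        if abs(v - voted) > tolerance:
            faulty.add(ch)
    return voted

faulty: set[str] = set()
print(vote({"A": 10.1, "B": 10.0, "C": 14.7}, faulty))  # -> 10.1, channel C isolated
print(sorted(faulty))                                    # -> ['C']
```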
Abstract
Resilience in optical networks has traditionally relied on redundancy and pre-planned recovery strategies, both of which assume a certain level of disaster predictability. However, recent environmental changes such as climate shifts, the evolution of communication services, and rising geopolitical risks have increased the unpredictability of disasters, reducing the effectiveness of conventional resilience approaches. To address this unpredictability, this paper introduces the concept of agile resilience, which emphasizes dynamic adaptability across multiple operators and layers. We identify key requirements and challenges, and present enabling technologies for the realization of agile resilience. Using a field-deployed transmission system, we demonstrate rapid system characterization, optical path provisioning, and database migration within six hours. These results validate the effectiveness of the proposed enabling technologies and confirm the feasibility of agile resilience.
Machine Learning Validation
North Carolina A&T State University
Abstract
Data-driven models, especially deep learning classifiers often demonstrate great success on clean datasets. Yet, they remain vulnerable to common data distortions such as adversarial and common corruption perturbations. These perturbations can significantly degrade performance, thereby challenging the overall reliability of the models. Traditional robustness validation typically relies on perturbed test datasets to assess and improve model performance. In our framework, however, we propose a validation approach that extracts "weak robust" samples directly from the training dataset via local robustness analysis. These samples, being the most susceptible to perturbations, serve as an early and sensitive indicator of the model's vulnerabilities. By evaluating models on these challenging training instances, we gain a more nuanced understanding of its robustness, which informs targeted performance enhancement. We demonstrate the effectiveness of our approach on models trained with CIFAR-10, CIFAR-100, and ImageNet, highlighting how robustness validation guided by weak robust samples can drive meaningful improvements in model reliability under adversarial and common corruption scenarios.
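One way to picture the "weak robust" sample extraction is the following hedged sketch, which ranks training inputs by the smallest FGSM step that flips their prediction; this selection rule is an illustrative stand-in for the paper's local robustness analysis, not its exact procedure.

```python
# Rank training samples by how little FGSM perturbation flips the prediction.
import torch
import torch.nn.functional as F

def weak_robust_indices(model, x, y, eps_grid=(0.5, 1.0, 2.0, 4.0), top_k=100):
    """Return indices of samples whose prediction flips at the smallest eps."""
    model.eval()
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    grad_sign = x.grad.sign().detach()

    flip_eps = torch.full((x.shape[0],), float("inf"))
    with torch.no_grad():
        for eps in eps_grid:                        # smallest eps that changes the label
            preds = model(x + eps * grad_sign).argmax(dim=1)
            newly_flipped = (preds != y) & torch.isinf(flip_eps)
            flip_eps[newly_flipped] = eps
    return torch.argsort(flip_eps)[:top_k]          # most fragile samples first

# Toy classifier and random data as stand-ins for a real model and dataset.
model = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 5))
x, y = torch.randn(512, 20), torch.randint(0, 5, (512,))
print(weak_robust_indices(model, x, y)[:10])
```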
Machine Learning Infrastructure
Abstract
Machine learning interatomic potentials (MLIPs) have become powerful tools to extend molecular simulations beyond the limits of quantum methods, offering near-quantum accuracy at much lower computational cost. Yet, developing reliable MLIPs remains difficult because it requires generating high-quality datasets, preprocessing atomic structures, and carefully training and validating models. In this work, we introduce an Automated Machine Learning Pipeline (AMLP) that unifies the entire workflow from dataset creation to model validation. AMLP employs large-language-model agents to assist with electronic-structure code selection, input preparation, and output conversion, while its analysis suite (AMLP-Analysis), based on ASE, supports a range of molecular simulations. The pipeline is built on the MACE architecture and validated on acridine polymorphs, where, with a straightforward fine-tuning of a foundation model, mean absolute errors of ~1.7 meV/atom in energies and ~7.0 meV/Ć… in forces are achieved. The fitted MLIP reproduces DFT geometries with sub-Ć… accuracy and demonstrates stability during molecular dynamics simulations in the microcanonical and canonical ensembles.
UC Berkeley, LBNL, KAIST
Abstract
Machine learning interatomic potentials (MLIPs) have revolutionized molecular and materials modeling, but existing benchmarks suffer from data leakage, limited transferability, and an over-reliance on error-based metrics tied to specific density functional theory (DFT) references. We introduce MLIP Arena, a benchmark platform that evaluates force field performance based on physics awareness, chemical reactivity, stability under extreme conditions, and predictive capabilities for thermodynamic properties and physical phenomena. By moving beyond static DFT references and revealing the important failure modes of current foundation MLIPs in real-world settings, MLIP Arena provides a reproducible framework to guide the next-generation MLIP development toward improved predictive accuracy and runtime efficiency while maintaining physical consistency. The Python package and online leaderboard are available at https://github.com/atomind-ai/mlip-arena.
Online inference
University of Chinese Academy of Sciences
Abstract
This paper develops a new framework for indirect statistical inference with guaranteed necessity and sufficiency, applicable to continuous random variables. We prove that when comparing exponentially transformed order statistics from an assumed distribution with those from simulated unit exponential samples, the ranked quotients exhibit distinct asymptotics: the left segment converges to a non-degenerate distribution, while the middle and right segments degenerate to one. This yields a necessary and sufficient condition in probability for two sequences of continuous random variables to follow the same distribution. Building on this, we propose an optimization criterion based on relative errors between ordered samples. The criterion achieves its minimum if and only if the assumed and true distributions coincide, providing a second necessary and sufficient condition in optimization. These dual NS properties, rare in the literature, establish a fundamentally stronger inference framework than existing methods. Unlike classical approaches based on absolute errors (e.g., Kolmogorov-Smirnov), NSE exploits relative errors to ensure faster convergence, requires only mild approximability of the cumulative distribution function, and provides both point and interval estimates. Simulations and real-data applications confirm NSE's superior performance in preserving distributional assumptions where traditional methods fail.
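One plausible reading of the criterion, sketched with heavy simplification: transform the data through the assumed CDF so that, under a correct assumption, the values behave like unit-exponential order statistics, then score the assumption by relative errors against a simulated exponential sample. The left-segment trim and the toy location-only model below are my simplifications, not the paper's construction.

```python
# Rough sketch of a relative-error criterion over exponentially transformed order statistics.
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=2.0, size=500)        # sample from the "true" distribution N(3, 2)
exp_ref = np.sort(rng.exponential(size=data.size))     # simulated unit-exponential order statistics

def nse_criterion(mu: float) -> float:
    # Assumed model N(mu, 2): probability-integral transform, then -log(1 - U) ~ Exp(1) if correct.
    u = stats.norm.cdf(data, loc=mu, scale=2.0)
    t = np.sort(-np.log1p(-u))
    rel_err = np.abs(t - exp_ref) / exp_ref
    return float(np.mean(rel_err[25:]))                 # crude trim of the noisy left segment

best = minimize_scalar(nse_criterion, bounds=(0.0, 6.0), method="bounded")
print("estimated location:", round(best.x, 2))          # compare with the true location 3.0
```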
AI Insights
  • Theorem 2.5 proves that the gap between successive order statistics vanishes in probability, a cornerstone of the method.
  • The proof uses the Glivenko‑Cantelli theorem to bound empirical distribution errors, ensuring ranked quotients converge as claimed.
  • By applying the law of large numbers, the relative‑error criterion reaches its minimum only when the assumed and true distributions match.
  • Extreme‑value theory is invoked to model tail behavior, allowing the framework to detect discrepancies that Kolmogorov‑Smirnov tests overlook.
  • Durrett’s "Probability: Theory and Examples" and Van der Vaart’s "Asymptotic Statistics" supply the probabilistic machinery behind the convergence proofs.
  • The paper’s technical rigor may challenge non‑experts, yet the clear exposition of order‑statistic limits offers a rewarding learning experience.
Texas A&M University
Abstract
Inference in semi-supervised (SS) settings has gained substantial attention in recent years due to increased relevance in modern big-data problems. In a typical SS setting, there is a much larger-sized unlabeled data, containing only observations of predictors, and a moderately sized labeled data containing observations for both an outcome and the set of predictors. Such data naturally arises when the outcome, unlike the predictors, is costly or difficult to obtain. One of the primary statistical objectives in SS settings is to explore whether parameter estimation can be improved by exploiting the unlabeled data. We propose a novel Bayesian method for estimating the population mean in SS settings. The approach yields estimators that are both efficient and optimal for estimation and inference. The method itself has several interesting artifacts. The central idea behind the method is to model certain summary statistics of the data in a targeted manner, rather than the entire raw data itself, along with a novel Bayesian notion of debiasing. Specifying appropriate summary statistics crucially relies on a debiased representation of the population mean that incorporates unlabeled data through a flexible nuisance function while also learning its estimation bias. Combined with careful usage of sample splitting, this debiasing approach mitigates the effect of bias due to slow rates or misspecification of the nuisance parameter from the posterior of the final parameter of interest, ensuring its robustness and efficiency. Concrete theoretical results, via Bernstein--von Mises theorems, are established, validating all claims, and are further supported through extensive numerical studies. To our knowledge, this is possibly the first work on Bayesian inference in SS settings, and its central ideas also apply more broadly to other Bayesian semi-parametric inference problems.
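The debiased representation at the heart of the method has a simple frequentist analogue that is easy to sketch: estimate E[f(X)] on the unlabeled pool and correct it with the labeled residual mean, fitting the nuisance f on a separate split. The Bayesian treatment of these summary statistics, which is the paper's actual contribution, is not reproduced here; all data and model choices below are illustrative.

```python
# Frequentist analogue of the debiased semi-supervised mean estimator with sample splitting.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n_lab, n_unlab = 300, 20000
X_lab = rng.normal(size=(n_lab, 3))
Y_lab = X_lab @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n_lab)   # labeled outcomes
X_unlab = rng.normal(size=(n_unlab, 3))                                # predictors only

# Sample splitting: fit the nuisance f on one half of the labeled data ...
half = n_lab // 2
f = GradientBoostingRegressor().fit(X_lab[:half], Y_lab[:half])

# ... and form the debiased estimate on the held-out half plus the unlabeled pool.
mu_hat = f.predict(X_unlab).mean() + (Y_lab[half:] - f.predict(X_lab[half:])).mean()
naive = Y_lab.mean()                                # labeled-only estimator for comparison
print(f"debiased SS estimate: {mu_hat:.3f}   labeled-only mean: {naive:.3f}")
```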
AI Insights
  • The proof shows posterior concentration around the true mean via the non‑parametric consistency condition (NPCC).
  • NPCC ensures that for any ε>0, posterior mass within ε of truth tends to one, establishing asymptotic unbiasedness.
  • The law of iterated expectation decomposes the bias, proving its vanishing as sample size grows.
  • Bernstein–von Mises theorems justify a normal posterior approximation, guaranteeing efficiency despite a flexible nuisance function.
  • Sample splitting isolates bias from slow‑rate nuisance estimation, keeping the final posterior robust.
  • Key literature, "Consistency of Posterior Distributions" and "Non‑Parametric Consistency Conditions for Bayesian Estimators", provides the theoretical backbone.
  • Together, the argument confirms the Bayesian SS estimator is asymptotically unbiased, consistent, and theoretically sound.
Machine Learning Testing
Abstract
Context: Deep Neural Networks (DNNs) are increasingly deployed in critical applications, where resilience against adversarial inputs is paramount. However, whether coverage-based or confidence-based, existing test prioritization methods often fail to efficiently identify the most fault-revealing inputs, limiting their practical effectiveness. Aims: This project aims to enhance fault detection and model robustness in DNNs by integrating Learning-Based Testing (LBT) with hypothesis and mutation testing to efficiently prioritize adversarial test cases. Methods: Our method selects a subset of adversarial inputs with a high likelihood of exposing model faults, without relying on architecture-specific characteristics or formal verification, making it adaptable across diverse DNNs. Results: Our results demonstrate that the proposed LBT method consistently surpasses baseline approaches in prioritizing fault-revealing inputs and accelerating fault detection. By efficiently organizing test permutations, it uncovers all potential faults significantly faster across various datasets, model architectures, and adversarial attack techniques. Conclusion: Beyond improving fault detection, our method preserves input diversity and provides effective guidance for model retraining, further enhancing robustness. These advantages establish our approach as a powerful and practical solution for adversarial test prioritization in real-world DNN applications.

Interests not found

We did not find any papers matching the interests below. Try other terms, and consider whether such content exists on arxiv.org.
  • MLOps
You can edit or add more interests any time.
