Hi j34nc4rl0+mlops,

Here are our personalized paper recommendations for you, sorted by relevance.
Fault tolerance
Abstract
Proving threshold theorems for fault-tolerant quantum computation is a burdensome endeavor with many moving parts that come together in relatively formulaic but lengthy ways. It is difficult and rare to combine elements from multiple papers into a single formal threshold proof, due to the use of different measures of fault-tolerance. In this work, we introduce composable fault-tolerance, a framework that decouples the probabilistic analysis of the noise distribution from the combinatorial analysis of circuit correctness, and enables threshold proofs to compose independently analyzed gadgets easily and rigorously. Within this framework, we provide a library of standard and commonly used gadgets such as memory and logic implemented by constant-depth circuits for quantum low-density parity check codes and distillation. As sample applications, we explicitly write down a threshold proof for computation with the surface code and re-derive the constant space-overhead fault-tolerant scheme of Gottesman using gadgets from this library. We expect that future fault-tolerance proofs may focus on the analysis of novel techniques while leaving the standard components to the composable fault-tolerance framework, with the formal proof following the intuitive "napkin math" exactly.
Abstract
Deep Neural Networks (DNNs) have achieved great success in solving a wide range of machine learning problems. Recently, they have been deployed in datacenters (potentially for business-critical or industrial applications) and in safety-critical systems such as self-driving cars. Their correct functionality in the presence of potential bit-flip errors on DNN parameters stored in memory therefore plays a key role in their applicability to safety-critical applications. In this paper, a fault tolerance approach based on Error Correcting Codes (ECC), called SPW, is proposed to ensure the correct functionality of DNNs in the presence of bit-flip faults. In the proposed approach, an error is detected by the stored ECC; it is then corrected in the case of a single-bit error, and otherwise the affected weight is set entirely to zero (i.e. masked). A statistical fault injection campaign is proposed and used to investigate the efficacy of the approach. The experimental results show that the accuracy of the DNN increases by more than 300% at a Bit Error Rate of 10^(-1) in comparison to the case where the ECC technique alone is applied, at the expense of just 47.5% area overhead.
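As a rough illustration of the detect-correct-or-mask policy described above, the sketch below simulates its effect on 8-bit quantized weights; the code that corrects one flipped bit and detects more, the weight format, and all parameters are illustrative assumptions rather than the paper's implementation.

import numpy as np

# Sketch of an SPW-style policy: single-bit errors are corrected by the ECC,
# while words with uncorrectable multi-bit errors are masked to zero.
# Assumptions: 8-bit quantized weights and a code that corrects 1 and detects >1
# flipped bits per word; the paper's exact ECC layout is not reproduced here.
rng = np.random.default_rng(0)

def apply_spw(weights, bit_error_rate):
    flips = rng.random((weights.size, 8)) < bit_error_rate   # which of the 8 bits flip
    n_flips = flips.sum(axis=1)
    out = weights.copy()
    # 0 or 1 flips: the stored ECC restores the original value, so nothing changes.
    # 2+ flips: detected but uncorrectable, so the weight is masked to zero.
    out[n_flips > 1] = 0
    return out

weights = rng.integers(-128, 128, size=10_000).astype(np.int8)
protected = apply_spw(weights, bit_error_rate=1e-1)
print("fraction of weights altered by masking:", float(np.mean(protected != weights)))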
Machine Learning Lifecycle
Abstract
Today, two major trends are shaping the evolution of ML systems. First, modern AI systems are becoming increasingly complex, often integrating components beyond the model itself. A notable example is Retrieval-Augmented Generation (RAG), which incorporates not only multiple models but also vector databases, leading to heterogeneity in both system components and underlying hardware. Second, with the end of Moore's Law, achieving high system efficiency is no longer feasible without accounting for the rapid evolution of the hardware landscape. Building on the observations above, this thesis adopts a cross-stack approach to improving ML system efficiency, presenting solutions that span algorithms, systems, and hardware. First, it introduces several pioneering works about RAG serving efficiency across the computing stack. PipeRAG focuses on algorithm-level improvements, RAGO introduces system-level optimizations, and Chameleon explores heterogeneous accelerator systems for RAG. Second, this thesis investigates algorithm-hardware co-design for vector search. Specifically, FANNS and Falcon optimize quantization-based and graph-based vector search, the two most popular paradigms of retrieval algorithms. Third, this thesis addresses the serving efficiency of recommender systems, another example of vector-centric ML systems, where the memory-intensive lookup operations on embedding vector tables often represent a major performance bottleneck. MicroRec and FleetRec propose solutions at the hardware and system levels, respectively, optimizing both data movement and computation to enhance the efficiency of large-scale recommender models.
Abstract
Machine learning (ML) systems are increasingly deployed in high-stakes domains where reliability is paramount. This thesis investigates how uncertainty estimation can enhance the safety and trustworthiness of ML, focusing on selective prediction -- where models abstain when confidence is low. We first show that a model's training trajectory contains rich uncertainty signals that can be exploited without altering its architecture or loss. By ensembling predictions from intermediate checkpoints, we propose a lightweight, post-hoc abstention method that works across tasks, avoids the cost of deep ensembles, and achieves state-of-the-art selective prediction performance. Crucially, this approach is fully compatible with differential privacy (DP), allowing us to study how privacy noise affects uncertainty quality. We find that while many methods degrade under DP, our trajectory-based approach remains robust, and we introduce a framework for isolating the privacy-uncertainty trade-off. Next, we develop a finite-sample decomposition of the selective classification gap -- the deviation from the oracle accuracy-coverage curve -- identifying five interpretable error sources and clarifying which interventions can close the gap. This explains why calibration alone cannot fix ranking errors, motivating methods that improve uncertainty ordering. Finally, we show that uncertainty signals can be adversarially manipulated to hide errors or deny service while maintaining high accuracy, and we design defenses combining calibration audits with verifiable inference. Together, these contributions advance reliable ML by improving, evaluating, and safeguarding uncertainty estimation, enabling models that not only make accurate predictions but also know when to say "I do not know".
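A minimal sketch of the checkpoint-ensembling idea described above, under assumed names (build_model, the checkpoint paths, and the threshold tau are placeholders, not the thesis's code): average softmax outputs from intermediate checkpoints of a single training run and abstain when the ensemble's confidence is low.

import torch

# Sketch: post-hoc selective prediction by ensembling intermediate checkpoints.
# Assumptions: build_model() reconstructs the architecture, ckpt_paths point to
# saved state_dicts from one training run, and tau is tuned on validation data.
def checkpoint_ensemble_predict(build_model, ckpt_paths, x, tau=0.9):
    probs = []
    for path in ckpt_paths:
        model = build_model()
        model.load_state_dict(torch.load(path, map_location="cpu"))
        model.eval()
        with torch.no_grad():
            probs.append(torch.softmax(model(x), dim=-1))
    mean_probs = torch.stack(probs).mean(dim=0)   # average over checkpoints
    conf, pred = mean_probs.max(dim=-1)           # confidence = max class probability
    return pred, conf, conf < tau                 # abstain where confidence < tau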
Data Science Development Tools
Abstract
In recent years, Large Language Models (LLMs) have emerged as transformative tools across numerous domains, impacting how professionals approach complex analytical tasks. This systematic mapping study comprehensively examines the application of LLMs throughout the Data Science lifecycle. By analyzing relevant papers from Scopus and IEEE databases, we identify and categorize the types of LLMs being applied, the specific stages and tasks of the data science process they address, and the methodological approaches used for their evaluation. Our analysis includes a detailed examination of evaluation metrics employed across studies and systematically documents both positive contributions and limitations of LLMs when applied to data science workflows. This mapping provides researchers and practitioners with a structured understanding of the current landscape, highlighting trends, gaps, and opportunities for future research in this rapidly evolving intersection of LLMs and data science.
Abstract
Efficient and effective data discovery is critical for many modern applications in machine learning and data science. One major bottleneck to the development of a general-purpose data discovery tool is the absence of an expressive formal language, and corresponding implementation, for characterizing and solving generic discovery queries. To this end, we present TQL, a domain-specific language for data discovery designed to leverage the results of programming languages research in both its syntax and semantics. In this paper, we fully and formally characterize the core language through an algebraic model, Imperative Relational Algebra with Types (ImpRAT), and implement a modular proof-of-concept system prototype.
Machine Learning Operations
Abstract
Data minimization (DM) describes the principle of collecting only the data strictly necessary for a given task. It is a foundational principle across major data protection regulations like GDPR and CPRA. Violations of this principle have substantial real-world consequences, with regulatory actions resulting in fines reaching hundreds of millions of dollars. Notably, the relevance of data minimization is particularly pronounced in machine learning (ML) applications, which typically rely on large datasets, resulting in an emerging research area known as Data Minimization in Machine Learning (DMML). At the same time, existing work on other ML privacy and security topics often addresses concerns relevant to DMML without explicitly acknowledging the connection. This disconnect leads to confusion among practitioners, complicating their efforts to implement DM principles and interpret the terminology, metrics, and evaluation criteria used across different research communities. To address this gap, our work introduces a comprehensive framework for DMML, including a unified data pipeline, adversaries, and points of minimization. This framework allows us to systematically review the literature on data minimization and DM-adjacent methodologies, for the first time presenting a structured overview designed to help practitioners and researchers effectively apply DM principles. Our work facilitates a unified DM-centric understanding and broader adoption of data minimization strategies in AI/ML.
Abstract
The ALP Automatic Computing Algorithm, ALPaca, is an open source Python library devoted to studying the phenomenology of Axion-Like Particles (ALPs) with masses in the range $m_a \in [0.01, 10]$ GeV. ALPaca provides a flexible and comprehensive framework to define ALP couplings at arbitrary energy scales, perform Renormalisation Group evolution and matching down to the desired low energy scale, and compute a large variety of ALP observables, with particular attention to the meson decay sector. The package includes support for UV completions, experimental constraints, and visualisation tools, enabling both detailed analyses and broad parameter space exploration.
Data Science Development Environment and Productivity
Abstract
Although Artificial Intelligence (AI) holds great promise for enhancing innovation and productivity, many firms struggle to realize its benefits. We investigate why some firms and industries succeed with AI while others do not, focusing on the degree to which an industrial domain is technologically integrated with AI, which we term "domain AI readiness". Using panel data on Chinese listed firms from 2016 to 2022, we examine how the interaction between firm-level AI capabilities and domain AI readiness affects firm performance. We create novel constructs from patent data and measure the domain AI readiness of a specific domain by analyzing the co-occurrence of four-digit International Patent Classification (IPC4) codes related to AI with the specific domain across all patents in that domain. Our findings reveal a strong complementarity: AI capabilities yield greater productivity and innovation gains when deployed in domains with higher AI readiness, whereas benefits are limited in domains that are technologically unprepared or already obsolete. These results remain robust when using local AI policy initiatives as instrumental variables. Further analysis shows that this complementarity is driven by external advances in domain-AI integration, rather than firms' own strategic pivots. Time-series analysis of IPC4 co-occurrence patterns further suggests that improvements in domain AI readiness stem primarily from the academic advancements of AI in specific domains.
Machine Learning Validation
Abstract
Rigorous statistical methods, including parameter estimation with accompanying uncertainties, underpin the validity of scientific discovery, especially in the natural sciences. With increasingly complex data models such as deep learning techniques, uncertainty quantification has become exceedingly difficult and a plethora of techniques have been proposed. In this case study, we use the unifying framework of approximate Bayesian inference combined with empirical tests on carefully created synthetic classification datasets to investigate qualitative properties of six different probabilistic machine learning algorithms for class probability and uncertainty estimation: (i) a neural network ensemble, (ii) neural network ensemble with conflictual loss, (iii) evidential deep learning, (iv) a single neural network with Monte Carlo Dropout, (v) Gaussian process classification and (vi) a Dirichlet process mixture model. We check if the algorithms produce uncertainty estimates which reflect commonly desired properties, such as being well calibrated and exhibiting an increase in uncertainty for out-of-distribution data points. Our results indicate that all algorithms are well calibrated, but none of the deep learning based algorithms provide uncertainties that consistently reflect lack of experimental evidence for out-of-distribution data points. We hope our study may serve as a clarifying example for researchers developing new methods of uncertainty estimation for scientific data-driven modeling.
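As a concrete example of one of the compared approaches, Monte Carlo Dropout keeps dropout stochastic at test time and averages several forward passes; the PyTorch sketch below is a generic illustration, not the study's code, and the number of passes T is an assumption.

import torch
import torch.nn as nn

# Sketch of Monte Carlo Dropout: keep dropout layers active at inference,
# average T stochastic forward passes, and use predictive entropy as uncertainty.
def mc_dropout_predict(model: nn.Module, x: torch.Tensor, T: int = 50):
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.Dropout):             # re-enable only the dropout layers
            m.train()
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(T)])
    mean_probs = probs.mean(dim=0)
    entropy = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=-1)
    return mean_probs, entropy                    # higher entropy = more uncertain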
Machine Learning Testing
Abstract
In targeted adversarial attacks on vision models, the selection of the target label is a critical yet often overlooked determinant of attack success. This target label corresponds to the class that the attacker aims to force the model to predict. Existing strategies typically rely on randomness, model predictions, or static semantic resources, limiting interpretability, reproducibility, or flexibility. This paper proposes a semantics-guided framework for adversarial target selection that uses cross-modal knowledge transfer from pretrained language and vision-language models. We evaluate several state-of-the-art models (BERT, TinyLLAMA, and CLIP) as similarity sources to select the most and least semantically related labels with respect to the ground truth, forming best- and worst-case adversarial scenarios. Our experiments on three vision models and five attack methods reveal that these models consistently yield practical adversarial targets and surpass static lexical databases, such as WordNet, particularly for distant class relationships. We also observe that static testing of target labels offers a preliminary, a priori assessment of the effectiveness of similarity sources. Our results corroborate the suitability of pretrained models for constructing interpretable, standardized, and scalable adversarial benchmarks across architectures and datasets.
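A hedged sketch of the target-selection step using CLIP's text encoder as the similarity source (the label names and checkpoint are illustrative; the paper's full pipeline, including the BERT and TinyLLAMA sources and the attacks themselves, is not reproduced here).

import torch
from transformers import CLIPModel, CLIPProcessor

# Sketch: pick best-/worst-case adversarial target labels by semantic similarity
# to the ground-truth class, using CLIP text embeddings as the similarity source.
labels = ["tabby cat", "tiger", "school bus", "golden retriever", "sports car"]
true_label = "tabby cat"

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
inputs = processor(text=[true_label] + labels, return_tensors="pt", padding=True)
with torch.no_grad():
    feats = model.get_text_features(**inputs)
feats = feats / feats.norm(dim=-1, keepdim=True)           # unit-normalise for cosine similarity

sims = feats[1:] @ feats[0]                                 # similarity of each label to the ground truth
candidates = [(s.item(), l) for s, l in zip(sims, labels) if l != true_label]
best_case = max(candidates)[1]                              # most semantically related target
worst_case = min(candidates)[1]                             # least semantically related target
print(best_case, worst_case)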
Model Monitoring
Abstract
The rapid emergence of pretrained models (PTMs) has attracted significant attention from both Deep Learning (DL) researchers and downstream application developers. However, selecting appropriate PTMs remains challenging because existing methods typically rely on keyword-based searches in which the keywords are often derived directly from function descriptions. This often fails to fully capture user intent and makes it difficult to identify suitable models when developers also consider factors such as bias mitigation, hardware requirements, or license compliance. To address the limitations of keyword-based model search, we propose PTMPicker to accurately identify suitable PTMs. We first define a structured template composed of common and essential attributes of PTMs; PTMPicker then represents both candidate models and user-intended features (i.e., model search requests) in this unified format. To determine whether candidate models satisfy user requirements, it computes embedding similarities for function-related attributes and uses well-crafted prompts to evaluate special constraints such as license compliance and hardware requirements. We scraped a total of 543,949 pretrained models from Hugging Face to prepare valid candidates for selection. PTMPicker then represented them in the predefined structured format by extracting their associated descriptions. Guided by the extracted metadata, we synthesized a total of 15,207 model search requests with carefully designed prompts, as no such search requests are readily available. Experiments on the curated PTM dataset and the synthesized model search requests show that PTMPicker can help users effectively identify models, with 85% of the sampled requests successfully locating appropriate PTMs within the top-10 ranked candidates.
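A rough sketch of the function-attribute matching step (the attribute template, the sentence encoder, and the example cards are assumptions; the prompt-based checks for license and hardware constraints are omitted).

from sentence_transformers import SentenceTransformer

# Sketch: represent model cards and a search request in a shared attribute template,
# embed the function-related attribute, and rank candidates by cosine similarity.
# Assumptions: attribute names, the encoder, and the example cards are illustrative.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def rank_models(request, model_cards, k=10):
    q = encoder.encode(request["function"], normalize_embeddings=True)
    scores = [(float(encoder.encode(c["function"], normalize_embeddings=True) @ q), c["name"])
              for c in model_cards]
    return [name for _, name in sorted(scores, reverse=True)[:k]]

cards = [{"name": "model-a", "function": "English sentiment classification"},
         {"name": "model-b", "function": "semantic segmentation of street scenes"}]
print(rank_models({"function": "classify the sentiment of customer reviews"}, cards, k=2))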
Abstract
Evaluations of dangerous AI capabilities are important for managing catastrophic risks. Public transparency into these evaluations - including what they test, how they are conducted, and how their results inform decisions - is crucial for building trust in AI development. We propose STREAM (A Standard for Transparently Reporting Evaluations in AI Model Reports), a standard to improve how model reports disclose evaluation results, initially focusing on chemical and biological (ChemBio) benchmarks. Developed in consultation with 23 experts across government, civil society, academia, and frontier AI companies, this standard is designed to (1) be a practical resource to help AI developers present evaluation results more clearly, and (2) help third parties identify whether model reports provide sufficient detail to assess the rigor of the ChemBio evaluations. We concretely demonstrate our proposed best practices with "gold standard" examples, and also provide a three-page reporting template to enable AI developers to implement our recommendations more easily.
Machine Learning Deployment
Abstract
Urban parks can mitigate local heat, yet irrigation control is usually tuned for water savings rather than cooling. We report on SIMPaCT (Smart Irrigation Management for Parks and Cool Towns), a park-scale deployment that links per-zone soil-moisture forecasts to overnight irrigation set-points in support of urban cooling. SIMPaCT ingests data from 202 soil-moisture sensors, 50 temperature-relative humidity (TRH) nodes, and 13 weather stations, and trains a per-sensor k-nearest neighbours (kNN) predictor on short rolling windows (200-900h). A rule-first anomaly pipeline screens missing and stuck-at signals, with model-based checks (Isolation Forest and ARIMA). When a device fails, a mutual-information neighbourhood selects the most informative neighbour and a small multilayer perceptron supplies a "virtual sensor" until restoration. Across sensors the mean absolute error was 0.78%, comparable to more complex baselines; the upper-quartile error (P75) was lower for kNN than SARIMA (0.71% vs 0.93%). SIMPaCT runs daily and writes proposed set-points to the existing controller for operator review. This short communication reports an operational recipe for robust, cooling-oriented irrigation at city-park scale.
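A minimal sketch of the per-sensor forecasting step (window length, lag features, and k are illustrative assumptions; the anomaly screening and virtual-sensor components are not shown).

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Sketch: a per-sensor kNN forecaster fit on a short rolling window of lagged
# soil-moisture readings, predicting the next hourly value.
def fit_predict_next(series, window=200, lags=6, k=5):
    recent = series[-window:]                                 # short rolling window (e.g. 200-900 h)
    X = np.stack([recent[i:i + lags] for i in range(len(recent) - lags)])
    y = recent[lags:]
    model = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    return float(model.predict(recent[-lags:][None, :])[0])

rng = np.random.default_rng(0)
hourly_moisture = 30 + 5 * np.sin(np.arange(1000) * 2 * np.pi / 24) + rng.normal(0, 0.5, 1000)
print(fit_predict_next(hourly_moisture))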
Online inference
Abstract
Selective prediction [Dru13, QV19] models the scenario where a forecaster freely decides on the prediction window that their forecast spans. Many data statistics can be predicted to a non-trivial error rate without any distributional assumptions or expert advice, yet these results rely on the assumption that the forecaster may predict at any time. We introduce a model of Prediction with Limited Selectivity (PLS) where the forecaster can start the prediction only on a subset of the time horizon. We study the optimal prediction error both on an instance-by-instance basis and via an average-case analysis. We introduce a complexity measure that gives instance-dependent bounds on the optimal error. For a randomly-generated PLS instance, these bounds match with high probability.
Abstract
Given a dataset consisting of a single realization of a network, we consider conducting inference on a parameter selected from the data. In particular, we focus on the setting where the parameter of interest is a linear combination of the mean connectivities within and between estimated communities. Inference in this setting poses a challenge, since the communities are themselves estimated from the data. Furthermore, since only a single realization of the network is available, sample splitting is not possible. In this paper, we show that it is possible to split a single realization of a network consisting of $n$ nodes into two (or more) networks involving the same $n$ nodes; the first network can be used to select a data-driven parameter, and the second to conduct inference on that parameter. In the case of weighted networks with Poisson or Gaussian edges, we obtain two independent realizations of the network; by contrast, in the case of Bernoulli edges, the two realizations are dependent, and so extra care is required. We establish the theoretical properties of our estimators, in the sense of confidence intervals that attain the nominal (selective) coverage, and demonstrate their utility in numerical simulations and in application to a dataset representing the relationships among dolphins in Doubtful Sound, New Zealand.
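For the Poisson-edge case, one standard way to obtain two independent realizations as described above is binomial thinning of each edge weight; the sketch below assumes a symmetric weighted adjacency matrix with no self-loops and is an illustration rather than the paper's code.

import numpy as np

# Sketch: split one Poisson-weighted network into two independent ones by thinning.
# If A_ij ~ Poisson(L_ij) and A1_ij | A_ij ~ Binomial(A_ij, eps), then
# A1_ij ~ Poisson(eps * L_ij) and A2 = A - A1 ~ Poisson((1 - eps) * L_ij), independently.
rng = np.random.default_rng(1)

def split_poisson_network(A, eps=0.5):
    A1 = np.zeros_like(A)
    upper = np.triu_indices_from(A, k=1)        # thin each undirected edge exactly once
    A1[upper] = rng.binomial(A[upper], eps)
    A1 = A1 + A1.T
    return A1, A - A1                           # A1: select the parameter; A2: run inference

n = 30
A = np.triu(rng.poisson(2.0, size=(n, n)), k=1)
A = A + A.T                                     # symmetric Poisson-weighted network, no self-loops
A_select, A_infer = split_poisson_network(A)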
Machine Learning Resilience
Abstract
We introduce the fragility spectrum, a quantitative framework to measure the resilience of model-theoretic properties (e.g., stability, NIP, NTP$_2$, decidability) under language expansions. The core is the fragility index $\operatorname{frag}(T, P \to Q)$, quantifying the minimal expansion needed to degrade from property $P$ to $Q$. We axiomatize fragility operators, prove stratification theorems, identify computational, geometric, and combinatorial collapse modes, and position it within Shelah's hierarchy. Examples include ACF$_0$ (infinite fragility for stability) and $\operatorname{Th}(\mathbb{Q}, +)$ (fragility 1 for $\omega$-stability). Connections to DOP, ranks, and external definability refine classifications. Extended proofs, applications to other logics, and open problems enhance the discourse.

Interests not found

We did not find any papers that match the interests below. Try other terms, and also consider whether the content exists on arxiv.org.
  • MLOps
  • Machine Learning Infrastructure
You can edit or add more interests any time.

Unsubscribe from these updates