🎯 Top Personalized Recommendations
AI Summary - Anthropomimetic Uncertainty: A measure of the uncertainty expressed by a model in its output, which can be used to evaluate its reliability and trustworthiness. [3]
- The article discusses the limitations and challenges of Large Language Models (LLMs) in various applications. [2]
Abstract
Large language models (LLMs) are being rapidly integrated into decision-support tools, automation workflows, and AI-enabled software systems. However, their behavior in production environments remains poorly understood, and their failure patterns differ fundamentally from those of traditional machine learning models. This paper presents a system-level taxonomy of fifteen hidden failure modes that arise in real-world LLM applications, including multi-step reasoning drift, latent inconsistency, context-boundary degradation, incorrect tool invocation, version drift, and cost-driven performance collapse. Using this taxonomy, we analyze the growing gap in evaluation and monitoring practices: existing benchmarks measure knowledge or reasoning but provide little insight into stability, reproducibility, drift, or workflow integration. We further examine the production challenges associated with deploying LLMs, including observability limitations, cost constraints, and update-induced regressions. Finally, we outline high-level design principles for building reliable, maintainable, and cost-aware LLM-based systems. By framing LLM reliability as a system-engineering problem rather than a purely model-centric one, this work provides an analytical foundation for future research on evaluation methodology, AI system robustness, and dependable LLM deployment.
Why we think this paper is great for you:
This paper directly addresses the critical need for understanding and mitigating failures in AI systems, which is essential for building reliable applications in production environments. It offers a valuable taxonomy for ensuring the robustness of your deployed models.
AI Summary - Cross-validation: A method used to evaluate the performance of a model by training it on a subset of data and testing it on another subset. [3]
- Repeated double cross-validation: An extension of cross-validation that involves repeating the process multiple times with different subsets of data. [3]
- Performance metrics: Quantitative measures used to evaluate the performance of a model, such as accuracy, precision, and recall. [3]
- The article concludes that proper validation is crucial in chemometric models to avoid overfitting and ensure reliable results. [3]
- It highlights the limitations and potential biases of cross-validation methods. [2]
- The article discusses the importance of proper validation in chemometric models. [1]
Abstract
The validation of a data-driven model is the process of assessing the model's ability to generalize to new, unseen data in the population of interest. This paper proposes a set of general rules for model validation. These rules are designed to help practitioners create reliable validation plans and report their results transparently. While no validation scheme is flawless, these rules can help practitioners ensure their strategy is sufficient for practical use, openly discuss any limitations of their validation strategy, and report clear, comparable performance metrics.
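To make the repeated double (nested) cross-validation highlighted in the summary above concrete, here is a minimal sketch using scikit-learn; the dataset, estimator, hyperparameter grid, and number of repetitions are illustrative choices, not recommendations taken from the paper.

```python
# Minimal sketch of repeated double (nested) cross-validation with scikit-learn.
# The model, hyperparameter grid, dataset, and repetition count are illustrative.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

outer_scores = []
for repeat in range(5):  # repeat the whole double CV with different splits
    inner_cv = KFold(n_splits=5, shuffle=True, random_state=repeat)
    outer_cv = KFold(n_splits=5, shuffle=True, random_state=100 + repeat)

    # Inner loop selects hyperparameters; outer loop estimates generalization.
    model = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=inner_cv)
    scores = cross_val_score(model, X, y, cv=outer_cv, scoring="r2")
    outer_scores.extend(scores)

# Report mean and spread across all outer folds and repetitions.
print(f"R^2: {np.mean(outer_scores):.3f} +/- {np.std(outer_scores):.3f}")
```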
Why we think this paper is great for you:
This paper provides a foundational set of rules for assessing a model's ability to generalize, which is crucial for establishing reliable validation plans. You will find its guidance invaluable for ensuring the trustworthiness of your machine learning models.
AI Summary - The authors conduct experiments to assess the effectiveness of their automated annotation pipeline using three complementary analyses: agreement among MLLMs, accuracy relative to human annotations, and correctness of preference judgments produced by Qwen3-VL-Plus as an external oracle model. [3]
- The paper discusses a safety auditor and benchmark for multimodal systems, specifically multimodal large reasoning models, and their applications in various domains. [2]
Abstract
Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image-text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting strategies and refined via an MLRM- and human-based voting and verification pipeline. Furthermore, we propose a three-stage progressive training scheme combined with the data refinement process, enabling the model to learn nuanced and context-dependent safety preferences according to different risk levels. On our proposed test set covering both in-domain and out-of-domain scenarios, the GuardTrace-VL model achieves an F1 score of 93.1% on unsafe reasoning detection tasks, representing a 13.5% improvement in F1 score compared to the previous strongest multimodal safety defense methods. The code will be made publicly available.
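As a rough illustration of what auditing the full Question-Thinking-Answer pipeline looks like in practice, the sketch below scores each stage of a trace with a multimodal safety classifier. The `safety_auditor` function is a hypothetical placeholder, not the GuardTrace-VL API, which is not described at code level in the abstract.

```python
# Sketch of auditing a Question-Thinking-Answer (QTA) trace for unsafe content.
# `safety_auditor` is a hypothetical placeholder for a vision-aware guard model;
# it is NOT the GuardTrace-VL API.
from dataclasses import dataclass

@dataclass
class QTATrace:
    image_path: str
    question: str
    thinking: str  # intermediate reasoning emitted by the MLRM
    answer: str

def safety_auditor(image_path: str, text: str) -> float:
    """Return P(unsafe) for an image-text pair. Placeholder implementation."""
    return 0.0  # plug in a real multimodal safety classifier here

def audit(trace: QTATrace, threshold: float = 0.5) -> dict:
    # Score each stage separately so unsafe reasoning is flagged even when
    # the final answer looks benign (the gap the paper highlights).
    scores = {
        "question": safety_auditor(trace.image_path, trace.question),
        "thinking": safety_auditor(trace.image_path, trace.thinking),
        "answer": safety_auditor(trace.image_path, trace.answer),
    }
    return {"stage_scores": scores, "flagged": any(s >= threshold for s in scores.values())}

report = audit(QTATrace("scan.png", "What is shown?", "step-by-step reasoning...", "final answer"))
print(report["flagged"])
```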
Why we think this paper is great for you:
This work focuses on detecting unsafe reasoning in multimodal models, directly supporting your goal of deploying safe and resilient AI systems. It offers insights into monitoring and mitigating risks in complex production deployments.
Abstract
Deep Learning (DL) compilers have been widely utilized to optimize DL models for efficient deployment across various hardware. Due to their vital role in the DL ecosystem, ensuring their reliability and security is critical. However, existing approaches have limitations in testing optimization stages, the core functionality of DL compilers, due to the difficulty of generating optimization-aware tests. In this paper, we propose OATest, a novel approach for synthesizing optimization-aware computational graphs. The approach extracts patterns from documented optimization tests and incorporates them into seed computational graphs, enabling broader exploration of optimization paths. To guarantee the optimization-awareness of generated graphs, OATest introduces an edge-reusing strategy to establish strong connections between patterns and contexts. Additionally, to address the validity challenge for the generated graphs, OATest employs an auxiliary-layer addition strategy to resolve broken constraints. Equipped with two distinct test oracles, OATest applies differential testing to evaluate two widely used DL compilers (i.e., TVM and ONNX Runtime). Our experimental results show that OATest outperforms the state-of-the-art method by detecting more bugs and achieving higher code coverage in TVM and ONNX Runtime. Additionally, OATest uncovers 58 previously unknown bugs, 36 of which have been confirmed or fixed by developers.
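The differential-testing oracle at the core of this kind of work can be sketched in a few lines: run the same ONNX model through TVM and ONNX Runtime and compare outputs. This is not OATest itself, which synthesizes optimization-aware graphs to feed such an oracle; the model path, single input/output assumption, and tolerances below are illustrative, and a recent TVM with the Relay ONNX frontend is assumed.

```python
# Minimal differential-testing sketch: compile the same ONNX model with TVM and
# ONNX Runtime, run identical inputs, and compare outputs. Illustrative only;
# assumes a single-input, single-output model stored at "model.onnx".
import numpy as np
import onnx
import onnxruntime as ort
import tvm
from tvm import relay
from tvm.contrib import graph_executor

x = np.random.rand(1, 3, 224, 224).astype("float32")

# Reference result from ONNX Runtime.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
ref = sess.run(None, {input_name: x})[0]

# Result from TVM with optimizations enabled (opt_level=3 exercises graph passes).
onnx_model = onnx.load("model.onnx")
mod, params = relay.frontend.from_onnx(onnx_model, shape={input_name: list(x.shape)})
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)
runtime = graph_executor.GraphModule(lib["default"](tvm.cpu()))
runtime.set_input(input_name, x)
runtime.run()
out = runtime.get_output(0).numpy()

# Test oracle: any divergence beyond tolerance is a candidate optimization bug.
np.testing.assert_allclose(ref, out, rtol=1e-4, atol=1e-5)
```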
Why we think this paper is great for you:
Ensuring the reliability of deep learning compilers is vital for robust machine learning infrastructure, and this paper offers methods for generating effective tests. This directly contributes to your efforts in machine learning testing and building resilient systems.
AI Summary - The authors introduce two indicators, δ1 and δ2, to measure the impact of data contamination on model performance. [3]
- The study demonstrates that the proposed updated dataset consistently achieves lower variance compared to its counterparts, indicating more stable and robust evaluation metrics. [3]
- Data contamination: The inflated performance of a model on a specific dataset or benchmark due to the leakage of test data. [3]
- δ1: Measures the performance difference between the model's evaluation results after training solely on the test set and its zero-shot performance. [3]
- δ2: Compares the performance difference between models trained on both train and test sets versus those trained exclusively on the train set. [3]
- The updated dataset and indicators introduced in this work contribute to more reliable and robust evaluation metrics. [3]
- The paper proposes a novel framework for evaluating the performance of large language models (LLMs) in real-world scenarios, addressing the issue of data contamination. [2]
Abstract
Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training. Current studies attempt to mitigate this issue by modifying existing datasets or generating new ones from freshly collected information. However, these methods fall short of ensuring contamination-resilient evaluation, as they fail to fully eliminate pre-existing knowledge from models or preserve the semantic complexity of the original datasets. To address these limitations, we propose CoreEval, a Contamination-resilient Evaluation strategy for automatically updating data with real-world knowledge. This approach begins by extracting entity relationships from the original data and leveraging the GDELT database to retrieve relevant, up-to-date knowledge. The retrieved knowledge is then recontextualized and integrated with the original data, which is refined and restructured to ensure semantic coherence and enhanced task relevance. Ultimately, a robust data reflection mechanism is employed to iteratively verify and refine labels, ensuring consistency between the updated and original datasets. Extensive experiments on updated datasets validate the robustness of CoreEval, demonstrating its effectiveness in mitigating performance overestimation caused by data contamination.
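The δ1 and δ2 contamination indicators described in the summary above boil down to simple score differences. A minimal sketch follows, with placeholder scores; the exact evaluation protocol is defined in the paper.

```python
# Minimal sketch of the contamination indicators described in the summary above.
# The scores are placeholders; the paper defines the exact evaluation protocol.

def delta1(score_trained_on_test: float, score_zero_shot: float) -> float:
    """Gap between a model fine-tuned solely on the test set and its zero-shot score.
    A large value suggests the benchmark is easy to memorize (contamination-prone)."""
    return score_trained_on_test - score_zero_shot

def delta2(score_train_plus_test: float, score_train_only: float) -> float:
    """Gap between training on train+test versus train only.
    A large value indicates test-set leakage inflates measured performance."""
    return score_train_plus_test - score_train_only

# Illustrative numbers only:
print(delta1(0.91, 0.64))  # 0.27 -> evaluation likely contamination-sensitive
print(delta2(0.88, 0.79))  # 0.09
```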
Why we think this paper is great for you:
This research tackles data contamination to build more resilient datasets, which is fundamental for reliable LLM evaluation and overall model robustness. It provides practical insights for improving the integrity of your machine learning testing and validation processes.
AI Summary - The Logarithmic Overfitting Ratio (LOR) and Composite Overfitting Score (COS) are presented as essential metrics for evaluating model performance and detecting overfitting or underfitting.
- Logarithmic Overfitting Ratio (LOR): a metric that evaluates the difference between training and test errors, indicating overfitting or underfitting.
- Composite Overfitting Score (COS): a metric that combines the LOR with standard deviation to evaluate model performance and detect overfitting or underfitting.
- Prefer Monte Carlo Cross Validation (MC CV) over k-Fold Cross Validation for larger, stable datasets.
- Always report the LOR and COS values to facilitate model selection and avoid overfitting or underfitting.
- Use a uniform metric across models for fair comparison, such as MAE or accuracy.
- Highlight the best model(s) in bold or with visual cues to facilitate interpretation of results.
- Following these best practices helps select meaningful results, avoid overfitting or underfitting, and improve the overall quality of machine learning models.
- Report all findings, including training/test metrics and standard deviations, to provide a comprehensive understanding of model performance.
- Weakness: the paper assumes that readers are familiar with machine learning concepts and terminology. [3]
Abstract
Machine learning (ML) is increasingly adopted in scientific research, yet the quality and reliability of results often depend on how experiments are designed and documented. Poor baselines, inconsistent preprocessing, or insufficient validation can lead to misleading conclusions about model performance. This paper presents a practical and structured guide for conducting ML experiments in scientific applications, focussing on reproducibility, fair comparison, and transparent reporting. We outline a step-by-step workflow, from dataset preparation to model selection and evaluation, and propose metrics that account for overfitting and instability across validation folds, including the Logarithmic Overfitting Ratio (LOR) and the Composite Overfitting Score (COS). Through recommended practices and example reporting formats, this work aims to support researchers in establishing robust baselines and drawing valid evidence-based insights from ML models applied to scientific problems.
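The abstract names the Logarithmic Overfitting Ratio (LOR) and Composite Overfitting Score (COS) without giving formulas, so the sketch below uses assumed forms (log of the test-to-train error ratio, penalized by fold-to-fold instability) purely to illustrate overfitting-aware reporting across validation folds; the paper's own definitions should be used in practice.

```python
# Sketch of overfitting-aware reporting across validation folds.
# NOTE: the exact LOR/COS definitions are given in the paper; the formulas below
# are illustrative assumptions (log error ratio, penalized by fold instability).
import numpy as np

def logarithmic_overfitting_ratio(train_error: float, test_error: float) -> float:
    # Assumed form: log of test/train error; > 0 suggests overfitting, < 0 underfitting.
    return float(np.log(test_error / train_error))

def composite_overfitting_score(train_errors, test_errors) -> float:
    # Assumed form: mean LOR across folds plus the fold-to-fold standard deviation
    # of the test error, so unstable models are penalized as well.
    lors = [logarithmic_overfitting_ratio(tr, te) for tr, te in zip(train_errors, test_errors)]
    return float(np.mean(lors) + np.std(test_errors))

train_mae = [0.20, 0.21, 0.19, 0.22, 0.20]  # illustrative per-fold MAE
test_mae = [0.31, 0.35, 0.29, 0.40, 0.33]
print(round(composite_overfitting_score(train_mae, test_mae), 3))
```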
Why we think this paper is great for you:
This paper outlines best practices for designing and documenting ML experiments, directly enhancing the quality and reliability of your results. It offers guidance for improving your data science development environment and overall machine learning lifecycle.
Abstract
Real-world AI/ML workflows often apply inference computations to feature vectors joined from multiple datasets. To avoid the redundant AI/ML computations caused by repeated data records in the join's output, factorized ML has been proposed to decompose ML computations into sub-computations to be executed on each normalized dataset. However, there is insufficient discussion on how factorized ML could impact AI/ML inference over multi-way joins. To address this limitation, we propose a novel declarative InferF system, focusing on the factorization of arbitrary inference workflows represented as analyzable expressions over multi-way joins. We formalize our problem as flexibly pushing down partial factorized computations to qualified nodes in the join tree to minimize the overall inference computation and join costs, and propose two algorithms to solve it: (1) a greedy algorithm based on a per-node cost function that estimates the influence on overall latency if a subset of factorized computations is pushed to a node, and (2) a genetic algorithm for iteratively enumerating and evaluating promising factorization plans. We implement InferF on Velox, an open-source database engine from Meta, evaluate it on real-world datasets, observe up to 11.3x speedups, and systematically summarize the factors that determine when factorized ML can benefit AI/ML inference workflows.
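The core factorization idea, pushing partial inference computations down to the normalized tables instead of scoring every row of the materialized join, can be illustrated with a toy linear model over a two-table join. The tables, keys, and weights below are invented for illustration; InferF generalizes this to arbitrary inference expressions over multi-way joins.

```python
# Minimal sketch of factorized inference: push partial linear-model scores down
# to each normalized table, then combine them after the join, instead of
# recomputing features for every (repeated) row of the materialized join.
import pandas as pd

orders = pd.DataFrame({"cust_id": [1, 1, 2, 3], "amount": [10.0, 25.0, 5.0, 40.0]})
customers = pd.DataFrame({"cust_id": [1, 2, 3], "age": [34.0, 51.0, 23.0]})

w_amount, w_age, bias = 0.8, 0.1, -2.0  # illustrative linear-model weights

# Factorized step: score each table's contribution once per base row...
orders["partial"] = w_amount * orders["amount"]
customers["partial"] = w_age * customers["age"]

# ...so the join only has to add precomputed partials (customer 1's partial is
# computed once even though it appears in two joined rows).
joined = orders.merge(customers, on="cust_id", suffixes=("_o", "_c"))
joined["score"] = joined["partial_o"] + joined["partial_c"] + bias

print(joined[["cust_id", "score"]])
```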
Why we think this paper is great for you:
This paper explores optimizing AI/ML inference computations, which is key for efficient online inference and robust machine learning infrastructure. It provides valuable techniques for improving the performance and cost-effectiveness of your deployed models.
Data Science Development Tools
Abstract
The increasing availability of data and advancements in computational intelligence have accelerated the adoption of data-driven methods (DDMs) in product development. However, their integration into product development remains fragmented. This fragmentation stems from uncertainty, particularly the lack of clarity on what types of DDMs to use and when to employ them across the product development lifecycle. To address this, a necessary first step is to investigate the usage of DDMs in engineering design by identifying which methods are being used, at which development stages, and for what applications. This paper presents a PRISMA systematic literature review. The V-model as a product development framework was adopted and simplified into four stages: system design, system implementation, system integration, and validation. A structured search across Scopus, Web of Science, and IEEE Xplore (2014-2024) retrieved 1,689 records. After screening, 114 publications underwent full-text analysis. Findings show that machine learning (ML) and statistical methods dominate current practice, whereas deep learning (DL), though still less common, exhibits a clear upward trend in adoption. Additionally, supervised learning, clustering, regression analysis, and surrogate modeling are prevalent in the system design, implementation, and integration stages, but contributions to validation remain limited. Key challenges in existing applications include limited model interpretability, poor cross-stage traceability, and insufficient validation under real-world conditions. The review also highlights key limitations and opportunities, such as the need for interpretable hybrid models. This review is a first step toward design-stage guidelines; a follow-up synthesis should map computer science algorithms to engineering design problems and activities.
AI Summary - Data-Driven Methods (DDMs): approaches that utilize data to inform design decisions. [2]
- The use of data-driven methods in mechanical engineering design is increasing, with a focus on system integration and validation. [1]
Machine Learning Operations
Abstract
Neural operator learning methods have garnered significant attention in scientific computing for their ability to approximate infinite-dimensional operators. However, increasing their complexity often fails to substantially improve their accuracy, leaving them on par with much simpler approaches such as kernel methods and more traditional reduced-order models. In this article, we set out to address this shortcoming and introduce CHONKNORIS (Cholesky Newton-Kantorovich Neural Operator Residual Iterative System), an operator learning paradigm that can achieve machine precision. CHONKNORIS draws on numerical analysis: many nonlinear forward and inverse PDE problems are solvable by Newton-type methods. Rather than regressing the solution operator itself, our method regresses the Cholesky factors of the elliptic operator associated with Tikhonov-regularized Newton-Kantorovich updates. The resulting unrolled iteration yields a neural architecture whose machine-precision behavior follows from achieving a contractive map, requiring far lower accuracy than end-to-end approximation of the solution operator. We benchmark CHONKNORIS on a range of nonlinear forward and inverse problems, including a nonlinear elliptic equation, Burgers' equation, a nonlinear Darcy flow problem, the Calderón problem, an inverse wave scattering problem, and a problem from seismic imaging. We also present theoretical guarantees for the convergence of CHONKNORIS in terms of the accuracy of the emulated Cholesky factors. Additionally, we introduce a foundation model variant, FONKNORIS (Foundation Newton-Kantorovich Neural Operator Residual Iterative System), which aggregates multiple pre-trained CHONKNORIS experts for diverse PDEs to emulate the solution map of a novel nonlinear PDE. Our FONKNORIS model is able to accurately solve unseen nonlinear PDEs such as the Klein-Gordon and sine-Gordon equations.
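For readers unfamiliar with the classical machinery being unrolled here, the sketch below runs a Tikhonov-regularized Newton iteration with an explicit Cholesky solve on a toy two-dimensional nonlinear system. It involves no learning and is not the paper's architecture, which regresses the Cholesky factors instead of computing them; the toy system, starting point, and regularization strength are illustrative.

```python
# Plain numerical sketch of a Tikhonov-regularized Newton iteration on a toy
# nonlinear system F(x) = 0. CHONKNORIS learns Cholesky factors of the
# regularized operator instead of forming and factoring it exactly; nothing
# here is the paper's architecture, just the classical update it unrolls.
import numpy as np

def F(x):
    return np.array([x[0] ** 2 + x[1] - 2.0,
                     x[0] + x[1] ** 2 - 2.0])

def J(x):  # Jacobian of F
    return np.array([[2.0 * x[0], 1.0],
                     [1.0, 2.0 * x[1]]])

x = np.array([1.5, 0.5])
lam = 1e-6  # Tikhonov regularization strength (illustrative)
for _ in range(20):
    Fx = F(x)
    if np.linalg.norm(Fx) < 1e-14:
        break
    Jx = J(x)
    A = Jx.T @ Jx + lam * np.eye(2)   # regularized elliptic operator
    L = np.linalg.cholesky(A)         # the factor a learned model would emulate
    rhs = Jx.T @ Fx
    x = x - np.linalg.solve(L.T, np.linalg.solve(L, rhs))

print(x, np.linalg.norm(F(x)))  # root near (1.0, 1.0), residual at machine-precision level
```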
AI Summary - The NGM is used to learn an approximate Hessian matrix of the loss function, which is then inverted to obtain the optimal update direction. [3]
- The authors propose a new algorithm called Natural Gradient Method for Operator Learning (NGMOL), which combines the benefits of both NGM and operator learning. [3]
- Natural Gradient Method (NGM): A method for optimizing the parameters of a system by iteratively updating them based on the natural gradient of the loss function. [3]
- Operator Learning: The process of learning an operator that maps inputs to outputs, often used in inverse problems. [3]
- Hessian Matrix: A square matrix of second partial derivatives of a scalar-valued function. [3]
- Approximate Hessian: An approximation of the true Hessian matrix, often obtained using numerical methods. [3]
- The paper discusses a novel approach for operator learning using the natural gradient method (NGM) and its application to various inverse problems in physics, including seismic imaging. [2]
- The results show that NGMOL outperforms other state-of-the-art methods in terms of accuracy and computational efficiency. [1]
- NGMOL is applied to several inverse problems in physics, including seismic imaging, the Calderón problem, and inverse wave scattering. [0]