🎯 Top Personalized Recommendations
AI Summary - Anthropomimetic Uncertainty: A measure of the uncertainty expressed by a model in its output, which can be used to evaluate its reliability and trustworthiness. [3]
- The article discusses the limitations and challenges of Large Language Models (LLMs) in various applications. [2]
Abstract
Large language models (LLMs) are being rapidly integrated into decision-support tools, automation workflows, and AI-enabled software systems. However, their behavior in production environments remains poorly understood, and their failure patterns differ fundamentally from those of traditional machine learning models. This paper presents a system-level taxonomy of fifteen hidden failure modes that arise in real-world LLM applications, including multi-step reasoning drift, latent inconsistency, context-boundary degradation, incorrect tool invocation, version drift, and cost-driven performance collapse. Using this taxonomy, we analyze the growing gap in evaluation and monitoring practices: existing benchmarks measure knowledge or reasoning but provide little insight into stability, reproducibility, drift, or workflow integration. We further examine the production challenges associated with deploying LLMs, including observability limitations, cost constraints, and update-induced regressions. Finally, we outline high-level design principles for building reliable, maintainable, and cost-aware LLM-based systems. By framing LLM reliability as a system-engineering problem rather than a purely model-centric one, this work provides an analytical foundation for future research on evaluation methodology, AI system robustness, and dependable LLM deployment.
Why we think this paper is great for you:
This paper directly addresses the critical need for understanding and mitigating failures in AI systems, which is essential for building reliable applications in production environments. It offers a valuable taxonomy for ensuring the robustness of your deployed models.
AI Summary - Cross-validation: A method used to evaluate the performance of a model by training it on a subset of data and testing it on another subset. [3]
- Repeated double cross-validation: An extension of cross-validation that involves repeating the process multiple times with different subsets of data. [3]
- Performance metrics: Quantitative measures used to evaluate the performance of a model, such as accuracy, precision, and recall. [3]
- The article concludes that proper validation is crucial in chemometric models to avoid overfitting and ensure reliable results. [3]
- It highlights the limitations and potential biases of cross-validation methods. [2]
- The article discusses the importance of proper validation in chemometric models. [1]
Abstract
The validation of a data-driven model is the process of assessing the model's ability to generalize to new, unseen data in the population of interest. This paper proposes a set of general rules for model validation. These rules are designed to help practitioners create reliable validation plans and report their results transparently. While no validation scheme is flawless, these rules can help practitioners ensure their strategy is sufficient for practical use, openly discuss any limitations of their validation strategy, and report clear, comparable performance metrics.
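To make the repeated double (nested) cross-validation highlighted in the summary above concrete, here is a minimal sketch using scikit-learn; the dataset, estimator, hyperparameter grid, and number of repetitions are illustrative choices, not recommendations taken from the paper.

```python
# Minimal sketch of repeated double (nested) cross-validation with scikit-learn.
# The model, hyperparameter grid, dataset, and repetition count are illustrative.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

outer_scores = []
for repeat in range(5):  # repeat the whole double CV with different splits
    inner_cv = KFold(n_splits=5, shuffle=True, random_state=repeat)
    outer_cv = KFold(n_splits=5, shuffle=True, random_state=100 + repeat)

    # Inner loop selects hyperparameters; outer loop estimates generalization.
    model = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=inner_cv)
    scores = cross_val_score(model, X, y, cv=outer_cv, scoring="r2")
    outer_scores.extend(scores)

# Report mean and spread across all outer folds and repetitions.
print(f"R^2: {np.mean(outer_scores):.3f} +/- {np.std(outer_scores):.3f}")
```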
Why we think this paper is great for you:
This paper provides a foundational set of rules for assessing a model's ability to generalize, which is crucial for establishing reliable validation plans. You will find its guidance invaluable for ensuring the trustworthiness of your machine learning models.
AI Summary - The authors conduct experiments to assess the effectiveness of their automated annotation pipeline using three complementary analyses: agreement among MLLMs, accuracy relative to human annotations, and correctness of preference judgments produced by Qwen3-VL-Plus as an external oracle model. [3]
- The paper discusses a safety auditor and benchmark for multimodal systems, specifically multimodal large reasoning models, and their applications in various domains. [2]
Abstract
Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image-text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting strategies and refined via an MLRM- and human-based voting and verification pipeline. Furthermore, we propose a three-stage progressive training scheme combined with the data refinement process, enabling the model to learn nuanced and context-dependent safety preferences according to different risk levels. On our proposed test set covering both in-domain and out-of-domain scenarios, the GuardTrace-VL model achieves an F1 score of 93.1% on unsafe reasoning detection tasks, representing a 13.5% improvement in F1 score compared to the previous strongest multimodal safety defense methods. The code will be made publicly available.
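As a rough illustration of what auditing the full Question-Thinking-Answer pipeline looks like in practice, the sketch below scores each stage of a trace with a multimodal safety classifier. The `safety_auditor` function is a hypothetical placeholder, not the GuardTrace-VL API, which is not described at code level in the abstract.

```python
# Sketch of auditing a Question-Thinking-Answer (QTA) trace for unsafe content.
# `safety_auditor` is a hypothetical placeholder for a vision-aware guard model;
# it is NOT the GuardTrace-VL API.
from dataclasses import dataclass

@dataclass
class QTATrace:
    image_path: str
    question: str
    thinking: str  # intermediate reasoning emitted by the MLRM
    answer: str

def safety_auditor(image_path: str, text: str) -> float:
    """Return P(unsafe) for an image-text pair. Placeholder implementation."""
    return 0.0  # plug in a real multimodal safety classifier here

def audit(trace: QTATrace, threshold: float = 0.5) -> dict:
    # Score each stage separately so unsafe reasoning is flagged even when
    # the final answer looks benign (the gap the paper highlights).
    scores = {
        "question": safety_auditor(trace.image_path, trace.question),
        "thinking": safety_auditor(trace.image_path, trace.thinking),
        "answer": safety_auditor(trace.image_path, trace.answer),
    }
    return {"stage_scores": scores, "flagged": any(s >= threshold for s in scores.values())}

report = audit(QTATrace("scan.png", "What is shown?", "step-by-step reasoning...", "final answer"))
print(report["flagged"])
```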
Why we think this paper is great for you:
This work focuses on detecting unsafe reasoning in multimodal models, directly supporting your goal of deploying safe and resilient AI systems. It offers insights into monitoring and mitigating risks in complex production deployments.
Abstract
Deep Learning (DL) compilers have been widely utilized to optimize DL models for efficient deployment across various hardware. Due to their vital role in the DL ecosystem, ensuring their reliability and security is critical. However, existing approaches have limitations in testing optimization stages, the core functionality of DL compilers, due to the difficulty of generating optimization-aware tests. In this paper, we propose OATest, a novel approach for synthesizing optimization-aware computational graphs. The approach extracts patterns from documented optimization tests and incorporates them into seed computational graphs, enabling broader exploration of optimization paths. To guarantee the optimization-awareness of generated graphs, OATest introduces an edge-reusing strategy to establish strong connections between patterns and contexts. Additionally, to address the validity challenge for the generated graphs, OATest employs an auxiliary-layer addition strategy to resolve broken constraints. Equipped with two distinct test oracles, OATest applies differential testing to evaluate two widely used DL compilers (i.e., TVM and ONNX Runtime). Our experimental results show that OATest outperforms the state-of-the-art method by detecting more bugs and achieving higher code coverage in TVM and ONNX Runtime. Additionally, OATest uncovers 58 previously unknown bugs, 36 of which have been confirmed or fixed by developers.
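The differential-testing oracle at the core of this kind of work can be sketched in a few lines: run the same ONNX model through TVM and ONNX Runtime and compare outputs. This is not OATest itself, which synthesizes optimization-aware graphs to feed such an oracle; the model path, single input/output assumption, and tolerances below are illustrative, and a recent TVM with the Relay ONNX frontend is assumed.

```python
# Minimal differential-testing sketch: compile the same ONNX model with TVM and
# ONNX Runtime, run identical inputs, and compare outputs. Illustrative only;
# assumes a single-input, single-output model stored at "model.onnx".
import numpy as np
import onnx
import onnxruntime as ort
import tvm
from tvm import relay
from tvm.contrib import graph_executor

x = np.random.rand(1, 3, 224, 224).astype("float32")

# Reference result from ONNX Runtime.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
ref = sess.run(None, {input_name: x})[0]

# Result from TVM with optimizations enabled (opt_level=3 exercises graph passes).
onnx_model = onnx.load("model.onnx")
mod, params = relay.frontend.from_onnx(onnx_model, shape={input_name: list(x.shape)})
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)
runtime = graph_executor.GraphModule(lib["default"](tvm.cpu()))
runtime.set_input(input_name, x)
runtime.run()
out = runtime.get_output(0).numpy()

# Test oracle: any divergence beyond tolerance is a candidate optimization bug.
np.testing.assert_allclose(ref, out, rtol=1e-4, atol=1e-5)
```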
Why we think this paper is great for you:
Ensuring the reliability of deep learning compilers is vital for robust machine learning infrastructure, and this paper offers methods for generating effective tests. This directly contributes to your efforts in machine learning testing and building resilient systems.
AI Summary - The authors introduce two indicators, δ1 and δ2, to measure the impact of data contamination on model performance. [3]
- The study demonstrates that the proposed updated dataset consistently achieves lower variance compared to its counterparts, indicating more stable and robust evaluation metrics. [3]
- Data contamination: The inflated performance of a model on a specific dataset or benchmark due to the leakage of test data. [3]
- δ1: Measures the performance difference between the model's evaluation results after training solely on the test set and its zero-shot performance. [3]
- δ2: Compares the performance difference between models trained on both train and test sets versus those trained exclusively on the train set. [3]
- The updated dataset and indicators introduced in this work contribute to more reliable and robust evaluation metrics. [3]
- The paper proposes a novel framework for evaluating the performance of large language models (LLMs) in real-world scenarios, addressing the issue of data contamination. [2]
Abstract
Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training. Current studies attempt to mitigate this issue by modifying existing datasets or generating new ones from freshly collected information. However, these methods fall short of ensuring contamination-resilient evaluation, as they fail to fully eliminate pre-existing knowledge from models or preserve the semantic complexity of the original datasets. To address these limitations, we propose CoreEval, a Contamination-resilient Evaluation strategy for automatically updating data with real-world knowledge. This approach begins by extracting entity relationships from the original data and leveraging the GDELT database to retrieve relevant, up-to-date knowledge. The retrieved knowledge is then recontextualized and integrated with the original data, which is refined and restructured to ensure semantic coherence and enhanced task relevance. Ultimately, a robust data reflection mechanism is employed to iteratively verify and refine labels, ensuring consistency between the updated and original datasets. Extensive experiments on updated datasets validate the robustness of CoreEval, demonstrating its effectiveness in mitigating performance overestimation caused by data contamination.
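The δ1 and δ2 contamination indicators described in the summary above boil down to simple score differences. A minimal sketch follows, with placeholder scores; the exact evaluation protocol is defined in the paper.

```python
# Minimal sketch of the contamination indicators described in the summary above.
# The scores are placeholders; the paper defines the exact evaluation protocol.

def delta1(score_trained_on_test: float, score_zero_shot: float) -> float:
    """Gap between a model fine-tuned solely on the test set and its zero-shot score.
    A large value suggests the benchmark is easy to memorize (contamination-prone)."""
    return score_trained_on_test - score_zero_shot

def delta2(score_train_plus_test: float, score_train_only: float) -> float:
    """Gap between training on train+test versus train only.
    A large value indicates test-set leakage inflates measured performance."""
    return score_train_plus_test - score_train_only

# Illustrative numbers only:
print(delta1(0.91, 0.64))  # 0.27 -> evaluation likely contamination-sensitive
print(delta2(0.88, 0.79))  # 0.09
```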
Why we think this paper is great for you:
This research tackles data contamination to build more resilient datasets, which is fundamental for reliable LLM evaluation and overall model robustness. It provides practical insights for improving the integrity of your machine learning testing and validation processes.
AI Summary - The Logarithmic Overfitting Ratio (LOR) and Composite Overfitting Score (COS) are presented as essential metrics for evaluating model performance and detecting overfitting or underfitting.
- Logarithmic Overfitting Ratio (LOR): a metric that evaluates the difference between training and test errors, indicating overfitting or underfitting.
- Composite Overfitting Score (COS): a metric that combines the LOR with standard deviation to evaluate model performance and detect overfitting or underfitting.
- Prefer Monte Carlo Cross Validation (MC CV) over k-Fold Cross Validation for larger, stable datasets.
- Always report the LOR and COS values to facilitate model selection and avoid overfitting or underfitting.
- Use a uniform metric across models for fair comparison, such as MAE or accuracy.
- Highlight the best model(s) in bold or with visual cues to facilitate interpretation of results.
- Following these best practices helps select meaningful results, avoid overfitting or underfitting, and improve the overall quality of machine learning models.
- Report all findings, including training/test metrics and standard deviations, to provide a comprehensive understanding of model performance.
- Weakness: the paper assumes that readers are familiar with machine learning concepts and terminology. [3]
Abstract
Machine learning (ML) is increasingly adopted in scientific research, yet the quality and reliability of results often depend on how experiments are designed and documented. Poor baselines, inconsistent preprocessing, or insufficient validation can lead to misleading conclusions about model performance. This paper presents a practical and structured guide for conducting ML experiments in scientific applications, focussing on reproducibility, fair comparison, and transparent reporting. We outline a step-by-step workflow, from dataset preparation to model selection and evaluation, and propose metrics that account for overfitting and instability across validation folds, including the Logarithmic Overfitting Ratio (LOR) and the Composite Overfitting Score (COS). Through recommended practices and example reporting formats, this work aims to support researchers in establishing robust baselines and drawing valid evidence-based insights from ML models applied to scientific problems.
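The abstract names the Logarithmic Overfitting Ratio (LOR) and Composite Overfitting Score (COS) without giving formulas, so the sketch below uses assumed forms (log of the test-to-train error ratio, penalized by fold-to-fold instability) purely to illustrate overfitting-aware reporting across validation folds; the paper's own definitions should be used in practice.

```python
# Sketch of overfitting-aware reporting across validation folds.
# NOTE: the exact LOR/COS definitions are given in the paper; the formulas below
# are illustrative assumptions (log error ratio, penalized by fold instability).
import numpy as np

def logarithmic_overfitting_ratio(train_error: float, test_error: float) -> float:
    # Assumed form: log of test/train error; > 0 suggests overfitting, < 0 underfitting.
    return float(np.log(test_error / train_error))

def composite_overfitting_score(train_errors, test_errors) -> float:
    # Assumed form: mean LOR across folds plus the fold-to-fold standard deviation
    # of the test error, so unstable models are penalized as well.
    lors = [logarithmic_overfitting_ratio(tr, te) for tr, te in zip(train_errors, test_errors)]
    return float(np.mean(lors) + np.std(test_errors))

train_mae = [0.20, 0.21, 0.19, 0.22, 0.20]  # illustrative per-fold MAE
test_mae = [0.31, 0.35, 0.29, 0.40, 0.33]
print(round(composite_overfitting_score(train_mae, test_mae), 3))
```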
Why we think this paper is great for you:
This paper outlines best practices for designing and documenting ML experiments, directly enhancing the quality and reliability of your results. It offers guidance for improving your data science development environment and overall machine learning lifecycle.
Abstract
Real-world AI/ML workflows often apply inference computations to feature vectors joined from multiple datasets. To avoid the redundant AI/ML computations caused by repeated data records in the join's output, factorized ML has been proposed to decompose ML computations into sub-computations to be executed on each normalized dataset. However, there is insufficient discussion on how factorized ML could impact AI/ML inference over multi-way joins. To address this limitation, we propose a novel declarative InferF system, focusing on the factorization of arbitrary inference workflows represented as analyzable expressions over multi-way joins. We formalize our problem as flexibly pushing down partial factorized computations to qualified nodes in the join tree to minimize the overall inference computation and join costs, and propose two algorithms to solve it: (1) a greedy algorithm based on a per-node cost function that estimates the influence on overall latency if a subset of factorized computations is pushed to a node, and (2) a genetic algorithm for iteratively enumerating and evaluating promising factorization plans. We implement InferF on Velox, an open-source database engine from Meta, evaluate it on real-world datasets, observe up to 11.3x speedups, and systematically summarize the factors that determine when factorized ML can benefit AI/ML inference workflows.
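The core factorization idea, pushing partial inference computations down to the normalized tables instead of scoring every row of the materialized join, can be illustrated with a toy linear model over a two-table join. The tables, keys, and weights below are invented for illustration; InferF generalizes this to arbitrary inference expressions over multi-way joins.

```python
# Minimal sketch of factorized inference: push partial linear-model scores down
# to each normalized table, then combine them after the join, instead of
# recomputing features for every (repeated) row of the materialized join.
import pandas as pd

orders = pd.DataFrame({"cust_id": [1, 1, 2, 3], "amount": [10.0, 25.0, 5.0, 40.0]})
customers = pd.DataFrame({"cust_id": [1, 2, 3], "age": [34.0, 51.0, 23.0]})

w_amount, w_age, bias = 0.8, 0.1, -2.0  # illustrative linear-model weights

# Factorized step: score each table's contribution once per base row...
orders["partial"] = w_amount * orders["amount"]
customers["partial"] = w_age * customers["age"]

# ...so the join only has to add precomputed partials (customer 1's partial is
# computed once even though it appears in two joined rows).
joined = orders.merge(customers, on="cust_id", suffixes=("_o", "_c"))
joined["score"] = joined["partial_o"] + joined["partial_c"] + bias

print(joined[["cust_id", "score"]])
```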
Why we think this paper is great for you:
This paper explores optimizing AI/ML inference computations, which is key for efficient online inference and robust machine learning infrastructure. It provides valuable techniques for improving the performance and cost-effectiveness of your deployed models.
Data Science Development Tools
Abstract
The increasing availability of data and advancements in computational intelligence have accelerated the adoption of data-driven methods (DDMs) in product development. However, their integration into product development remains fragmented. This fragmentation stems from uncertainty, particularly the lack of clarity on what types of DDMs to use and when to employ them across the product development lifecycle. To address this, a necessary first step is to investigate the usage of DDMs in engineering design by identifying which methods are being used, at which development stages, and for what applications. This paper presents a PRISMA systematic literature review. The V-model as a product development framework was adopted and simplified into four stages: system design, system implementation, system integration, and validation. A structured search across Scopus, Web of Science, and IEEE Xplore (2014-2024) retrieved 1,689 records. After screening, 114 publications underwent full-text analysis. Findings show that machine learning (ML) and statistical methods dominate current practice, whereas deep learning (DL), though still less common, exhibits a clear upward trend in adoption. Additionally, supervised learning, clustering, regression analysis, and surrogate modeling are prevalent in the system design, implementation, and integration stages, but contributions to validation remain limited. Key challenges in existing applications include limited model interpretability, poor cross-stage traceability, and insufficient validation under real-world conditions. The review also highlights key limitations and opportunities, such as the need for interpretable hybrid models. This review is a first step toward design-stage guidelines; a follow-up synthesis should map computer science algorithms to engineering design problems and activities.
AI Summary - Data-Driven Methods (DDMs): approaches that utilize data to inform design decisions. [2]
- The use of data-driven methods in mechanical engineering design is increasing, with a focus on system integration and validation. [1]
Machine Learning Operations
Abstract
Neural operator learning methods have garnered significant attention in scientific computing for their ability to approximate infinite-dimensional operators. However, increasing their complexity often fails to substantially improve their accuracy, leaving them on par with much simpler approaches such as kernel methods and more traditional reduced-order models. In this article, we set out to address this shortcoming and introduce CHONKNORIS (Cholesky Newton-Kantorovich Neural Operator Residual Iterative System), an operator learning paradigm that can achieve machine precision. CHONKNORIS draws on numerical analysis: many nonlinear forward and inverse PDE problems are solvable by Newton-type methods. Rather than regressing the solution operator itself, our method regresses the Cholesky factors of the elliptic operator associated with Tikhonov-regularized Newton-Kantorovich updates. The resulting unrolled iteration yields a neural architecture whose machine-precision behavior follows from achieving a contractive map, requiring far lower accuracy than end-to-end approximation of the solution operator. We benchmark CHONKNORIS on a range of nonlinear forward and inverse problems, including a nonlinear elliptic equation, Burgers' equation, a nonlinear Darcy flow problem, the Calderón problem, an inverse wave scattering problem, and a problem from seismic imaging. We also present theoretical guarantees for the convergence of CHONKNORIS in terms of the accuracy of the emulated Cholesky factors. Additionally, we introduce a foundation model variant, FONKNORIS (Foundation Newton-Kantorovich Neural Operator Residual Iterative System), which aggregates multiple pre-trained CHONKNORIS experts for diverse PDEs to emulate the solution map of a novel nonlinear PDE. Our FONKNORIS model is able to accurately solve unseen nonlinear PDEs such as the Klein-Gordon and sine-Gordon equations.
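For readers unfamiliar with the classical machinery being unrolled here, the sketch below runs a Tikhonov-regularized Newton iteration with an explicit Cholesky solve on a toy two-dimensional nonlinear system. It involves no learning and is not the paper's architecture, which regresses the Cholesky factors instead of computing them; the toy system, starting point, and regularization strength are illustrative.

```python
# Plain numerical sketch of a Tikhonov-regularized Newton iteration on a toy
# nonlinear system F(x) = 0. CHONKNORIS learns Cholesky factors of the
# regularized operator instead of forming and factoring it exactly; nothing
# here is the paper's architecture, just the classical update it unrolls.
import numpy as np

def F(x):
    return np.array([x[0] ** 2 + x[1] - 2.0,
                     x[0] + x[1] ** 2 - 2.0])

def J(x):  # Jacobian of F
    return np.array([[2.0 * x[0], 1.0],
                     [1.0, 2.0 * x[1]]])

x = np.array([1.5, 0.5])
lam = 1e-6  # Tikhonov regularization strength (illustrative)
for _ in range(20):
    Fx = F(x)
    if np.linalg.norm(Fx) < 1e-14:
        break
    Jx = J(x)
    A = Jx.T @ Jx + lam * np.eye(2)   # regularized elliptic operator
    L = np.linalg.cholesky(A)         # the factor a learned model would emulate
    rhs = Jx.T @ Fx
    x = x - np.linalg.solve(L.T, np.linalg.solve(L, rhs))

print(x, np.linalg.norm(F(x)))  # root near (1.0, 1.0), residual at machine-precision level
```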
AI Summary - The NGM is used to learn an approximate Hessian matrix of the loss function, which is then inverted to obtain the optimal update direction. [3]
- The authors propose a new algorithm called Natural Gradient Method for Operator Learning (NGMOL), which combines the benefits of both NGM and operator learning. [3]
- Natural Gradient Method (NGM): A method for optimizing the parameters of a system by iteratively updating them based on the natural gradient of the loss function. [3]
- Operator Learning: The process of learning an operator that maps inputs to outputs, often used in inverse problems. [3]
- Hessian Matrix: A square matrix of second partial derivatives of a scalar-valued function. [3]
- Approximate Hessian: An approximation of the true Hessian matrix, often obtained using numerical methods. [3]
- The paper discusses a novel approach for operator learning using the natural gradient method (NGM) and its application to various inverse problems in physics, including seismic imaging. [2]
- The results show that NGMOL outperforms other state-of-the-art methods in terms of accuracy and computational efficiency. [1]
- NGMOL is applied to several inverse problems in physics, including seismic imaging, the Calderón problem, and inverse wave scattering. [0]