Hi!

Your personalized paper recommendations for 12–16 January 2026.
Shanghai Jiao Tong University
Paper visualization
AI Insights
  • The main idea is not explicitly stated, but the work centers on building more advanced LLMs that can perform tasks such as summarization, reasoning, and data science. [3]
  • In effect, it is about creating a highly capable assistant that can help with many different things. [3]
  • Weaknesses: lack of clear structure and organization, and insufficient context for some sections; the provided text reads as a collection of research papers, conference proceedings, and coding prompts related to developing LLMs for scientific discovery and machine learning tasks. [2]
Abstract
The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy: the ability to sustain strategic coherence and iterative correction over experimental cycles spanning days or weeks. While Large Language Models (LLMs) have demonstrated prowess in short-horizon reasoning, they are easily overwhelmed by execution details in the high-dimensional, delayed-feedback environments of real-world research, failing to consolidate sparse feedback into coherent long-term guidance. Here, we present ML-Master 2.0, an autonomous agent that masters ultra-long-horizon machine learning engineering (MLE), a representative microcosm of scientific discovery. By reframing context management as a process of cognitive accumulation, our approach introduces Hierarchical Cognitive Caching (HCC), a multi-tiered architecture inspired by computer systems that enables the structural differentiation of experience over time. By dynamically distilling transient execution traces into stable knowledge and cross-task wisdom, HCC allows agents to decouple immediate execution from long-term experimental strategy, effectively overcoming the scaling limits of static context windows. In evaluations on OpenAI's MLE-Bench under 24-hour budgets, ML-Master 2.0 achieves a state-of-the-art medal rate of 56.44%. Our findings demonstrate that ultra-long-horizon autonomy provides a scalable blueprint for AI capable of autonomous exploration beyond human-precedent complexities.
Why are we recommending this paper?
Due to your Interest in Machine Learning Lifecycle

This paper directly addresses the critical need for long-term autonomy in machine learning systems, aligning with your interest in Model Monitoring and Machine Learning Lifecycle. The focus on iterative correction over extended periods is highly relevant to building robust and adaptable AI agents.
University of Cambridge
Paper visualization
AI Insights
  • They use the MOBO (Multi-fidelity Bayesian Optimization) algorithm to search for optimal hyperparameters. [3]
  • Glossary: MOBO (Multi-fidelity Bayesian Optimization); CNN (Convolutional Neural Network); CIFAR-10 (an image-classification dataset); SOTA (state of the art); MAC (multiply-accumulate operation). [3]
  • The authors present an approach to optimizing machine learning models for both performance and energy efficiency. [2]
  • The energy consumption of the optimized model is only 0.39 mJ, making it more energy-efficient than the state-of-the-art Spike Aggregation Transformer (SAFormer). [1]
Abstract
The ubiquity of machine learning (ML) and the demand for ever-larger models bring an increase in energy consumption and environmental impact. However, little is known about the energy scaling laws in ML, and existing research focuses on training cost -- ignoring the larger cost of inference. Furthermore, tools for measuring the energy consumption of ML do not provide actionable feedback. To address these gaps, we developed Energy Consumption Optimiser (ECOpt): a hyperparameter tuner that optimises for energy efficiency and model performance. ECOpt quantifies the trade-off between these metrics as an interpretable Pareto frontier. This enables ML practitioners to make informed decisions about energy cost and environmental impact, while maximising the benefit of their models and complying with new regulations. Using ECOpt, we show that parameter and floating-point operation counts can be unreliable proxies for energy consumption, and observe that the energy efficiency of Transformer models for text generation is relatively consistent across hardware. These findings motivate measuring and publishing the energy metrics of ML models. We further show that ECOpt can have a net positive environmental impact and use it to uncover seven models for CIFAR-10 that improve upon the state of the art, when considering accuracy and energy efficiency together.
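ECOpt's core output, a Pareto frontier over model performance and energy, can be pictured with a simple dominance filter. This is an illustrative sketch, not ECOpt's actual code, and the trial values are invented.

```python
def pareto_front(points):
    """Return the Pareto-optimal subset of (accuracy, energy_mJ) pairs.

    A configuration is dropped if some other configuration has
    accuracy at least as high AND energy at least as low.
    """
    front = []
    for acc, energy in points:
        dominated = any(a >= acc and e <= energy and (a, e) != (acc, energy)
                        for a, e in points)
        if not dominated:
            front.append((acc, energy))
    return sorted(front)

# Hypothetical tuning results: (test accuracy, energy per inference in mJ)
trials = [(0.91, 5.0), (0.90, 2.0), (0.85, 0.4), (0.80, 0.5), (0.91, 4.0)]
print(pareto_front(trials))  # β†’ [(0.85, 0.4), (0.9, 2.0), (0.91, 4.0)]
```

Each surviving point is a defensible trade-off: moving along the frontier buys accuracy only at the cost of energy, which is exactly the decision ECOpt makes interpretable.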
Why are we recommending this paper?
Due to your Interest in Machine Learning Lifecycle

Given your interest in Machine Learning Infrastructure and MLOps, this paper’s exploration of energy scaling laws in ML is incredibly valuable. Understanding the cost of inference, particularly in terms of energy consumption, is essential for efficient and sustainable ML deployments.
University of Maryland
AI Insights
  • The argument from amazingness, which suggests that language models must be operating in human-like ways because they perform well on certain tasks, is unreliable and unnecessary for their value in computational cognitive modeling. [3]
  • Algorithmic/representational level: concerns the mechanisms and representations used by a system to solve problems. [3]
  • Computational theory level: concerns what problem the system is solving and why, independent of the mechanism that solves it. [3]
  • Language models are not suitable as model systems at any of Marr's three levels: implementation, algorithmic/representational, and computational theory. [2]
Abstract
Futrell and Mahowald claim LMs "serve as model systems", but an assessment at each of Marr's three levels suggests the claim is clearly not true at the implementation level, poorly motivated at the algorithmic-representational level, and problematic at the computational theory level. LMs are good candidates as tools; calling them cognitive models overstates the case and unnecessarily feeds LLM hype.
Why are we recommending this paper?
Due to your Interest in Model Monitoring

This paper’s critical examination of Large Language Models from multiple perspectivesβ€”including Marr’s levels of analysisβ€”is a strong match for your interest in Machine Learning Validation. It challenges fundamental assumptions and encourages a deeper understanding of model capabilities.
Université de Lyon
Paper visualization
AI Insights
  • Unusually, the validation set is much larger than the training set, the reverse of the standard split. [3]
  • The expectile or adaptive expectile LASSO estimators of the model parameters are calculated using a randomly selected training set from the entire database. [3]
  • Afterwards, the optimal model is selected by minimizing the cross-validation mean score on the validation set. [3]
  • The consistency of the CV expectile model estimator means that we can directly identify the zero coefficients of the true model, and estimate those that are not zero, with a probability that converges to 1 as the number of observations tends to infinity. [3]
  • This model selection method is useful for large models because it allows insignificant variables to be identified directly without the need for hypothesis testing. [3]
  • Applied to all observations, the adaptive LASSO expectile method also produces estimators that are shrunk directly to zero. [3]
  • Cross-validation on a large validation set further enhances the robustness of our model estimator. [3]
  • This result is particularly noteworthy for numerical work, as the ernet function in the R SALES package does not provide hypothesis tests on the parameters of a linear model. [3]
  • The CV expectile method is indicated when model errors are asymmetric, thereby precluding the classic assumption of normality and rendering the use of the classical least squares method inappropriate. [3]
  • The paper proposes and studies theoretically and numerically the expectile approach using the Train-Test split method, which has not been treated in the literature before. [2]
Abstract
For linear models that may have asymmetric errors, we study variable selection by cross-validation. The data are split into training and validation sets, with the number of observations in the validation set much larger than in the training set. For the model coefficients, the expectile or adaptive LASSO expectile estimators are calculated on the training set. These estimators will be used to calculate the cross-validation mean score (CVS) on the validation set. We show that the model that minimizes CVS is consistent in two cases: when the number of explanatory variables is fixed or when it depends on the number of observations. Monte Carlo simulations confirm the theoretical results and demonstrate the superiority of our estimation method compared to two others in the literature. The usefulness of the CV expectile model selection technique is illustrated by applying it to real data sets.
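The expectile estimators above are built on an asymmetrically weighted squared loss. The following minimal sketch (not the paper's code) shows that loss and computes a sample expectile by the standard fixed-point iteration; the data are invented.

```python
def expectile_loss(u, tau):
    """Asymmetric squared loss used in expectile regression:
    residuals above the fit are weighted tau, those below 1 - tau."""
    w = tau if u >= 0 else 1.0 - tau
    return w * u * u

def sample_expectile(xs, tau, iters=100):
    """Tau-expectile of a sample via fixed-point iteration;
    tau = 0.5 recovers the ordinary mean."""
    mu = sum(xs) / len(xs)
    for _ in range(iters):
        hi = [x for x in xs if x > mu]
        lo = [x for x in xs if x <= mu]
        num = tau * sum(hi) + (1 - tau) * sum(lo)
        den = tau * len(hi) + (1 - tau) * len(lo)
        mu = num / den
    return mu

data = [1.0, 2.0, 3.0, 10.0]          # right-skewed: asymmetric errors
print(sample_expectile(data, 0.5))     # 4.0 (the ordinary mean)
print(sample_expectile(data, 0.9))     # β‰ˆ 8.0, pulled toward the right tail
```

The asymmetry is the point: when errors are skewed, high-tau expectiles remain informative where least squares (tau = 0.5) is not.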
Why are we recommending this paper?
Due to your Interest in Machine Learning Validation

With your focus on Machine Learning Testing and Fault tolerance, this paper’s use of cross-validation for model selection is directly relevant. The study of asymmetric errors and expectile regression provides valuable insights into robust model evaluation techniques.
Michigan State University
AI Insights
  • Calibration: A measure of how well a model's confidence matches its actual performance. [3]
  • In other words, it measures the accuracy of the model's predictions when compared to its own uncertainty estimates. [3]
  • The authors emphasize that accurate uncertainty estimation is crucial for reliable decision-making in applications such as question-answering, natural language processing, and medical diagnosis. [3]
  • The paper discusses the importance of evaluating the uncertainty of large language models (LLMs) in their responses. [2]
  • The paper focuses primarily on theoretical aspects and does not provide concrete experimental results or comparisons with existing methods. [1]
Abstract
The miscalibration of Large Reasoning Models (LRMs) undermines their reliability in high-stakes domains, necessitating methods to accurately estimate the confidence of their long-form, multi-step outputs. To address this gap, we introduce the Reasoning Model Confidence estimation Benchmark (RMCB), a public resource of 347,496 reasoning traces from six popular LRMs across different architectural families. The benchmark is constructed from a diverse suite of datasets spanning high-stakes domains, including clinical, financial, legal, and mathematical reasoning, alongside complex general reasoning benchmarks, with correctness annotations provided for all samples. Using RMCB, we conduct a large-scale empirical evaluation of over ten distinct representation-based methods, spanning sequential, graph-based, and text-based architectures. Our central finding is a persistent trade-off between discrimination (AUROC) and calibration (ECE): text-based encoders achieve the best AUROC (0.672), while structurally-aware models yield the best ECE (0.148), with no single method dominating both. Furthermore, we find that increased architectural complexity does not reliably outperform simpler sequential baselines, suggesting a performance ceiling for methods relying solely on chunk-level hidden states. This work provides the most comprehensive benchmark for this task to date, establishing rigorous baselines and demonstrating the limitations of current representation-based paradigms.
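The calibration metric at the center of the benchmark, ECE, is simple to compute: bin predictions by confidence and compare each bin's accuracy to its mean confidence. This is a generic sketch with invented scores, not RMCB's implementation.

```python
def expected_calibration_error(confs, correct, n_bins=10):
    """Expected Calibration Error: size-weighted average of
    |bin accuracy - bin mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confs, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        ece += len(b) / len(confs) * abs(acc - avg_conf)
    return ece

# Hypothetical confidence scores for four reasoning traces
confs = [0.95, 0.95, 0.65, 0.65]
correct = [1, 0, 1, 1]
print(expected_calibration_error(confs, correct))  # β‰ˆ 0.4
```

Note that ECE and AUROC reward different things, which is why the paper can find methods that win on one and lose on the other.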
Why are we recommending this paper?
Due to your Interest in Online inference

Considering your interest in Machine Learning Validation and Model Monitoring, this paper addresses a critical issueβ€”the calibration of confidence estimates in Large Reasoning Models. The RMCB benchmark provides a systematic approach to assessing the reliability of these models in high-stakes scenarios.
Yunnan University
AI Insights
  • The paper presents a new ensemble method called the Behavioral Profiling Ensemble (BPE) for improving the performance of machine learning models. [3]
  • BPE is designed to handle both homogeneous and heterogeneous ensembles, and it can be used with various types of decision trees. [3]
  • In the homogeneous case, BPE exhibits statistically significant superiority over stacking and all DES variants, but not over simple average and weighted average. [3]
  • The paper also compares against Dynamic Ensemble Selection (DES), an existing approach that estimates each model's competence in different regions of the instance space. [3]
  • The results show that BPE outperforms other methods in both homogeneous and heterogeneous ensembles, especially when the decision trees are unpruned. [3]
  • Glossary: BPE (Behavioral Profiling Ensemble); DES (Dynamic Ensemble Selection). BPE is a robust ensemble method that can handle both homogeneous and heterogeneous cases. [3]
  • It outperforms other methods in both scenarios, especially when the decision trees are unpruned. [3]
  • The paper does not provide a clear explanation of how BPE works in detail. [1]
Abstract
Ensemble learning is widely recognized as a pivotal strategy for pushing the boundaries of predictive performance. Traditional static ensemble methods, such as Stacking, typically assign weights by treating each base learner as a holistic entity, thereby overlooking the fact that individual models exhibit varying degrees of competence across different regions of the instance space. To address this limitation, Dynamic Ensemble Selection (DES) was introduced. However, both static and dynamic approaches predominantly rely on the divergence among different models as the basis for integration. This inter-model perspective neglects the intrinsic characteristics of the models themselves and necessitates a heavy reliance on validation sets for competence estimation. In this paper, we propose the Behavioral Profiling Ensemble (BPE) framework, which introduces a novel paradigm shift. Unlike traditional methods, BPE constructs a ``behavioral profile'' intrinsic to each model and derives integration weights based on the deviation between the model's response to a specific test instance and its established behavioral profile. Extensive experiments on both synthetic and real-world datasets demonstrate that the algorithm derived from the BPE framework achieves significant improvements over state-of-the-art ensemble baselines. These gains are evident not only in predictive accuracy but also in computational efficiency and storage resource utilization across various scenarios.
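One plausible reading of the behavioral-profile idea: summarize each model's past responses (here, just the mean and spread of its confidences) and down-weight a model whose response to the current instance deviates from its own profile. This is a speculative sketch of the concept, not the paper's algorithm, and all numbers are invented.

```python
import statistics

def profile(scores):
    """A model's 'behavioral profile', simplified to the mean and
    standard deviation of its past prediction confidences."""
    return statistics.mean(scores), statistics.stdev(scores)

def bpe_weights(profiles, responses):
    """Weight each model by how typical its current response is
    relative to its own profile (small deviation = high weight)."""
    raw = []
    for (mu, sd), r in zip(profiles, responses):
        z = abs(r - mu) / sd          # deviation from the model's own profile
        raw.append(1.0 / (1.0 + z))
    total = sum(raw)
    return [w / total for w in raw]

# Two hypothetical base learners with past confidence histories
profiles = [profile([0.8, 0.82, 0.78, 0.8]),
            profile([0.6, 0.7, 0.5, 0.6])]
weights = bpe_weights(profiles, responses=[0.81, 0.95])
print(weights)  # model 1 behaves typically, so it gets the larger weight
```

The appeal of such an intrinsic criterion is visible even here: no validation set is consulted at prediction time.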
Why are we recommending this paper?
Due to your Interest in Model Monitoring
Massachusetts Institute of Technology (MIT)
Paper visualization
AI Insights
  • The omitted variable bias (OVB) in the sample selection model with confounding in selection can be represented as E[(g0-gs)(Ξ±0-Ξ±s)], where g0 and gs are the long and short outcome regressions, and Ξ±0 and Ξ±s are the corresponding long and short Riesz representers. [3]
  • The sensitivity parameter C^2_S measures the share of variation in the long representer that is not captured by the short representer. [3]
  • Omitted variable bias (OVB): the difference between the true parameter and the estimated parameter due to omitted variables. [3]
  • The sensitivity parameter C^2_S measures the gain in precision from observing the unobserved confounder A. [3]
  • The terms 1/(p_d(X)Ο€(Β·)) grow when either the treatment propensity p_d(X) or the selection probability Ο€(Β·) is small, summarizing the overlap and selection difficulty through an average inverse-probability scale. [2]
  • The OVB admits the representation |ΞΈ0 - ΞΈs|^2 = ρ^2 B^2 ≀ B^2, with B^2 identified from the observed data. [1]
  • The OVB can be represented as E[(g0-gs)(Ξ±0-Ξ±s)], and its magnitude is bounded by B^2. [0]
Abstract
In this paper, we extend the Riesz representation framework to causal inference under sample selection, where both treatment assignment and outcome observability are non-random. Formulating the problem in terms of a Riesz representer enables stable estimation and a transparent decomposition of omitted variable bias into three interpretable components: a data-identified scale factor, outcome confounding strength, and selection confounding strength. For estimation, we employ the ForestRiesz estimator, which accounts for selective outcome observability while avoiding the instability associated with direct propensity score inversion. We assess finite-sample performance through a simulation study and show that conventional double machine learning approaches can be highly sensitive to tuning parameters due to their reliance on inverse probability weighting, whereas the ForestRiesz estimator delivers more stable performance by leveraging automatic debiased machine learning. In an empirical application to the gender wage gap in the U.S., we find that our ForestRiesz approach yields larger treatment effect estimates than a standard double machine learning approach, suggesting that ignoring sample selection leads to an underestimation of the gender wage gap. Sensitivity analysis indicates that implausibly strong unobserved confounding would be required to overturn our results. Overall, our approach provides a unified, robust, and computationally attractive framework for causal inference under sample selection.
Why are we recommending this paper?
Due to your Interest in Machine Learning Validation
HDC LABS
AI Insights
  • The framework integrates an object detector, an aesthetics assessor, a vision-language model for prompt-image alignment, and a trained preference classifier to incorporate user-specific criteria. [2]
  • The paper proposes a multi-stage diffusion-based pipeline for domain-specific dataset generation and automated curation. [1]
Abstract
In this paper, we present an automated pipeline for generating domain-specific synthetic datasets with diffusion models, addressing the distribution shift between pre-trained models and real-world deployment environments. Our three-stage framework first synthesizes target objects within domain-specific backgrounds through controlled inpainting. The generated outputs are then validated via a multi-modal assessment that integrates object detection, aesthetic scoring, and vision-language alignment. Finally, a user-preference classifier is employed to capture subjective selection criteria. This pipeline enables the efficient construction of high-quality, deployable datasets while reducing reliance on extensive real-world data collection.
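The multi-modal validation stage can be pictured as sequential gating on per-image scores. In the real pipeline each score comes from a model (object detector, aesthetics assessor, vision-language alignment); here they are assumed fields on hypothetical candidates, and the thresholds are invented.

```python
def curate(candidates, thresholds=(0.5, 5.0, 0.25)):
    """Three-stage gating sketch: keep an image only if it passes the
    detector, aesthetics, and prompt-alignment checks in turn."""
    det_t, aes_t, align_t = thresholds
    kept = []
    for c in candidates:
        if c["det_conf"] < det_t:      # detector: is the target object present?
            continue
        if c["aesthetic"] < aes_t:     # aesthetics assessor score
            continue
        if c["clip_align"] < align_t:  # vision-language prompt alignment
            continue
        kept.append(c["name"])
    return kept

cands = [
    {"name": "img_001", "det_conf": 0.9, "aesthetic": 6.1, "clip_align": 0.31},
    {"name": "img_002", "det_conf": 0.3, "aesthetic": 7.0, "clip_align": 0.40},
    {"name": "img_003", "det_conf": 0.8, "aesthetic": 4.2, "clip_align": 0.35},
]
print(curate(cands))  # β†’ ['img_001']
```

A trained user-preference classifier would then act as a final, subjective gate on whatever survives these objective checks.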
Why are we recommending this paper?
Due to your Interest in Machine Learning Deployment
Australian National University
AI Insights
  • Machine learning can be used to improve the runtime performance of GEMM routines by automatically selecting the optimal number of threads. [2]
  • The approach can be extended to other BLAS operations and to a more diverse set of computer systems, including heterogeneous architectures that use a mix of CPUs and accelerators. [1]
Abstract
The GEneral Matrix Multiplication (GEMM) is one of the essential algorithms in scientific computing. Single-thread GEMM implementations are well-optimised with techniques like blocking and autotuning. However, due to the complexity of modern multi-core shared memory systems, it is challenging to determine the number of threads that minimises the multi-thread GEMM runtime. We present a proof-of-concept approach to building an Architecture and Data-Structure Aware Linear Algebra (ADSALA) software library that uses machine learning to optimise the runtime performance of BLAS routines. More specifically, our method uses a machine learning model on-the-fly to automatically select the optimal number of threads for a given GEMM task based on the collected training data. Test results on two different HPC node architectures, one based on a two-socket Intel Cascade Lake and the other on a two-socket AMD Zen 3, revealed a 25 to 40 per cent speedup compared to traditional GEMM implementations in BLAS when using GEMM of memory usage within 100 MB.
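The thread-selection idea can be sketched as a nearest-neighbour lookup over offline timings: predict each thread count's runtime for the requested GEMM shape from the closest measured shape, then pick the fastest. This illustrates the concept only, not the ADSALA model, and the timings are invented.

```python
import math

def best_threads(training_data, m, n, k):
    """Pick the thread count whose nearest measured GEMM shape
    (in log-size distance) had the lowest runtime.

    training_data: list of ((m, n, k, threads), runtime_seconds)
    pairs, e.g. gathered offline by timing dgemm at several sizes.
    """
    def dist(a, b):
        return sum((math.log(x) - math.log(y)) ** 2 for x, y in zip(a, b))

    by_threads = {}
    for (tm, tn, tk, t), rt in training_data:
        by_threads.setdefault(t, []).append(((tm, tn, tk), rt))
    # Predicted runtime per thread count = runtime of the nearest shape.
    predicted = {t: min(rows, key=lambda r: dist(r[0], (m, n, k)))[1]
                 for t, rows in by_threads.items()}
    return min(predicted, key=predicted.get)

# Invented timings: small GEMMs favour one thread, large ones favour many.
data = [((256, 256, 256, 1), 0.004), ((256, 256, 256, 16), 0.009),
        ((4096, 4096, 4096, 1), 9.8), ((4096, 4096, 4096, 16), 0.9)]
print(best_threads(data, 300, 300, 300))     # β†’ 1
print(best_threads(data, 4000, 4000, 4000))  # β†’ 16
```

The sketch captures the core trade-off: parallel overhead dominates for small matrices, so blindly using all cores can lose badly.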
Why are we recommending this paper?
Due to your Interest in Machine Learning Operations
PingCAP
Paper visualization
AI Insights
  • HDC generation involves extracting representative entities for each database to facilitate efficient data exploration across multiple databases. [3]
  • It also includes a self-refinement chain to correct errors in generated SQL statements. [3]
  • The system demonstrates its capabilities through two real-world scenarios: the Financial dataset and the Bird dataset, showcasing its ability to provide insights and facilitate user-system interaction. [3]
  • HDC: Hierarchical Data Context - a summary of the data that includes a description, keywords, table information, and more. [3]
  • TiChart: Chart Selection - a component that selects the most suitable chart type to present analysis results by visualization. [3]
  • Exploration Efficiency: The ability of the system to efficiently explore data across multiple databases. [3]
  • TiInsight is a SQL-based automated cross-domain exploratory data analysis system that utilizes large language models to facilitate user-system interaction and provide powerful hierarchical data context (HDC) generation, text-to-SQL (TiSQL), chart selection (TiChart), and exploration efficiency. [2]
  • TiSQL is a schema filtering framework based on the map-reduce paradigm that filters tables and columns using clarified questions and cosine similarity. [1]
Abstract
The SQL-based exploratory data analysis has garnered significant attention within the data analysis community. The emergence of large language models (LLMs) has facilitated the paradigm shift from manual to automated data exploration. However, existing methods generally lack the ability for cross-domain analysis, and the exploration of LLMs capabilities remains insufficient. This paper presents TiInsight, an SQL-based automated cross-domain exploratory data analysis system. First, TiInsight offers a user-friendly GUI enabling users to explore data using natural language queries. Second, TiInsight offers a robust cross-domain exploratory data analysis pipeline: hierarchical data context (i.e., HDC) generation, question clarification and decomposition, text-to-SQL (i.e., TiSQL), and data visualization (i.e., TiChart). Third, we have implemented and deployed TiInsight in the production environment of PingCAP and demonstrated its capabilities using representative datasets. The demo video is available at https://youtu.be/JzYFyYd-emI.
Why are we recommending this paper?
Due to your Interest in Data Science Development Tools
Electronics and Telecommunications Research Institute
AI Insights
  • The unified dynamical equation proposed here may serve as a foundational law for understanding collective intelligence in both natural and artificial systems. [3]
  • Learning and homeostatic regulation are naturally interpreted as processes that reshape the TDOS, selectively stabilizing slow collective modes that support robust inference, memory, and context-dependent computation. [3]
  • Learning, inference, and emergence are not distinct processes, but manifestations of a single dynamical principle governing high-dimensional adaptive systems. [3]
  • The precise biological mechanisms implementing effective noise, homeostatic regulation, and metric learning in neural circuits warrant further investigation. [3]
  • Cognitive function is governed not by microscopic units or precise activity patterns, but by the collective organization of dynamic time scales. [2]
  • Glossary: TDOS (time-scale density of states); MSRJD (the Martin–Siggia–Rose–Janssen–de Dominicis formalism). The present theory explains how stable cognition can emerge from heterogeneous, stochastic, and irregular substrates. [1]
  • The role of higher-order loop corrections and nonperturbative effects near strong criticality remains an important open problem. [0]
Abstract
Learning, inference, and emergence in biological and artificial systems are often studied within disparate theoretical frameworks, ranging from energy-based models to recurrent and attention-based architectures. Here we develop a unified dynamical field theory in which learning and inference are governed by a minimal stochastic dynamical equation admitting a Martin--Siggia--Rose--Janssen--de Dominicis formulation. Within this framework, inference corresponds to saddle-point trajectories of the associated action, while fluctuation-induced loop corrections render collective modes dynamically emergent and generate nontrivial dynamical time scales. A central result of this work is that cognitive function is controlled not by microscopic units or precise activity patterns, but by the collective organization of dynamical time scales. We introduce the \emph{time-scale density of states} (TDOS) as a compact diagnostic that characterizes the distribution of collective relaxation modes governing inference dynamics. Learning and homeostatic regulation are naturally interpreted as processes that reshape the TDOS, selectively generating slow collective modes that support stable inference, memory, and context-dependent computation despite stochasticity and structural irregularity. This framework unifies energy-based models, recurrent neural networks, transformer architectures, and biologically motivated homeostatic dynamics within a single physical description, and provides a principled route toward understanding cognition as an emergent dynamical phenomenon.
Why are we recommending this paper?
Due to your Interest in Online inference
RWTH Aachen University
AI Insights
  • The paper discusses two logics over weighted structures: FO(SUM) and IFP(SUM), with respect to their ability to express queries over feedforward neural networks. [3]
  • Other aggregation operators (counting, arithmetic mean, minimum and maximum) can be expressed in terms of summation alone. [3]
  • Glossary: FO(SUM) (first-order logic over weighted structures with summation); IFP(SUM) (inflationary fixed-point logic over weighted structures with summation); FNNs (feedforward neural networks); Rlin (linear functions). [3]
  • FO(SUM) can simulate FO(Rlin,f) on bounded depth FNNs, but it is unclear whether this result can be extended from Rlin to R or lifted to FNNs of arbitrary input dimension. [2]
Abstract
In this paper, I discuss two logics for weighted finite structures: first-order logic with summation (FO(SUM)) and its recursive extension IFP(SUM). These logics originate from foundational work by GrΓ€del, Gurevich, and Meer in the 1990s. In recent joint work with Standke, Steegmans, and Van den Bussche, we have investigated these logics as query languages for machine learning models, specifically neural networks, which are naturally represented as weighted graphs. I present illustrative examples of queries to neural networks that can be expressed in these logics and discuss fundamental results on their expressiveness and computational complexity.
Why are we recommending this paper?
Due to your Interest in Machine Learning Infrastructure
Technical University of Munich
AI Insights
  • Statistical model Ο†w: a neural network that maps an input instance x to parameters ΞΈ, which are then passed to the CO oracle. [3]
  • COAML is a framework that combines machine learning and optimization techniques to find the best solution. [3]
  • It's like having a super-smart assistant that can learn from data and make predictions to help you solve the problem efficiently. [3]
  • The paper discusses COAML (combinatorial optimization augmented machine learning), a framework that integrates machine learning with combinatorial optimization problems. [2]
Abstract
Combinatorial optimization augmented machine learning (COAML) has recently emerged as a powerful paradigm for integrating predictive models with combinatorial decision-making. By embedding combinatorial optimization oracles into learning pipelines, COAML enables the construction of policies that are both data-driven and feasibility-preserving, bridging the traditions of machine learning, operations research, and stochastic optimization. This paper provides a comprehensive overview of the state of the art in COAML. We introduce a unifying framework for COAML pipelines, describe their methodological building blocks, and formalize their connection to empirical cost minimization. We then develop a taxonomy of problem settings based on the form of uncertainty and decision structure. Using this taxonomy, we review algorithmic approaches for static and dynamic problems, survey applications across domains such as scheduling, vehicle routing, stochastic programming, and reinforcement learning, and synthesize methodological contributions in terms of empirical cost minimization, imitation learning, and reinforcement learning. Finally, we identify key research frontiers. This survey aims to serve both as a tutorial introduction to the field and as a roadmap for future research at the interface of combinatorial optimization and machine learning.
Why are we recommending this paper?
Due to your Interest in Machine Learning Infrastructure
University of Edinburgh
AI Insights
  • The problem of view change optimization in parallel BFT systems involves minimizing network delay when a leader fails and needs to be replaced. [2]
Abstract
The parallel Byzantine Fault Tolerant (BFT) protocol is viewed as a promising solution to address the consensus scalability issue of the permissioned blockchain. One of the main challenges in parallel BFT is the view change process that happens when the leader node fails, which can lead to performance bottlenecks. Existing parallel BFT protocols typically rely on passive view change mechanisms with blind leader rotation. Such approaches frequently select unavailable or slow nodes as leaders, resulting in degraded performance. To address these challenges, we propose a View Change Optimization (VCO) model based on mixed integer programming that optimizes leader selection and follower reassignment across parallel committees by considering communication delays and failure scenarios. We applied a decomposition method with efficient subproblems and improved benders cuts to solve the VCO model. Leveraging the results of improved decomposition solution method, we propose an efficient iterative backup leader selection algorithm as views proceed. By performing experiments in Microsoft Azure cloud environments, we demonstrate that the VCO-driven parallel BFT outperforms existing configuration methods under both normal operation and faulty condition. The results show that the VCO model is effective as network size increases, making it a suitable solution for high-performance parallel BFT systems.
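The VCO objective can be illustrated at toy scale: choose distinct leaders so that the worst leader-to-follower delay is minimized. The paper solves this as a mixed-integer program with Benders decomposition; the brute force below is only a sketch of the objective, with an invented delay matrix.

```python
from itertools import permutations

def select_leaders(delay, n_committees):
    """Assign one distinct leader per committee so the worst
    leader-to-follower delay is minimal (brute-force analogue of VCO).

    delay[i][j] = measured delay from node i to node j (ms).
    """
    nodes = range(len(delay))
    best, best_cost = None, float("inf")
    for leaders in permutations(nodes, n_committees):
        # Each leader's cost: its worst delay to any non-leader follower.
        cost = max(max(delay[l][f] for f in nodes if f not in leaders)
                   for l in leaders)
        if cost < best_cost:
            best, best_cost = leaders, cost
    return best, best_cost

# Hypothetical 4-node delay matrix; node 3 is slow to reach
d = [[0, 10, 12, 90],
     [10, 0, 11, 95],
     [12, 11, 0, 92],
     [90, 95, 92, 0]]
print(select_leaders(d, 2))  # β†’ ((0, 2), 92)
```

Even in this toy, blind rotation can land on node 3 and pay 95 ms; delay-aware selection is what the real model optimizes at scale.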
Why are we recommending this paper?
Due to your Interest in Fault tolerance
University of Milano-Bicocca
AI Insights
  • The RAG system uses a combination of natural language processing and machine learning algorithms to retrieve relevant information from a knowledge base and generate accurate responses. [3]
  • However, it struggles with ambiguous terminology, complex multi-condition formulations, and unclear phrasing. [3]
  • The results show that Llama is faster but less accurate than Qwen, while Mistral falls in between. [3]
  • Retrieval-Augmented Generation (RAG): a technique that retrieves relevant information from a knowledge base and feeds it to a language model to generate grounded responses. [3]
  • The results may not generalize to other domains or applications. [3]
  • The study evaluates the performance of a Retrieval-Augmented Generation (RAG) system for troubleshooting procedures. [2]
  • The results suggest that Llama is a suitable choice when speed is a priority, while Qwen is preferred when accuracy is the main concern. [1]
Abstract
In today's complex industrial environments, operators must often navigate through extensive technical manuals to identify troubleshooting procedures that may help them react to observed failure symptoms. These manuals, written in natural language, describe many steps in detail. Unfortunately, the number, magnitude, and articulation of these descriptions can significantly slow down and complicate the retrieval of the correct procedure during critical incidents. Interestingly, Retrieval Augmented Generation (RAG) enables the development of tools based on conversational interfaces that can assist operators in their retrieval tasks, improving their capability to respond to incidents. This paper presents the results of a set of experiments that derive from the analysis of the troubleshooting procedures available in Fincantieri, a large international company developing complex naval cyber-physical systems. Results show that RAG can assist operators in reacting promptly to failure symptoms, although specific measures have to be taken to cross-validate recommendations before actuating them.
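As a loose illustration of the retrieval step in such a RAG pipeline, here is a toy word-overlap ranker, not the system evaluated in the paper; a real pipeline would use embedding similarity, and the procedure texts below are invented.

```python
def retrieve(query, procedures, top_k=1):
    """Rank troubleshooting procedures by Jaccard word overlap
    with the query (a crude stand-in for the vector similarity
    a real RAG retriever would compute) and return the best."""
    q = set(query.lower().split())

    def score(doc):
        d = set(doc.lower().split())
        return len(q & d) / len(q | d)

    return sorted(procedures, key=score, reverse=True)[:top_k]

# Invented example procedures:
procedures = [
    "If the coolant pump alarm triggers, check the inlet valve",
    "On network switch failure, reboot the backup controller",
    "When the coolant temperature rises, inspect the heat exchanger",
]
best = retrieve("coolant pump alarm", procedures)
```

The retrieved snippet would then be passed as context to the LLM that drafts the operator-facing answer; the paper's caveat about cross-validating recommendations applies at that generation step.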
Why are we recommending this paper?
Due to your Interest in Fault tolerance
Heidelberg University
AI Insights
  • Unordered collection was the most prevalent root cause of flakiness, accounting for 72 instances out of 115 total flaky tests. [3]
  • The study found that LLM-generated tests had a higher proportion of flaky tests compared to existing tests. [2]
  • RQ3 (flakiness transfer experiment): investigates whether flakiness is transferred from one generation iteration to the next. [0]
Abstract
Flaky tests are a common problem in software testing. They produce inconsistent results when executed multiple times on the same code, invalidating the assumption that a test failure indicates a software defect. Recent work on LLM-based test generation has identified flakiness as a potential problem with generated tests. However, its prevalence and underlying causes are unclear. We examined the flakiness of LLM-generated tests in the context of four relational database management systems: SAP HANA, DuckDB, MySQL, and SQLite. We amplified test suites with two LLMs, GPT-4o and Mistral-Large-Instruct-2407, to assess the flakiness of the generated test cases. Our results suggest that generated tests have a slightly higher proportion of flaky tests compared to existing tests. Based on a manual inspection, we found that the most common root cause of flakiness was the reliance of a test on a certain order that is not guaranteed ("unordered collection"), which was present in 72 of 115 flaky tests (63%). Furthermore, both LLMs transferred the flakiness from the existing tests to the newly generated tests via the provided prompt context. Our experiments suggest that flakiness transfer is more prevalent in closed-source systems such as SAP HANA than in open-source systems. Our study informs developers on what types of flakiness to expect from LLM-generated tests. It also highlights the importance of providing LLMs with tailored context when employing LLMs for test generation.
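The "unordered collection" root cause the study identifies is easy to reproduce in miniature. The sketch below (hypothetical function names) shows a check that compares a Python set against a fixed list, which may pass or fail depending on hash-randomized iteration order, and the sorted variant that removes the flakiness.

```python
def user_roles():
    # Sets make no guarantee about iteration order; CPython's
    # per-process hash randomization means the order can differ
    # between runs of the same code.
    return {"admin", "editor", "viewer"}

def flaky_check():
    # Flaky: depends on the set's iteration order.
    return list(user_roles()) == ["admin", "editor", "viewer"]

def stable_check():
    # Deterministic: sorting (or comparing as sets) removes
    # the order dependence.
    return sorted(user_roles()) == ["admin", "editor", "viewer"]
```

Per the paper, this pattern accounted for 63% of the flaky tests observed, and LLMs reproduced it when flaky existing tests appeared in the prompt context.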
Why are we recommending this paper?
Due to your Interest in Machine Learning Testing
MIT
AI Insights
  • CURE-TSR: Challenging Unreal and Real Environments for Traffic Sign Recognition. [3]
  • MUNIT-style variation model: a type of generative adversarial network (GAN) used for image-to-image translation. [3]
  • Nuisance coverage: training models to be robust against natural corruptions by covering a wide range of possible corruptions. [3]
  • Combining nuisance coverage and adversarial refinement is a promising direction for robustness to natural corruptions. [3]
  • Future work should extend this evaluation to more datasets, architectures, and corruption types to test the generality of these findings. [3]
  • Model-based training improves robustness and calibration. [2]
Abstract
Robustness to natural corruptions remains a critical challenge for reliable deep learning, particularly in safety-sensitive domains. We study a family of model-based training approaches that leverage a learned nuisance variation model to generate realistic corruptions, as well as new hybrid strategies that combine random coverage with adversarial refinement in nuisance space. Using the Challenging Unreal and Real Environments for Traffic Sign Recognition dataset (CURE-TSR), with Snow and Rain corruptions, we evaluate accuracy, calibration, and training complexity across corruption severities. Our results show that model-based methods consistently outperform the Vanilla, Adversarial Training, and AugMix baselines, with model-based adversarial training providing the strongest robustness across all corruptions but at the expense of higher computation, and model-based data augmentation achieving comparable robustness with $T$ less computational complexity without incurring a statistically significant drop in performance. These findings highlight the importance of learned nuisance models for capturing natural variability, and suggest a promising path toward more resilient and calibrated models under challenging conditions.
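A minimal sketch of the hybrid idea the abstract describes: random coverage samples several corruptions, and adversarial refinement keeps and locally perturbs the one that maximizes the loss. The additive-noise "nuisance model" below is an invented toy stand-in (the paper's actual model is MUNIT-style), and all names and severity values are hypothetical.

```python
import random

def nuisance_model(x, severity):
    """Toy stand-in for a learned corruption model: additive
    noise scaled by a severity parameter."""
    return [v + severity * random.uniform(-1.0, 1.0) for v in x]

def hybrid_corrupt(x, loss_fn, n_random=4, n_refine=4):
    """Random coverage: sample several corrupted candidates.
    Adversarial refinement: keep the worst-case candidate and
    hill-climb on it with small perturbations."""
    candidates = [nuisance_model(x, random.choice([0.2, 0.5, 1.0]))
                  for _ in range(n_random)]
    worst = max(candidates, key=loss_fn)
    for _ in range(n_refine):
        cand = nuisance_model(worst, 0.1)  # small local perturbation
        if loss_fn(cand) > loss_fn(worst):
            worst = cand
    return worst

# Invented usage: a squared-magnitude loss on a 3-element "input".
x = [0.0, 0.0, 0.0]
loss = lambda v: sum(t * t for t in v)
corrupted = hybrid_corrupt(x, loss)
```

In training, `corrupted` would replace or augment the clean input for the next gradient step, trading extra forward passes for worst-case coverage of the nuisance space.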
Why are we recommending this paper?
Due to your Interest in Machine Learning Resilience

Interests not found

We did not find any papers that match the interests below. Try other terms, and also consider whether the content exists on arxiv.org.
  • Data Science Development Environment and Productivity
  • MLOps
You can edit or add more interests any time.