Hi!

Your personalized paper recommendations for 19 to 23 January, 2026.

Explainable AI to Improve Machine Learning Reliability for Industrial Cyber-Physical Systems

Saxion University of Applied Sciences

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

The approach has potential implications beyond the ML model itself, such as adjusting the monitoring subsystem or reducing data sampling overhead. (ML: 0.97)👍👎
The paper also discusses potential implications of XAI beyond the ML model itself, such as adjusting the monitoring subsystem or reducing data sampling overhead. (ML: 0.96)👍👎
Explainable Artificial Intelligence (XAI): A subfield of artificial intelligence that focuses on making AI models more transparent and interpretable. (ML: 0.95)👍👎
The paper demonstrates the effectiveness of XAI techniques in improving the reliability of machine learning models for industrial CPS. (ML: 0.93)👍👎
The paper presents an approach to improve the reliability of machine learning models for industrial cyber-physical systems (CPS) by using explainable artificial intelligence (XAI) techniques. (ML: 0.91)👍👎
Time-series data: A sequence of data points measured at regular time intervals, often used in industrial applications for monitoring and prediction. (ML: 0.86)👍👎
The approach is demonstrated on an experimental platform, and the results show improvements in prediction performance. (ML: 0.81)👍👎
The authors propose a custom, human-interpretable signal decomposition method called C-SHAP, which is used to analyze time-series data collected from industrial CPS machine operations. (ML: 0.73)👍👎
Cyber-Physical Systems (CPS): A system that integrates physical components with computational systems to monitor, control, or interact with the physical world. (ML: 0.73)👍👎
The custom signal decomposition method C-SHAP is shown to be a useful tool for analyzing time-series data. (ML: 0.64)👍👎

Abstract
Industrial Cyber-Physical Systems (CPS) are sensitive infrastructure from both safety and economics perspectives, making their reliability critically important. Machine Learning (ML), specifically deep learning, is increasingly integrated in industrial CPS, but the inherent complexity of ML models results in non-transparent operation. Rigorous evaluation is needed to prevent models from exhibiting unexpected behaviour on future, unseen data. Explainable AI (XAI) can be used to uncover model reasoning, allowing a more extensive analysis of behaviour. We apply XAI to to improve predictive performance of ML models intended for industrial CPS. We analyse the effects of components from time-series data decomposition on model predictions using SHAP values. Through this method, we observe evidence on the lack of sufficient contextual information during model training. By increasing the window size of data instances, informed by the XAI findings, we are able to improve model performance.

Why we are recommending this paper?
Due to your Interest in Machine Learning Lifecycle

This paper directly addresses MLOps concerns regarding reliability, a core interest for this user. It focuses on improving the trustworthiness of ML models in critical industrial environments, aligning with the need for fault tolerance and resilience.

An Empirical Study on Remote Code Execution in Machine Learning Model Hosting Ecosystems

University of Notre Dame

Rate paper: 👍 👎 ♥ Save

AI Insights

The study found that many detected issues correspond to exploitable attack vectors rather than stylistic or defensive programming concerns. (ML: 0.96)👍👎
The study highlights the importance of addressing remote code execution vulnerabilities in machine learning model hosting ecosystems. (ML: 0.94)👍👎
PyTorch Hub showed the highest relative impact with 38.46% of repositories containing at least one Semgrep-detected issue, primarily driven by CWE-502 and CWE-95. (ML: 0.90)👍👎
The findings suggest that many detected issues correspond to exploitable attack vectors, emphasizing the need for more robust security measures. (ML: 0.89)👍👎
CWE (Common Weakness Enumeration) is a classification of software security weaknesses. (ML: 0.84)👍👎
The study analyzed 36,697 repositories on Hugging Face Hub, 6,165 on OpenCSG, 3,193 on ModelScope, 16 on OpenMMLab, and 26 on PyTorch Hub. (ML: 0.82)👍👎
Semgrep is a tool for detecting common security vulnerabilities in code. (ML: 0.82)👍👎
The most common vulnerabilities found were CWE-502 (Deserialization of Untrusted Data) and CWE-95 (Eval Injection). (ML: 0.80)👍👎
CodeQL is a query language for analyzing the security of codebases. (ML: 0.78)👍👎
OWASP (Open Web Application Security Project) is a non-profit organization that provides resources and guidelines for secure coding practices. (ML: 0.76)👍👎

Abstract
Model-sharing platforms, such as Hugging Face, ModelScope, and OpenCSG, have become central to modern machine learning development, enabling developers to share, load, and fine-tune pre-trained models with minimal effort. However, the flexibility of these ecosystems introduces a critical security concern: the execution of untrusted code during model loading (i.e., via trust_remote_code or trust_repo). In this work, we conduct the first large-scale empirical study of custom model loading practices across five major model-sharing platforms to assess their prevalence, associated risks, and developer perceptions. We first quantify the frequency with which models require custom code to function and identify those that execute arbitrary Python files during loading. We then apply three complementary static analysis tools: Bandit, CodeQL, and Semgrep, to detect security smells and potential vulnerabilities, categorizing our findings by CWE identifiers to provide a standardized risk taxonomy. We also use YARA to identify malicious patterns and payload signatures. In parallel, we systematically analyze the documentation, API design, and safety mechanisms of each platform to understand their mitigation strategies and enforcement levels. Finally, we conduct a qualitative analysis of over 600 developer discussions from GitHub, Hugging Face, and PyTorch Hub forums, as well as Stack Overflow, to capture community concerns and misconceptions regarding security and usability. Our findings reveal widespread reliance on unsafe defaults, uneven security enforcement across platforms, and persistent confusion among developers about the implications of executing remote code. We conclude with actionable recommendations for designing safer model-sharing infrastructures and striking a balance between usability and security in future AI ecosystems.

Why we are recommending this paper?
Due to your Interest in Machine Learning Lifecycle

Given the user's interest in ML infrastructure and operations, this paper’s investigation into security vulnerabilities within model hosting ecosystems is highly relevant. Understanding potential risks in online inference is crucial for robust MLOps practices.

Multi-Location Software Model Completion

Saarland University

Rate paper: 👍 👎 ♥ Save

AI Insights

The approach relies on pre-trained embeddings, which may not generalize well to different domains or models. (ML: 0.97)👍👎
Multi-location change: A change that affects multiple elements in a software model. (ML: 0.94)👍👎
Graph neural networks have been successfully applied in various software engineering tasks, including code analysis and recommendation systems. (ML: 0.93)👍👎
Previous work has focused on single-location model completion, but NextFocus addresses the more complex and realistic scenario of multi-location changes. (ML: 0.92)👍👎
The paper proposes NextFocus, a novel approach for multi-location model completion that uses graph neural networks and few-shot learning to predict the next focus node in a software model. (ML: 0.92)👍👎
The paper presents a novel approach for multi-location model completion called NextFocus, which uses graph neural networks and few-shot learning to predict the next focus node in a software model. (ML: 0.91)👍👎
The approach generalizes well to a cross-project setting and provides significant performance gains over state-of-the-art single-location model completion approaches. (ML: 0.90)👍👎
Node-ranking problem: Given a recently changed (anchor) node, the model ranks other nodes based on the probability of changing with this anchor node. (ML: 0.90)👍👎
NextFocus outperforms three baselines and achieves high precision for predicting next focus nodes, even when changes are localized or spread across the model. (ML: 0.89)👍👎
The paper assumes that the anchor node is given, which may not be the case in real-world scenarios. (ML: 0.72)👍👎

Abstract
In model-driven engineering and beyond, software models are key development artifacts. In practice, they often grow to substantial size and complexity, undergoing thousands of modifications over time due to evolution, refactoring, and maintenance. The rise of AI has sparked interest in how software modeling activities can be automated. Recently, LLM-based approaches for software model completion have been proposed, however, the state of the art supports only single-location model completion by predicting changes at a specific location. Going beyond, we aim to bridge the gap toward handling coordinated changes that span multiple locations across large, complex models. Specifically, we propose a novel global embedding-based next focus predictor, NextFocus, which is capable of multi-location model completion for the first time. The predictor consists of a neural network with an attention mechanism that is trained on historical software model evolution data. Starting from an existing change, it predicts further model elements to change, potentially spanning multiple parts of the model. We evaluate our approach on multi-location model changes that have actually been performed by developers in real-world projects. NextFocus achieves promising results for multi-location model completion, even when changes are heavily spread across the model. It achieves an average Precision@k score of 0.98 for $k \leq 10$, significantly outperforming the three baseline approaches.

Why we are recommending this paper?
Due to your Interest in Model Monitoring

This paper's focus on software model evolution and maintenance aligns with the user's interest in Data Science Development Tools and MLOps. It addresses the practical challenges of managing complex models over time, a key aspect of the ML lifecycle.

Beyond validation loss: Clinically-tailored optimization metrics improve a model's clinical performance

University of Washington

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

The provided document presents two valuable techniques for evaluating machine learning models in scenarios where high specificity is crucial. (ML: 0.98)👍👎
z-scale alignment: A method for normalizing model output scores across different folds to enable direct comparison. (ML: 0.97)👍👎
The provided text is a technical document discussing methods for evaluating machine learning models, specifically in the context of k-fold cross-validation. (ML: 0.97)👍👎
Overall, this document contributes to the development of more robust and accurate methods for evaluating machine learning models, particularly in situations where model performance needs to be evaluated at specific thresholds. (ML: 0.96)👍👎
n% sliver AUC calculation: A technique for computing the area under the ROC curve within a specified specificity range. (ML: 0.96)👍👎
This allows for the evaluation of model performance at specific thresholds, which can be useful in scenarios where high specificity is crucial. (ML: 0.95)👍👎
Z-scale alignment involves normalizing model output scores across different folds to enable direct comparison. (ML: 0.94)👍👎
k-fold cross-validation: A technique used to evaluate the performance of a machine learning model by splitting the available data into k subsets (or folds), training the model on k-1 folds, and testing it on the remaining fold. (ML: 0.94)👍👎
The document also provides Python function definitions for implementing these techniques, including functions for calculating two-sided standard deviations and z-scaling scores. (ML: 0.92)👍👎
This is achieved by calculating two standard deviations (right- and left-handed) for each split, then applying these parameters to map the scores of each split onto a common scale. (ML: 0.91)👍👎
This process is repeated k times, with each fold serving as the test set once. (ML: 0.89)👍👎
The n% sliver AUC calculation method computes the area under the ROC curve within a specified specificity range. (ML: 0.89)👍👎
The document introduces two main techniques: z-scale alignment and n% sliver AUC calculation. (ML: 0.81)👍👎

Abstract
A key task in ML is to optimize models at various stages, e.g. by choosing hyperparameters or picking a stopping point. A traditional ML approach is to use validation loss, i.e. to apply the training loss function on a validation set to guide these optimizations. However, ML for healthcare has a distinct goal from traditional ML: Models must perform well relative to specific clinical requirements, vs. relative to the loss function used for training. These clinical requirements can be captured more precisely by tailored metrics. Since many optimization tasks do not require the driving metric to be differentiable, they allow a wider range of options, including the use of metrics tailored to be clinically-relevant. In this paper we describe two controlled experiments which show how the use of clinically-tailored metrics provide superior model optimization compared to validation loss, in the sense of better performance on the clinical task. The use of clinically-relevant metrics for optimization entails some extra effort, to define the metrics and to code them into the pipeline. But it can yield models that better meet the central goal of ML for healthcare: strong performance in the clinic.

Why we are recommending this paper?
Due to your Interest in Machine Learning Validation

The paper’s exploration of model optimization metrics, particularly in the context of clinical applications, directly relates to the user's interest in Model Validation and Machine Learning Validation. It highlights the importance of going beyond standard validation loss to achieve better clinical outcomes.

Credible CO2 Comparisons: A Machine Learning Approach to Vehicle Powertrain Assessment

National Institute of Metrology, Quality and Technology

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

Plots at the scale of individual events confirm that the temporal structure is reproduced with small residuals, while larger discrepancies are limited to rare periods of rapid change. (ML: 0.94)👍👎
Proxy validation shows that replacing measured actuation with EV Feature predictions does not materially change accuracy (median difference ≈ −0.0025 g/s, with proxy MAE no larger than the direct MAE in most trips). (ML: 0.89)👍👎
Model-neutral: A framework that is not specific to any particular model or algorithm, allowing for flexibility and adaptability. (ML: 0.88)👍👎
The method keeps the measured speed profile and shared environment fixed while learning domain-specific mappings for actuation and emissions, aligning both sides on a common instantaneous metric (g/s) and isolating technology effects from confounders. (ML: 0.85)👍👎
The paper introduces a model-neutral counterfactual framework for comparing operational CO2 emissions from internal combustion (ICEV) and electric (EV) powertrains under the same observed driving context. (ML: 0.85)👍👎
Internal combustion engine (ICEV): A type of powertrain that uses fossil fuels to generate energy, producing emissions such as CO2, NOx, and particulate matter. (ML: 0.84)👍👎
Counterfactual: A hypothetical scenario that contrasts with the actual outcome, used to isolate the effect of a particular variable or intervention. (ML: 0.82)👍👎
The framework supports credible comparisons using data from a single instrumented vehicle, enabling pointwise and trip-level gap estimates without simultaneous instrumentation or repeated routes. (ML: 0.76)👍👎
Empirically, the approach is stable and accurate, with ICEV models converging without overfitting and the EV Emissions model attaining low error (MAE ≈0.028 g/s). (ML: 0.76)👍👎
Operational CO2 emissions: The amount of carbon dioxide emitted by a vehicle during its operation, typically measured in grams per second (g/s). (ML: 0.76)👍👎

Abstract
Decarbonizing road transport requires consistent and transparent methods for comparing CO2 emissions across vehicle technologies. This paper proposes a machine learning-based framework for like-for-like operational assessment of internal combustion engine vehicles (ICEVs) and electric vehicles (EVs) under identical, real-world driving conditions. The approach isolates technology-specific effects by holding the observed speed profile and environmental context fixed, enabling direct comparison of powertrain performance. Recurrent neural network models are trained independently for each domain to learn the mapping from contextual driving variables (speed, acceleration, temperature) to internal actuation variables (torque, throttle) and instantaneous CO2-equivalent emission rates. This structure allows the construction of counterfactual scenarios that answer: What emissions would an EV have generated if it had followed the same driving profile as an ICEV? By aligning both vehicle types on a unified instantaneous emissions metric, the framework enables fair and reproducible evaluation of powertrain technologies. It offers a scalable foundation for credible, data-driven assessments of vehicle carbon performance under real-world operating conditions.

Why we are recommending this paper?
Due to your Interest in Machine Learning Validation

This paper’s application of machine learning to vehicle powertrain assessment, specifically focusing on CO2 emissions, is a strong match for the user’s interests in MLOps and Data Science Development Tools. It addresses a real-world problem with a data-driven solution.

ReSearch: A Multi-Stage Machine Learning Framework for Earth Science Data Discovery

University of Maryland

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

The framework also extracts dataset references, keywords, and research objectives from academic papers to construct an evaluation benchmark. (ML: 0.95)👍👎
LLM: Large Language Model Type A: Specific data requests (ML: 0.93)👍👎
The ReSearch framework is designed for Earth science data search and retrieval. (ML: 0.92)👍👎
The ReSearch framework is designed to help users find relevant Earth science datasets for their research goals. (ML: 0.91)👍👎
These prompts are used to classify user queries into Type A (specific data requests) or Type B (broad research goals). (ML: 0.91)👍👎
For Type B queries, the ReSearch framework converts high-level research goals into data-oriented search queries. (ML: 0.90)👍👎
It utilizes Large Language Models (LLMs) for intent classification, query rewriting, and paper information extraction. (ML: 0.89)👍👎
The framework includes several components: intent classification prompt, query rewrite prompt, and paper information extraction prompt. (ML: 0.89)👍👎
The evaluation benchmark includes a list of datasets used in each paper, along with their version numbers, temporal ranges, spatial resolution, and other relevant information. (ML: 0.73)👍👎

Abstract
The rapid expansion of Earth Science data from satellite observations, reanalysis products, and numerical simulations has created a critical bottleneck in scientific discovery, namely identifying relevant datasets for a given research objective. Existing discovery systems are primarily retrieval-centric and struggle to bridge the gap between high-level scientific intent and heterogeneous metadata at scale. We introduce \textbf{ReSearch}, a multi-stage, reasoning-enhanced search framework that formulates Earth Science data discovery as an iterative process of intent interpretation, high-recall retrieval, and context-aware ranking. ReSearch integrates lexical search, semantic embeddings, abbreviation expansion, and large language model reranking within a unified architecture that explicitly separates recall and precision objectives. To enable realistic evaluation, we construct a literature-grounded benchmark by aligning natural language intent with datasets cited in peer-reviewed Earth Science studies. Experiments demonstrate that ReSearch consistently improves recall and ranking performance over baseline methods, particularly for task-based queries expressing abstract scientific goals. These results underscore the importance of intent-aware, multi-stage search as a foundational capability for reproducible and scalable Earth Science research.

Why we are recommending this paper?
Due to your Interest in Machine Learning Deployment

Statistical Learning Theory for Distributional Classification

Technical University of Munich TUM

Rate paper: 👍 👎 ♥ Save

AI Insights

The overall goal of the section on oracle inequalities is to provide a theoretical framework for understanding the performance of learning methods in two-stage sampling. (ML: 0.96)👍👎
The text also discusses the use of comparison functions to derive abstract, quantitative forms of continuity and smoothness assumptions in machine learning. (ML: 0.96)👍👎
The text discusses the use of comparison functions in the theory of machine learning, which is a formalism used in control theory. (ML: 0.95)👍👎
Oracle inequality: An upper bound on the expected risk of a learning method, given the true risk of an optimal solution. (ML: 0.94)👍👎
The problem statement and the provided text are related to machine learning, specifically two-stage sampling and oracle inequalities. (ML: 0.93)👍👎
Two-stage sampling: A method for sampling data where the first stage involves selecting a subset of the population, and the second stage involves drawing a sample from this subset. (ML: 0.92)👍👎
The central ingredient for the foundational oracle inequality is Assumption 20, which provides bounds on the loss function and the Bayes risk. (ML: 0.90)👍👎
The text introduces a class of learning methods called ϵ-approximate clipped regularized empirical risk minimization (ϵ-CR-ERM) and states an oracle inequality for this class. (ML: 0.90)👍👎
The goal of the section on oracle inequalities is to prove Theorem 9, a two-stage sampling variant of (Steinwart and Christmann 2008, Theorem 7.22). (ML: 0.84)👍👎
Comparison functions: A formalism used in control theory to model certain behaviors of bounds. (ML: 0.82)👍👎

Abstract
In supervised learning with distributional inputs in the two-stage sampling setup, relevant to applications like learning-based medical screening or causal learning, the inputs (which are probability distributions) are not accessible in the learning phase, but only samples thereof. This problem is particularly amenable to kernel-based learning methods, where the distributions or samples are first embedded into a Hilbert space, often using kernel mean embeddings (KMEs), and then a standard kernel method like Support Vector Machines (SVMs) is applied, using a kernel defined on the embedding Hilbert space. In this work, we contribute to the theoretical analysis of this latter approach, with a particular focus on classification with distributional inputs using SVMs. We establish a new oracle inequality and derive consistency and learning rate results. Furthermore, for SVMs using the hinge loss and Gaussian kernels, we formulate a novel variant of an established noise assumption from the binary classification literature, under which we can establish learning rates. Finally, some of our technical tools like a new feature space for Gaussian kernels on Hilbert spaces are of independent interest.

Why we are recommending this paper?
Due to your Interest in Machine Learning Operations

Early predicting of hospital admission using machine learning algorithms: Priority queues approach

University of Tasmania

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

The study highlights the importance of feature engineering and selection in improving model accuracy, with features such as day of the week, hour of the day, and weather conditions showing significant impact on ED admissions. (ML: 0.99)👍👎
The study presents a comprehensive analysis of various machine learning models and techniques applied to predict emergency department (ED) admissions. (ML: 0.98)👍👎
R-squared: A measure of the goodness of fit of a model, indicating how well the model explains the variation in the dependent variable. (ML: 0.97)👍👎
The authors evaluate the performance of XGBoost, Long Short-Term Memory (LSTM), and other models using metrics such as mean absolute error (MAE) and R-squared. (ML: 0.97)👍👎
Feature engineering and selection play a crucial role in improving model accuracy, with the inclusion of weather conditions and day of the week features significantly enhancing performance. (ML: 0.96)👍👎
The study demonstrates that XGBoost outperforms other models in predicting ED admissions, with an MAE of 12.5 and R-squared of 0.85. (ML: 0.96)👍👎
The study highlights the importance of model interpretability and provides insights into feature contributions using SHAP values. (ML: 0.96)👍👎
LSTM: Long Short-Term Memory, a type of recurrent neural network (RNN) designed to handle sequential data with long-term dependencies. (ML: 0.91)👍👎
XGBoost: Extreme Gradient Boosting, a popular machine learning algorithm for classification and regression tasks. (ML: 0.89)👍👎
MAE: Mean Absolute Error, a metric used to evaluate the accuracy of predictions by calculating the average difference between predicted and actual values. (ML: 0.89)👍👎

Abstract
Emergency Department overcrowding is a critical issue that compromises patient safety and operational efficiency, necessitating accurate demand forecasting for effective resource allocation. This study evaluates and compares three distinct predictive models: Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors (SARIMAX), EXtreme Gradient Boosting (XGBoost) and Long Short-Term Memory (LSTM) networks for forecasting daily ED arrivals over a seven-day horizon. Utilizing data from an Australian tertiary referral hospital spanning January 2017 to December 2021, this research distinguishes itself by decomposing demand into eight specific ward categories and stratifying patients by clinical complexity. To address data distortions caused by the COVID-19 pandemic, the study employs the Prophet model to generate synthetic counterfactual values for the anomalous period. Experimental results demonstrate that all three proposed models consistently outperform a seasonal naive baseline. XGBoost demonstrated the highest accuracy for predicting total daily admissions with a Mean Absolute Error of 6.63, while the statistical SARIMAX model proved marginally superior for forecasting major complexity cases with an MAE of 3.77. The study concludes that while these techniques successfully reproduce regular day-to-day patterns, they share a common limitation in underestimating sudden, infrequent surges in patient volume.

Why we are recommending this paper?
Due to your Interest in Machine Learning Operations

DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

Hong Kong Polytechnic University

Rate paper: 👍 👎 ♥ Save

AI Insights

The task granularity is flexible, and every reasoning chain must start from the raw data or a logically prior step. (ML: 0.97)👍👎
The instructions may be too complex or detailed for some users, potentially leading to confusion. (ML: 0.95)👍👎
The provided Jupyter Notebook content is a template for generating data science questions based on an answered notebook. (ML: 0.95)👍👎
QRA: Question-Reasoning-Answer triplet JSON: JavaScript Object Notation Generating high-quality data science questions based on an answered notebook requires careful analysis and adherence to specific guidelines. (ML: 0.94)👍👎
The output format requires a valid JSON object with specific keys such as 'data_type', 'domain', 'task_type', 'language', 'question', 'reasoning', 'answer', 'best_score (Optional)', and 'confidence'. (ML: 0.89)👍👎
The final output must be a valid JSON object with the specified structure. (ML: 0.82)👍👎
The instructions provide detailed guidelines for generating QRA triplets, including the importance of not mentioning the notebook and ensuring diversity across task types. (ML: 0.79)👍👎
The output format must conform to a valid JSON object with specified keys, ensuring that the generated QRA triplets are accurate and comprehensive. (ML: 0.77)👍👎

Abstract
Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a significant challenge for evaluation. To address this, we introduce DSAEval, a benchmark comprising 641 real-world data science problems grounded in 285 diverse datasets, covering both structured and unstructured data (e.g., vision and text). DSAEval incorporates three distinctive features: (1) Multimodal Environment Perception, which enables agents to interpret observations from multiple modalities including text and vision; (2) Multi-Query Interactions, which mirror the iterative and cumulative nature of real-world data science projects; and (3) Multi-Dimensional Evaluation, which provides a holistic assessment across reasoning, code, and results. We systematically evaluate 11 advanced agentic LLMs using DSAEval. Our results show that Claude-Sonnet-4.5 achieves the strongest overall performance, GPT-5.2 is the most efficient, and MiMo-V2-Flash is the most cost-effective. We further demonstrate that multimodal perception consistently improves performance on vision-related tasks, with gains ranging from 2.04% to 11.30%. Overall, while current data science agents perform well on structured data and routine data anlysis workflows, substantial challenges remain in unstructured domains. Finally, we offer critical insights and outline future research directions to advance the development of data science agents.

Why we are recommending this paper?
Due to your Interest in Data Science Development Tools

SWE-Tester: Training Open-Source LLMs for Issue Reproduction in Real-World Repositories

EDCMU

Rate paper: 👍 👎 ♥ Save

AI Insights

The text does not provide a comprehensive overview of the current state of research on using LLMs for software engineering tasks. (ML: 0.95)👍👎
Some of the key findings from these papers include the use of fine-tuning techniques to adapt pre-trained LLMs to specific software engineering tasks, the development of new architectures and models for software engineering applications, and the evaluation of the effectiveness of LLM-based approaches in various scenarios. (ML: 0.95)👍👎
LLM: Large Language Model Software Engineering: The application of engineering principles to design, develop, test, and maintain software systems. (ML: 0.94)👍👎
Further research is needed to fully explore the potential of LLM-based approaches in software engineering. (ML: 0.94)👍👎
The provided text is a collection of research papers and technical reports related to software engineering, specifically on the topic of using large language models (LLMs) for issue reproduction and test generation. (ML: 0.93)👍👎
The text includes several research papers that present various approaches to using LLMs for software engineering tasks such as bug reproduction, test generation, and code completion. (ML: 0.91)👍👎
The use of LLMs has shown promise in various software engineering tasks, including issue reproduction and test generation. (ML: 0.90)👍👎
Test Generation: The process of automatically generating test cases for a software system. (ML: 0.89)👍👎
Issue Reproduction: The process of creating a test case that reproduces a specific issue or bug in a software system. (ML: 0.85)👍👎

Abstract
Software testing is crucial for ensuring the correctness and reliability of software systems. Automated generation of issue reproduction tests from natural language issue descriptions enhances developer productivity by simplifying root cause analysis, promotes test-driven development -- "test first, write code later", and can be used for improving the effectiveness of automated issue resolution systems like coding agents. Existing methods proposed for this task predominantly rely on closed-source LLMs, with limited exploration of open models. To address this, we propose SWE-Tester -- a novel pipeline for training open-source LLMs to generate issue reproduction tests. First, we curate a high-quality training dataset of 41K instances from 2.6K open-source GitHub repositories and use it to train LLMs of varying sizes and families. The fine-tuned models achieve absolute improvements of up to 10\% in success rate and 21\% in change coverage on SWT-Bench Verified. Further analysis shows consistent improvements with increased inference-time compute, more data, and larger models. These results highlight the effectiveness of our framework for advancing open-source LLMs in this domain.

Why we are recommending this paper?
Due to your Interest in Data Science Development Tools

Tabular Incremental Inference

Fudan University

Rate paper: 👍 👎 ♥ Save

AI Insights

Further research is needed to explore the potential applications and limitations of this approach. (ML: 0.96)👍👎
Information bottleneck principle: A concept in information theory that describes how to compress information while preserving its essential features. (ML: 0.95)👍👎
The paper assumes that the labeled dataset is available for training, which may not always be the case. (ML: 0.95)👍👎
The paper proposes a new method for handling tabular data with large language models (LLMs). (ML: 0.94)👍👎
Large Language Models (LLMs): Neural networks that can process and generate human-like text, often used for natural language processing tasks. (ML: 0.94)👍👎
The use of self-generated tasks from unlabeled tables can improve the performance of LLMs on tabular data. (ML: 0.93)👍👎
The authors introduce the concept of 'information bottleneck' and its application in deep learning. (ML: 0.92)👍👎
Tabular data: Data stored in tables with rows and columns, often used for statistical analysis or machine learning. (ML: 0.91)👍👎
They propose a framework called Tabllm, which uses self-generated tasks from unlabeled tables to improve few-shot classification performance. (ML: 0.86)👍👎
The proposed method, Tabllm, shows promising results in few-shot classification tasks on tabular data. (ML: 0.84)👍👎

Abstract
Tabular data is a fundamental form of data structure. The evolution of table analysis tools reflects humanity's continuous progress in data acquisition, management, and processing. The dynamic changes in table columns arise from technological advancements, changing needs, data integration, etc. However, the standard process of training AI models on tables with fixed columns and then performing inference is not suitable for handling dynamically changed tables. Therefore, new methods are needed for efficiently handling such tables in an unsupervised manner. In this paper, we introduce a new task, Tabular Incremental Inference (TabII), which aims to enable trained models to incorporate new columns during the inference stage, enhancing the practicality of AI models in scenarios where tables are dynamically changed. Furthermore, we demonstrate that this new task can be framed as an optimization problem based on the information bottleneck theory, which emphasizes that the key to an ideal tabular incremental inference approach lies in minimizing mutual information between tabular data and representation while maximizing between representation and task labels. Under this guidance, we design a TabII method with Large Language Model placeholders and Pretrained TabAdapter to provide external knowledge and Incremental Sample Condensation blocks to condense the task-relevant information given by incremental column attributes. Experimental results across eight public datasets show that TabII effectively utilizes incremental attributes, achieving state-of-the-art performance.

Why we are recommending this paper?
Due to your Interest in Online inference

CLASP: An online learning algorithm for Convex Losses And Squared Penalties

NOVA School of Science and Technology

Rate paper: 👍 👎 ♥ Save

AI Insights

Online Linear Regression: A problem in which the goal is to predict the output of a linear regression model based on input data. (ML: 0.95)👍👎
(2017), who proposed a framework for solving online linear regression problems with constraints on the decision variable. (ML: 0.94)👍👎
The algorithm then updates its decision variable by minimizing the loss function while satisfying the revealed constraints. (ML: 0.90)👍👎
The algorithm then updates its decision variable by minimizing the loss function while satisfying the revealed constraints. (ML: 0.90)👍👎
Imagine you're trying to solve an optimization problem, but there are some constraints that need to be satisfied at each iteration. (ML: 0.88)👍👎
The Switch algorithm cannot be applied when the feasibility property is not satisfied. (ML: 0.87)👍👎
The COCO framework is like a tool that helps you find the best solution by revealing new constraint functions at each step and updating your decision variable accordingly. (ML: 0.82)👍👎
The framework is based on the concept of revealed constraints, where at each iteration, a new constraint function is revealed to the algorithm. (ML: 0.82)👍👎
The framework is based on the concept of revealed constraints, where at each iteration, a new constraint function is revealed to the algorithm. (ML: 0.82)👍👎
The authors mention several related works on online convex optimization (OCO) and constrained optimization problems, including the work of Hazan et al. (ML: 0.81)👍👎
The paper presents a new online convex optimization (OCO) framework called COCO, which allows for efficient and accurate solution of constrained optimization problems. (ML: 0.70)👍👎
The paper presents a new online convex optimization (OCO) framework called COCO, which allows for efficient and accurate solution of constrained optimization problems. (ML: 0.70)👍👎
COCO (Constrained Online Convex Optimization): A framework for solving constrained optimization problems online, where at each iteration, a new constraint function is revealed to the algorithm. (ML: 0.70)👍👎
(2016), who proposed a framework for solving constrained OCO problems using a projected gradient descent method. (ML: 0.66)👍👎
The authors also mention the work of Cuturi et al. (ML: 0.61)👍👎
The paper presents a new OCO framework called COCO, which allows for efficient and accurate solution of constrained optimization problems. (ML: 0.61)👍👎

Abstract
We study Constrained Online Convex Optimization (COCO), where a learner chooses actions iteratively, observes both unanticipated convex loss and convex constraint, and accumulates loss while incurring penalties for constraint violations. We introduce CLASP (Convex Losses And Squared Penalties), an algorithm that minimizes cumulative loss together with squared constraint violations. Our analysis departs from prior work by fully leveraging the firm non-expansiveness of convex projectors, a proof strategy not previously applied in this setting. For convex losses, CLASP achieves regret $O\left(T^{\max\{β,1-β\}}\right)$ and cumulative squared penalty $O\left(T^{1-β}\right)$ for any $β\in (0,1)$. Most importantly, for strongly convex problems, CLASP provides the first logarithmic guarantees on both regret and cumulative squared penalty. In the strongly convex case, the regret is upper bounded by $O( \log T )$ and the cumulative squared penalty is also upper bounded by $O( \log T )$.

Why we are recommending this paper?
Due to your Interest in Online inference

Unified Multi-Dataset Training for TBPS

Indian Institute of Information Technology IIIT Delhi

Rate paper: 👍 👎 ♥ Save

AI Insights

Recent works have shown that combining vision and language models can achieve state-of-the-art performance in TPR tasks. (ML: 0.93)👍👎
The authors do not provide a detailed comparison with existing methods in terms of computational cost. (ML: 0.93)👍👎
The proposed method uses a filtering-wora paradigm for efficient text-based person search, which includes three stages: filtering, scoring, and ranking. (ML: 0.90)👍👎
TPR: Text-based person re-identification WORA: Weighted Ordinal Ranking Algorithm The proposed approach achieves significant improvements over existing methods, demonstrating the effectiveness of combining vision and language models for TPR. (ML: 0.90)👍👎
The filtering-wora paradigm is a promising direction for efficient text-based person search, offering a trade-off between accuracy and computational cost. (ML: 0.90)👍👎
The paper proposes a novel approach to text-based person re-identification (TPR) that combines the strengths of both vision and language models. (ML: 0.85)👍👎
The authors evaluate their approach on several benchmarks, including the Ultra-Fine Granularity Benchmark (UFGB), and achieve state-of-the-art performance in terms of accuracy and efficiency. (ML: 0.81)👍👎
The evaluation on the Ultra-Fine Granularity Benchmark (UFGB) is limited to a single experiment, and more comprehensive evaluations are needed. (ML: 0.80)👍👎

Abstract
Text-Based Person Search (TBPS) has seen significant progress with vision-language models (VLMs), yet it remains constrained by limited training data and the fact that VLMs are not inherently pre-trained for pedestrian-centric recognition. Existing TBPS methods therefore rely on dataset-centric fine-tuning to handle distribution shift, resulting in multiple independently trained models for different datasets. While synthetic data can increase the scale needed to fine-tune VLMs, it does not eliminate dataset-specific adaptation. This motivates a fundamental question: can we train a single unified TBPS model across multiple datasets? We show that naive joint training over all datasets remains sub-optimal because current training paradigms do not scale to a large number of unique person identities and are vulnerable to noisy image-text pairs. To address these challenges, we propose Scale-TBPS with two contributions: (i) a noise-aware unified dataset curation strategy that cohesively merges diverse TBPS datasets; and (ii) a scalable discriminative identity learning framework that remains effective under a large number of unique identities. Extensive experiments on CUHK-PEDES, ICFG-PEDES, RSTPReid, IIITD-20K, and UFine6926 demonstrate that a single Scale-TBPS model outperforms dataset-centric optimized models and naive joint training.

Why we are recommending this paper?
Due to your Interest in Machine Learning Infrastructure

Break-Resilient Codes with Loss Tolerance

Washington University in St Louis

Rate paper: 👍 👎 ♥ Save

AI Insights

Property (I): The probability that any two length-(clogm)substrings zi and zj of z are equal is at most m2/2clogm = m−c+2. (ML: 0.89)👍👎
Property (II): The probability that an interval of length clogm of z matches one of the markers is (L+1)m−c. (ML: 0.88)👍👎
Valid binary string: A binary string z that satisfies two properties: Property (I) and Property (II). (ML: 0.86)👍👎
The authors provide a formal proof that the success probability of choosing a valid binary string z is at least 1−1/poly(m). (ML: 0.80)👍👎
The paper also discusses related work on coding over sets for DNA storage, torn-paper coding, and recovering a message from an incomplete set of noisy fragments. (ML: 0.78)👍👎
Break-resilient code (BRC): A code scheme that can recover a message from an incomplete set of noisy fragments. (ML: 0.72)👍👎
The scheme has redundancy of O(((2·M+1)·L)=O(tlog2n+slogn). (ML: 0.69)👍👎
The scheme has redundancy of O(((2·M+1)·L)=O(tlog2n+slogn). (ML: 0.69)👍👎
The (t,s)-BRC code scheme can recover a message from an incomplete set of noisy fragments with high probability. (ML: 0.65)👍👎
The paper presents a break-resilient code scheme called the (t,s)-BRC code, which can recover a message from an incomplete set of noisy fragments. (ML: 0.64)👍👎

Abstract
Emerging applications in manufacturing, wireless communication, and molecular data storage require robust coding schemes that remain effective under physical distortions where codewords may be arbitrarily fragmented and partially missing. To address such challenges, we propose a new family of error-correcting codes, termed $(t,s)$-break-resilient codes ($(t,s)$-BRCs). A $(t,s)$-BRC guarantees correct decoding of the original message even after up to~$t$ arbitrary breaks of the codeword and the complete loss of some fragments whose total length is at most~$s$. This model unifies and generalizes previous approaches, extending break-resilient codes (which handle arbitrary fragmentation without fragment loss) and deletion codes (which correct bit losses in unknown positions without fragmentation) into a single information-theoretic framework. We develop a theoretical foundation for $(t,s)$-BRCs, including a formal adversarial channel model, lower bounds on the necessary redundancy, and explicit code constructions that approach these bounds.

Why we are recommending this paper?
Due to your Interest in Fault tolerance

Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs

Anthropic

Rate paper: 👍 👎 ♥ Save

AI Insights

The paper discusses the potential misuse of large language models (LLMs) by adversaries, highlighting the need for robust security measures. (ML: 0.93)👍👎
LLM: Large Language Model Adversary: An individual or entity attempting to exploit or manipulate a system for malicious purposes Misuse: The unauthorized use of a system or technology for unintended or harmful purposes The study highlights the need for robust security measures to prevent the misuse of LLMs. (ML: 0.93)👍👎
The study may not provide a comprehensive solution to preventing LLM misuse, as it focuses on specific methods for defending against adversarial attacks. (ML: 0.92)👍👎
The study emphasizes the importance of developing more secure and transparent AI systems that can prevent misuse and ensure safety. (ML: 0.87)👍👎
Developing more secure and transparent AI systems is crucial to ensure safety and prevent potential harm. (ML: 0.85)👍👎
Researchers have proposed various methods to defend against adversarial attacks on LLMs, including model ensemble adversarial attack and prompt decomposition and reconstruction. (ML: 0.82)👍👎
The development of more secure and transparent AI systems is a complex task that requires significant resources and expertise. (ML: 0.79)👍👎

Abstract
Model developers implement safeguards in frontier models to prevent misuse, for example, by employing classifiers to filter dangerous outputs. In this work, we demonstrate that even robustly safeguarded models can be used to elicit harmful capabilities in open-source models through elicitation attacks. Our elicitation attacks consist of three stages: (i) constructing prompts in adjacent domains to a target harmful task that do not request dangerous information; (ii) obtaining responses to these prompts from safeguarded frontier models; (iii) fine-tuning open-source models on these prompt-output pairs. Since the requested prompts cannot be used to directly cause harm, they are not refused by frontier model safeguards. We evaluate these elicitation attacks within the domain of hazardous chemical synthesis and processing, and demonstrate that our attacks recover approximately 40% of the capability gap between the base open-source model and an unrestricted frontier model. We then show that the efficacy of elicitation attacks scales with the capability of the frontier model and the amount of generated fine-tuning data. Our work demonstrates the challenge of mitigating ecosystem level risks with output-level safeguards.

Why we are recommending this paper?
Due to your Interest in Fault tolerance

Feature-Aware Test Generation for Deep Learning Models

Technical University of Munich

Rate paper: 👍 👎 ♥ Save

AI Insights

VLM (Vision-Language Model): A type of model that integrates computer vision and natural language processing capabilities, used for tasks such as image captioning, visual question answering, and image generation. (ML: 0.97)👍👎
The study relies on a specific set of VLMs for the experiments, which may not generalize to other models or tasks. (ML: 0.97)👍👎
VLMs have shown promise in various tasks such as image captioning, visual question answering, and image generation. (ML: 0.94)👍👎
The study does not provide a comprehensive evaluation of the approach's robustness and scalability. (ML: 0.92)👍👎
Spurious feature detection is an important aspect of deep learning model evaluation. (ML: 0.92)👍👎
The study demonstrates the effectiveness of Feature-Aware Test Generation in generating targeted and visually precise manipulations, outperforming existing methods in terms of runtime, e𝑑2-Image distance, and MS-SSIM. (ML: 0.92)👍👎
The approach requires significant computational resources and expertise in deep learning. (ML: 0.90)👍👎
Existing methods for test generation often rely on either feature-based or behavior-driven testing, but not both. (ML: 0.88)👍👎
Feature-Aware Test Generation: A novel approach that combines the strengths of feature-based and behavior-driven testing to generate targeted and visually precise manipulations. (ML: 0.87)👍👎
The study proposes a novel approach for test generation, called Feature-Aware Test Generation, which leverages the strengths of both feature-based and behavior-driven testing. (ML: 0.82)👍👎
The approach also shows promise in spurious feature detection, identifying more task-relevant influential inputs and channels compared to a fully fine-tuned ResNet50 and a frozen-backbone SWAG ViT. (ML: 0.79)👍👎

Abstract
As deep learning models are widely used in software systems, test generation plays a crucial role in assessing the quality of such models before deployment. To date, the most advanced test generators rely on generative AI to synthesize inputs; however, these approaches remain limited in providing semantic insight into the causes of misbehaviours and in offering fine-grained semantic controllability over the generated inputs. In this paper, we introduce Detect, a feature-aware test generation framework for vision-based deep learning (DL) models that systematically generates inputs by perturbing disentangled semantic attributes within the latent space. Detect perturbs individual latent features in a controlled way and observes how these changes affect the model's output. Through this process, it identifies which features lead to behavior shifts and uses a vision-language model for semantic attribution. By distinguishing between task-relevant and irrelevant features, Detect applies feature-aware perturbations targeted for both generalization and robustness. Empirical results across image classification and detection tasks show that Detect generates high-quality test cases with fine-grained control, reveals distinct shortcut behaviors across model architectures (convolutional and transformer-based), and bugs that are not captured by accuracy metrics. Specifically, Detect outperforms a state-of-the-art test generator in decision boundary discovery and a leading spurious feature localization method in identifying robustness failures. Our findings show that fully fine-tuned convolutional models are prone to overfitting on localized cues, such as co-occurring visual traits, while weakly supervised transformers tend to rely on global features, such as environmental variances. These findings highlight the value of interpretable and feature-aware testing in improving DL model reliability.

Why we are recommending this paper?
Due to your Interest in Machine Learning Testing

HyperNet-Adaptation for Diffusion-Based Test Case Generation

Technical University of Munich

Rate paper: 👍 👎 ♥ Save

AI Insights

Misclassification Rate: The proportion of test cases that induce failures in the SUT. (ML: 0.97)👍👎
Diffusion Model: A type of generative model that uses a Markov chain to model the data distribution. (ML: 0.95)👍👎
The approach also shows better efficiency compared to baselines, with lower runtime and budget usage across all tasks. (ML: 0.94)👍👎
HyNeA outperforms baselines in MS-SSIM, embedding diversity of generated test cases, and trace difference between original–target pairs across all tasks. (ML: 0.92)👍👎
Escape Ratio: The proportion of generated test cases that do not trigger their targeted misbehavior. (ML: 0.91)👍👎
HyperNet: A neural network placed on top of a selected diffusion model to adapt the weights of the model. (ML: 0.90)👍👎
MS-SSIM: A metric for measuring structural similarity between two images. (ML: 0.89)👍👎
HyNeA's effectiveness is demonstrated through its ability to consistently induce failures in the SUT, even when evaluation is restricted to the top-5 detections in object detection task. (ML: 0.88)👍👎
HyNeA is a novel approach for generating failure-inducing perturbations in deep neural networks, which consistently induces failures across all tasks (misclassification rate =1), maintains better control over targeted misbehavior (low escape ratio), and produces test cases that remain closer to the data distribution while achieving higher structural similarity and more stable embeddings. (ML: 0.87)👍👎
The approach uses a HyperNet placed on top of a selected diffusion model to adapt the weights of the model, allowing for more precise control over the generation process. (ML: 0.74)👍👎

Abstract
The increasing deployment of deep learning systems requires systematic evaluation of their reliability in real-world scenarios. Traditional gradient-based adversarial attacks introduce small perturbations that rarely correspond to realistic failures and mainly assess robustness rather than functional behavior. Generative test generation methods offer an alternative but are often limited to simple datasets or constrained input domains. Although diffusion models enable high-fidelity image synthesis, their computational cost and limited controllability restrict their applicability to large-scale testing. We present HyNeA, a generative testing method that enables direct and efficient control over diffusion-based generation. HyNeA provides dataset-free controllability through hypernetworks, allowing targeted manipulation of the generative process without relying on architecture-specific conditioning mechanisms or dataset-driven adaptations such as fine-tuning. HyNeA employs a distinct training strategy that supports instance-level tuning to identify failure-inducing test cases without requiring datasets that explicitly contain examples of similar failures. This approach enables the targeted generation of realistic failure cases at substantially lower computational cost than search-based methods. Experimental results show that HyNeA improves controllability and test diversity compared to existing generative test generators and generalizes to domains where failure-labeled training data is unavailable.

Why we are recommending this paper?
Due to your Interest in Machine Learning Testing

CREATE: Cross-Layer Resilience Characterization and Optimization for Efficient yet Reliable Embodied AI Systems

Peking University

Rate paper: 👍 👎 ♥ Save

AI Insights

It also demonstrates robustness and adaptability across diverse tasks and embodied AI platforms. (ML: 0.96)👍👎
The planner is responsible for task planning, while the controller executes the planned tasks. (ML: 0.91)👍👎
CREATE aims to improve the reliability, efficiency, and adaptability of embodied AI systems by mitigating the effects of timing errors on deep neural networks. (ML: 0.90)👍👎
Embodied AI: A field of research that focuses on developing intelligent systems that can interact with and adapt to their environment. (ML: 0.87)👍👎
Anomaly Detection (AD): A technique used in CREATE to detect timing errors in deep neural networks. (ML: 0.84)👍👎
The framework achieves significant improvements in task performance and system efficiency compared to state-of-the-art methods. (ML: 0.83)👍👎
Weight Repair (WR): A module in CREATE that repairs weights affected by timing errors. (ML: 0.82)👍👎
Dynamic Voltage Scaling (VS): A system in CREATE that dynamically adjusts the supply voltage of the controller model based on real-time entropy predictions. (ML: 0.79)👍👎
The CREATE framework is a hardware-software co-design approach for embodied AI systems that integrates three core techniques: Anomaly Detection (AD), Weight Repair (WR), and Dynamic Voltage Scaling (VS). (ML: 0.76)👍👎
CREATE: A hardware-software co-design approach for embodied AI systems that integrates three core techniques: Anomaly Detection (AD), Weight Repair (WR), and Dynamic Voltage Scaling (VS). (ML: 0.73)👍👎
The dynamic voltage scaling system required in CREATE is supported with a distributed LDO design. (ML: 0.70)👍👎
The framework consists of a planner and a controller, each with its own AD unit, WR module, and VS system. (ML: 0.66)👍👎
CREATE's anomaly detection units are appended to systolic arrays with 128x128 PEs, which are composed of an 8-bit multiplier and a 24-bit accumulator. (ML: 0.60)👍👎

Abstract
Embodied Artificial Intelligence (AI) has recently attracted significant attention as it bridges AI with the physical world. Modern embodied AI systems often combine a Large Language Model (LLM)-based planner for high-level task planning and a reinforcement learning (RL)-based controller for low-level action generation, enabling embodied agents to tackle complex tasks in real-world environments. However, deploying embodied agents remains challenging due to their high computation requirements, especially for battery-powered local devices. Although techniques like lowering operating voltage can improve energy efficiency, they can introduce bit errors and result in task failures. In this work, we propose CREATE, a general design principle that leverages heterogeneous resilience at different layers for synergistic energy-reliability co-optimization. For the first time, we conduct a comprehensive error injection study on modern embodied AI systems and observe an inherent but heterogeneous fault tolerance. Building upon these insights, we develop an anomaly detection and clearance mechanism at the circuit level to eliminate outlier errors. At the model level, we propose a weight-rotation-enhanced planning algorithm to improve the fault tolerance of the LLM-based planner. Furthermore, we introduce an application-level technique, autonomy-adaptive voltage scaling, to dynamically adjust the operating voltage of the controllers. The voltage scaling circuit is co-designed to enable online voltage adjustment. Extensive experiments demonstrate that without compromising task quality, CREATE achieves 40.6% computational energy savings on average over nominal-voltage baselines and 35.0% over prior-art techniques. This further leads to 29.5% to 37.3% chip-level energy savings and approximately a 15% to 30% improvement in battery life.

Why we are recommending this paper?
Due to your Interest in Machine Learning Resilience

On damage of interpolation to adversarial robustness in regression

Tsinghua University

Rate paper: 👍 👎 ♥ Save

AI Insights

Interpolating estimator: An estimator that fits the training data exactly, i.e., ˆf(X_i) = ξ_i for all i = 1, ..., n. (ML: 0.94)👍👎
Adversarial robustness: The ability of an estimator to maintain its performance under future adversarial attacks or covariates measured with errors. (ML: 0.92)👍👎
The results suggest that interpolating estimators may compromise robustness, and therefore, regularization techniques should be used to mitigate overfitting in neural network training. (ML: 0.91)👍👎
Further research is needed to develop theoretically grounded methods for enhancing the robustness of interpolating DNNs. (ML: 0.90)👍👎
Minimax risk: The minimum expected loss over all possible functions f^* in a function class H(β, L). (ML: 0.90)👍👎
Interpolating estimators fail to attain the minimax optimal adversarial rate. (ML: 0.88)👍👎
The study highlights the importance of considering adversarial robustness when training deep neural networks. (ML: 0.86)👍👎
Highly interpolating estimators can depart from the standard nonparametric rate at perturbation magnitudes much smaller than those of regular estimators. (ML: 0.82)👍👎
Achieving adversarial robustness may require stopping gradient descent considerably earlier than the point that minimizes the standard risk. (ML: 0.78)👍👎
The phase transition of interpolators departing from the standard nonparametric rate can occur at perturbation magnitudes much smaller than those of regular estimators. (ML: 0.74)👍👎

Abstract
Deep neural networks (DNNs) typically involve a large number of parameters and are trained to achieve zero or near-zero training error. Despite such interpolation, they often exhibit strong generalization performance on unseen data, a phenomenon that has motivated extensive theoretical investigations. Comforting results show that interpolation indeed may not affect the minimax rate of convergence under the squared error loss. In the mean time, DNNs are well known to be highly vulnerable to adversarial perturbations in future inputs. A natural question then arises: Can interpolation also escape from suboptimal performance under a future $X$-attack? In this paper, we investigate the adversarial robustness of interpolating estimators in a framework of nonparametric regression. A finding is that interpolating estimators must be suboptimal even under a subtle future $X$-attack, and achieving perfect fitting can substantially damage their robustness. An interesting phenomenon in the high interpolation regime, which we term the curse of simple size, is also revealed and discussed. Numerical experiments support our theoretical findings.

Why we are recommending this paper?
Due to your Interest in Machine Learning Resilience

Interests not found

We did not find any papers that match the below interests. Try other terms also consider if the content exists in arxiv.org.

Data Science Development Environment and Productivity
MLOps

You can edit or add more interests any time.

💬 Help Shape Our Pricing

We're exploring pricing options to make this project sustainable. Take 3 minutes to share what you'd be willing to pay (if anything). Your input guides our future investment.

Share Your Feedback

Help us improve your experience!

This project is on its early stages your feedback can be pivotal on the future of the project. Let us know what you think about this week's papers and suggestions!

Give Feedback