Hi!

Your personalized paper recommendations for 10 to 14 November, 2025.

🎯 Top Personalized Recommendations

Rethinking the Reliability of Multi-agent System: A Perspective from Byzantine Fault Tolerance

Tsinghua University

Why we think this paper is great for you:
This paper directly addresses the crucial challenge of ensuring reliability and identifying failures in complex multi-agent systems. It offers valuable insights into Byzantine Fault Tolerance, which is highly relevant for building robust ML infrastructure.


Abstract
Ensuring the reliability of agent architectures and effectively identifying problematic agents when failures occur are crucial challenges in multi-agent systems (MAS). Advances in large language models (LLMs) have established LLM-based agents as a major branch of MAS, enabling major breakthroughs in complex problem solving and world modeling. However, the reliability implications of this shift remain largely unexplored, i.e., whether substituting traditional agents with LLM-based agents can effectively enhance the reliability of MAS. In this work, we investigate and quantify the reliability of LLM-based agents from the perspective of Byzantine fault tolerance. We observe that LLM-based agents demonstrate stronger skepticism when processing erroneous message flows, a characteristic that enables them to outperform traditional agents across different topological structures. Motivated by the results of the pilot experiment, we design CP-WBFT, a confidence probe-based weighted Byzantine Fault Tolerant consensus mechanism to enhance the stability of MAS with different topologies. It capitalizes on the intrinsic reflective and discriminative capabilities of LLMs by employing a probe-based, weighted information flow transmission method to improve the reliability of LLM-based agents. Extensive experiments demonstrate that CP-WBFT achieves superior performance across diverse network topologies under extreme Byzantine conditions (85.7% fault rate). Notably, our approach surpasses traditional methods by attaining remarkable accuracy on various topologies and maintaining strong reliability in both mathematical reasoning and safety assessment tasks.

AI Summary

Hidden-level Confidence Probing (HCP) consistently outperforms Prompt-level Confidence Probing (PCP) and single-token extraction methods, demonstrating that decoder-level semantic consistency signals are superior for robust confidence assessment. [3]
LLM-based agents inherently possess stronger skepticism towards erroneous information, enabling them to significantly outperform traditional agents in Byzantine fault tolerance across various network topologies. [2]
The proposed CP-WBFT mechanism effectively enhances MAS reliability by leveraging LLM's intrinsic reflective and discriminative capabilities through confidence-guided weighted information flow. [2]
CP-WBFT, particularly with Hidden-level Confidence Probing (HCP), achieves remarkable Byzantine fault tolerance, maintaining 100% final accuracy even under extreme conditions (85.7% fault rate) in well-connected topologies like complete graphs. [2]
Network topology critically influences consensus effectiveness, with complete graphs maximizing information flow for optimal performance, while constrained topologies limit consensus due to restricted information exchange. [2]
LLM-based multi-agent systems can exceed the classical Byzantine fault tolerance bound of f < n/3, tolerating a much higher proportion of malicious nodes than traditional systems. [2]
Safety assessment tasks (XSTest) exhibit higher topology dependence for effective consensus compared to mathematical reasoning tasks (GSM8K), which show more topology-agnostic robustness. [2]
Byzantine Fault Tolerance (BFT) in LLM-based MAS: The ability of multi-agent systems composed of LLMs to achieve consensus and maintain reliability despite the presence of malicious or arbitrarily faulty LLM agents. [2]
CP-WBFT (Confidence Probe-based Weighted Byzantine Fault Tolerant consensus mechanism): A novel consensus protocol that enhances MAS stability by dynamically assigning information weights based on agents' confidence levels, derived from either prompt-level or hidden-level probes. [2]
Prompt-level Confidence Probe (PCP): A method to explicitly elicit and quantify an LLM agent's confidence in its response through structured prompting strategies, leveraging the LLM's self-reflective capabilities. [2]
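
To make the idea of confidence-guided weighted consensus concrete, here is a minimal sketch of a confidence-weighted vote compared with a plain majority vote. It is not the CP-WBFT protocol from the paper: the function names, the confidence values, and the assumption that a probe assigns low confidence to faulty answers are all hypothetical. The 7-agent setup with 6 faulty agents is chosen only because 6/7 is roughly the 85.7% fault rate quoted above.

```python
from collections import defaultdict


def weighted_vote(answers, confidences):
    """Return the answer with the largest total confidence weight."""
    totals = defaultdict(float)
    for answer, weight in zip(answers, confidences):
        totals[answer] += weight
    return max(totals, key=totals.get)


def majority_vote(answers):
    """Plain majority vote: every agent gets the same weight."""
    return weighted_vote(answers, [1.0] * len(answers))


# Hypothetical scenario: 1 honest agent and 6 Byzantine agents (6/7 faulty).
answers = ["42"] + ["17"] * 6
# Assumed probe output: high confidence for the honest answer, low for the rest.
confidences = [0.95] + [0.10] * 6

print(majority_vote(answers))               # the faulty majority wins
print(weighted_vote(answers, confidences))  # the honest answer wins
```

The sketch only works when the confidence scores separate honest from faulty agents. If faulty agents also receive high confidence, weighting does not help.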

Fault Tolerant Reconfigurable ML Multiprocessor

Temple University

Why we think this paper is great for you:
You will find this paper highly relevant as it explores fault-tolerant reconfigurable multiprocessor architectures for neural network training. It directly contributes to understanding resilient machine learning infrastructure.


Abstract
This paper reports three computational experiments for a von Neumann inspired reconfigurable fault tolerant multiprocessor for neural network (NN) training workflows. The experiments are intended to prove the feasibility of the proposed reconfigurable multiprocessor architecture for non-regular workflows on robustness of adaptability. A potential integration with MLIR compilers is also discussed for integrating diverse accelerator hardware for existing practical applications.

Resilient by Design -- Active Inference for Distributed Continuum Intelligence

Why we think this paper is great for you:
This paper focuses on designing resilient systems that can withstand failures in distributed computing environments. It provides key insights into ensuring reliability and fault tolerance for machine learning operations.


Abstract
Failures are the norm in highly complex and heterogeneous devices spanning the distributed computing continuum (DCC), from resource-constrained IoT and edge nodes to high-performance computing systems. Ensuring reliability and global consistency across these layers remains a major challenge, especially for AI-driven workloads requiring real-time, adaptive coordination. This paper introduces a Probabilistic Active Inference Resilience Agent (PAIR-Agent) to achieve resilience in DCC systems. PAIR-Agent performs three core operations: (i) constructing a causal fault graph from device logs, (ii) identifying faults while managing certainties and uncertainties using Markov blankets and the free-energy principle, and (iii) autonomously healing issues through active inference. Through continuous monitoring and adaptive reconfiguration, the agent maintains service continuity and stability under diverse failure conditions. Theoretical validations confirm the reliability and effectiveness of the proposed framework.

Test-driven Reinforcement Learning

Xi'an Jiaotong University

Why we think this paper is great for you:
This paper introduces a test-driven approach to reinforcement learning, which is directly applicable to your interest in machine learning testing methodologies. It offers a practical perspective on validating learning agents.


Abstract
Reinforcement learning (RL) has been recognized as a powerful tool for robot control tasks. RL typically employs reward functions to define task objectives and guide agent learning. However, since the reward function serves the dual purpose of defining the optimal goal and guiding learning, it is challenging to design the reward function manually, which often results in a suboptimal task representation. To tackle the reward design challenge in RL, inspired by the satisficing theory, we propose a Test-driven Reinforcement Learning (TdRL) framework. In the TdRL framework, multiple test functions are used to represent the task objective rather than a single reward function. Test functions can be categorized as pass-fail tests and indicative tests, each dedicated to defining the optimal objective and guiding the learning process, respectively, thereby making defining tasks easier. Building upon such a task definition, we first prove that if a trajectory return function assigns higher returns to trajectories closer to the optimal trajectory set, maximum entropy policy optimization based on this return function will yield a policy that is closer to the optimal policy set. Then, we introduce a lexicographic heuristic approach to compare the relative distance relationship between trajectories and the optimal trajectory set for learning the trajectory return function. Furthermore, we develop an algorithm implementation of TdRL. Experimental results on the DeepMind Control Suite benchmark demonstrate that TdRL matches or outperforms handcrafted reward methods in policy training, with greater design simplicity and inherent support for multi-objective optimization. We argue that TdRL offers a novel perspective for representing task objectives, which could be helpful in addressing the reward design challenges in RL applications.
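
As a rough illustration of comparing trajectories lexicographically with two kinds of tests, the sketch below ranks a trajectory first by how many pass-fail tests it passes and only then by its indicative scores. This is an assumed ordering for illustration, not the paper's heuristic; `trajectory_key` and `closer_to_optimal` are hypothetical names.

```python
def trajectory_key(pass_fail_results, indicative_scores):
    """Build a sort key: passed pass-fail tests first, indicative total second."""
    return (sum(pass_fail_results), sum(indicative_scores))


def closer_to_optimal(traj_a, traj_b):
    """True if traj_a ranks strictly above traj_b under the lexicographic key."""
    # Python compares tuples lexicographically, element by element.
    return trajectory_key(*traj_a) > trajectory_key(*traj_b)


# Each trajectory is (pass-fail results, indicative scores); values are made up.
traj_a = ([True, True], [0.1])
traj_b = ([True, False], [0.9])

# traj_a passes more pass-fail tests, so it ranks higher despite lower indicative scores.
print(closer_to_optimal(traj_a, traj_b))
```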

Learning to Validate Generative Models: a Goodness-of-Fit Approach

Why we think this paper is great for you:
You will appreciate this paper's focus on rigorous validation techniques for generative models, addressing challenges in scalability and statistical power. It provides a strong foundation for machine learning validation practices.


Abstract
Generative models are increasingly central to scientific workflows, yet their systematic use and interpretation require a proper understanding of their limitations through rigorous validation. Classic approaches struggle with scalability, statistical power, or interpretability when applied to high-dimensional data, making it difficult to certify the reliability of these models in realistic, high-dimensional scientific settings. Here, we propose the use of the New Physics Learning Machine (NPLM), a learning based approach to goodness-of-fit testing inspired by the Neyman-Pearson construction, to test generative networks trained on high-dimensional scientific data. We demonstrate the performance of NPLM for validation in two benchmark cases: generative models trained on mixtures of Gaussian models with increasing dimensionality, and a public end-to-end generator for the Large Hadron Collider called FlashSim, trained on jet data, typical in the field of high-energy physics. We demonstrate that the NPLM can serve as a powerful validation method while also providing a means to diagnose sub-optimally modeled regions of the data.

Investigating CoT Monitorability in Large Reasoning Models

KAUST

Why we think this paper is great for you:
This paper delves into the monitorability of large reasoning models, offering crucial insights into how to track and ensure the safety of complex AI systems. It directly supports your interest in robust model monitoring.


Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex tasks by engaging in extended reasoning before producing final answers. Beyond improving abilities, these detailed reasoning traces also create a new opportunity for AI safety, CoT Monitorability: monitoring potential model misbehavior, such as the use of shortcuts or sycophancy, through their chain-of-thought (CoT) during decision-making. However, two key fundamental challenges arise when attempting to build more effective monitors through CoT analysis. First, as prior research on CoT faithfulness has pointed out, models do not always truthfully represent their internal decision-making in the generated reasoning. Second, monitors themselves may be either overly sensitive or insufficiently sensitive, and can potentially be deceived by models' long, elaborate reasoning traces. In this paper, we present the first systematic investigation of the challenges and potential of CoT monitorability. Motivated by two fundamental challenges we mentioned before, we structure our study around two central perspectives: (i) verbalization: to what extent do LRMs faithfully verbalize the true factors guiding their decisions in the CoT, and (ii) monitor reliability: to what extent can misbehavior be reliably detected by a CoT-based monitor? Specifically, we provide empirical evidence and correlation analyses between verbalization quality, monitor reliability, and LLM performance across mathematical, scientific, and ethical domains. Then we further investigate how different CoT intervention methods, designed to improve reasoning efficiency or performance, will affect monitoring effectiveness. Finally, we propose MoME, a new paradigm in which LLMs monitor other models' misbehavior through their CoT and provide structured judgments along with supporting evidence.

ML-EcoLyzer: Quantifying the Environmental Cost of Machine Learning Inference Across Frameworks and Hardware

Philippines

Why we think this paper is great for you:
This paper offers a practical tool for quantifying the environmental impact of machine learning inference across various hardware and frameworks. It provides valuable data for optimizing MLOps and infrastructure for online inference.


Abstract
Machine learning inference occurs at a massive scale, yet its environmental impact remains poorly quantified, especially on low-resource hardware. We present ML-EcoLyzer, a cross-framework tool for measuring the carbon, energy, thermal, and water costs of inference across CPUs, consumer GPUs, and datacenter accelerators. The tool supports both classical and modern models, applying adaptive monitoring and hardware-aware evaluation. We introduce the Environmental Sustainability Score (ESS), which quantifies the number of effective parameters served per gram of CO2 emitted. Our evaluation covers over 1,900 inference configurations, spanning diverse model architectures, task modalities (text, vision, audio, tabular), hardware types, and precision levels. These rigorous and reliable measurements demonstrate that quantization enhances ESS, huge accelerators can be inefficient for lightweight applications, and even small models may incur significant costs when implemented suboptimally. ML-EcoLyzer sets a standard for sustainability-conscious model selection and offers an extensive empirical evaluation of environmental costs during inference.

Machine Learning Resilience

Misaligned by Design: Incentive Failures in Machine Learning

MIT


Abstract
The cost of error in many high-stakes settings is asymmetric: misdiagnosing pneumonia when absent is an inconvenience, but failing to detect it when present can be life-threatening. Because of this, artificial intelligence (AI) models used to assist such decisions are frequently trained with asymmetric loss functions that incorporate human decision-makers' trade-offs between false positives and false negatives. In two focal applications, we show that this standard alignment practice can backfire. In both cases, it would be better to train the machine learning model with a loss function that ignores the human's objective and then adjust predictions ex post according to that objective. We rationalize this result using an economic model of incentive design with endogenous information acquisition. The key insight from our theoretical framework is that machine classifiers perform not one but two incentivized tasks: choosing how to classify and learning how to classify. We show that while the adjustments engineers use correctly incentivize choosing, they can simultaneously reduce the incentives to learn. Our formal treatment of the problem reveals that methods embraced for their intuitive appeal can in fact misalign human and machine objectives in predictable ways.
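
The ex post adjustment mentioned in the abstract can be illustrated with the standard cost-sensitive decision rule: train a model to output a probability, then pick the decision threshold from the error costs afterwards. The sketch below assumes a calibrated probability and hypothetical cost values; it is the textbook rule, not the paper's analysis.

```python
def cost_threshold(cost_fp, cost_fn):
    """Probability above which predicting 'positive' has lower expected cost.

    Predict positive when p * cost_fn > (1 - p) * cost_fp,
    which rearranges to p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def decide(prob_positive, cost_fp, cost_fn):
    """Apply the cost-based threshold to a (calibrated) predicted probability."""
    return prob_positive > cost_threshold(cost_fp, cost_fn)


# Hypothetical costs: a missed case is 9 times as costly as a false alarm.
print(cost_threshold(cost_fp=1, cost_fn=9))
print(decide(0.2, cost_fp=1, cost_fn=9))   # flagged, although p < 0.5
print(decide(0.05, cost_fp=1, cost_fn=9))  # not flagged
```

Changing the costs here only moves the threshold; the model that produces the probability is left untouched.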

Machine Learning Testing

Sequential Adversarial Hypothesis Testing


Abstract
We study the adversarial binary hypothesis testing problem in the sequential setting. Associated with each hypothesis is a closed, convex set of distributions. Given the hypothesis, each observation is generated according to a distribution chosen (from the set associated with the hypothesis) by an adversary who has access to past observations. In the sequential setting, the number of observations the detector uses to arrive at a decision is variable; this extra freedom improves the asymptotic performance of the test. We characterize the closure of the set of achievable pairs of error exponents. We also study the problem under constraints on the number of observations used and the probability of error incurred.
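
For readers new to sequential testing, the sketch below shows the classical, non-adversarial baseline: Wald's sequential probability ratio test for two simple Bernoulli hypotheses. In the paper each hypothesis is a set of distributions chosen from by an adversary, which this sketch does not model. The parameter values and the `sprt` function name are illustrative assumptions.

```python
import math


def sprt(observations, p0, p1, alpha=0.05, beta=0.05):
    """Wald's SPRT for Bernoulli data: H0 has success rate p0, H1 has p1.

    Returns (decision, number of observations used).
    """
    upper = math.log((1 - beta) / alpha)  # cross this to accept H1
    lower = math.log(beta / (1 - alpha))  # cross this to accept H0
    llr = 0.0  # running log-likelihood ratio of H1 against H0
    for n, x in enumerate(observations, start=1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    return "undecided", len(observations)


# The test stops as soon as the evidence is strong enough,
# so the number of observations used is variable.
print(sprt([1] * 10, p0=0.2, p1=0.8))
print(sprt([1, 0], p0=0.2, p1=0.8))
```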

Data Science Development Environment and Productivity

wdiexplorer: An R package Designed for Exploratory Analysis of World Development Indicators (WDI) Data

Maynooth University


Abstract
The World Development Indicators (WDI) database provides a wide range of global development data, maintained and published by the World Bank. Our wdiexplorer package offers a comprehensive workflow that sources WDI data via the WDI R package, prepares and explores country-level panel data of the WDI through computational functions to calculate diagnostic metrics and visualise the outputs. By leveraging the functionalities of the wdiexplorer package, users can efficiently explore any indicator dataset of the WDI, compute diagnostic indices, and visualise the metrics by incorporating the pre-defined grouping structures to identify patterns, outliers, and other interesting features of temporal behaviours. This paper presents the wdiexplorer package, demonstrates its functionalities using the WDI: PM2.5 air pollution dataset, and discusses the observed patterns and outliers across countries and within groups of country-level panel data.

AI-Powered Data Visualization Platform: An Intelligent Web Application for Automated Dataset Analysis

Presidency University


Abstract
This paper presents an AI-powered data visualization platform that automates the entire data analysis process, from uploading a dataset to generating an interactive visualization. Advanced machine learning algorithms are employed to clean and preprocess the data, analyse its features, and automatically select appropriate visualizations. The system establishes the process of automating AI-based analysis and visualization from the context of data-driven environments, and eliminates the challenge of time-consuming manual data analysis. The combination of a Python Flask backend to access the dataset, paired with a React frontend, provides a robust platform that automatically interacts with Firebase Cloud Storage for numerous data processing and data analysis solutions and real-time sources. Key contributions include automatic and intelligent data cleaning, with imputation for missing values and detection of outliers via analysis of the dataset. Features are selected intelligently using four different algorithms, and title generation and visualization are determined by the attributes of the dataset. These contributions were evaluated using two separate datasets to assess the platform's performance. In the process evaluation, the initial analysis was performed in real time on datasets as large as 100,000 rows, while the cloud-based demand platform scales to meet requests from multiple users and processes them simultaneously. In conclusion, the cloud-based data visualization application allowed for a significant reduction of manual inputs to the data analysis process while maintaining high-quality, impactful visual outputs and user experiences.

Machine Learning Infrastructure

How Worrying Are Privacy Attacks Against Machine Learning?

URV


Abstract
In several jurisdictions, the regulatory framework on the release and sharing of personal data is being extended to machine learning (ML). The implicit assumption is that disclosing a trained ML model entails a privacy risk for any personal data used in training comparable to directly releasing those data. However, given a trained model, it is necessary to mount a privacy attack to make inferences on the training data. In this concept paper, we examine the main families of privacy attacks against predictive and generative ML, including membership inference attacks (MIAs), property inference attacks, and reconstruction attacks. Our discussion shows that most of these attacks seem less effective in the real world than what a prima facie interpretation of the related literature could suggest.

Online inference

Theory and computation for structured variational inference


Abstract
Structured variational inference constitutes a core methodology in modern statistical applications. Unlike mean-field variational inference, the approximate posterior is assumed to have interdependent structure. We consider the natural setting of star-structured variational inference, where a root variable impacts all the other ones. We prove the first results for existence, uniqueness, and self-consistency of the variational approximation. In turn, we derive quantitative approximation error bounds for the variational approximation to the posterior, extending prior work from the mean-field setting to the star-structured setting. We also develop a gradient-based algorithm with provable guarantees for computing the variational approximation using ideas from optimal transport theory. We explore the implications of our results for Gaussian measures and hierarchical Bayesian models, including generalized linear models with location family priors and spike-and-slab priors with one-dimensional debiasing. As a by-product of our analysis, we develop new stability results for star-separable transport maps which might be of independent interest.

Robust Sampling for Active Statistical Inference


Abstract
Active statistical inference is a new method for inference with AI-assisted data collection. Given a budget on the number of labeled data points that can be collected and assuming access to an AI predictive model, the basic idea is to improve estimation accuracy by prioritizing the collection of labels where the model is most uncertain. The drawback, however, is that inaccurate uncertainty estimates can make active sampling produce highly noisy results, potentially worse than those from naive uniform sampling. In this work, we present robust sampling strategies for active statistical inference. Robust sampling ensures that the resulting estimator is never worse than the estimator using uniform sampling. Furthermore, with reliable uncertainty estimates, the estimator usually outperforms standard active inference. This is achieved by optimally interpolating between uniform and active sampling, depending on the quality of the uncertainty scores, and by using ideas from robust optimization. We demonstrate the utility of the method on a series of real datasets from computational social science and survey research.
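
The interpolation between uniform and active sampling can be sketched as a mixture of labeling probabilities, paired with an inverse-probability-weighted mean estimate. This is a simplified illustration under stated assumptions: the mixing weight `lam` is a fixed, hypothetical input here, whereas the abstract says the paper chooses the interpolation optimally from the quality of the uncertainty scores, and all function names are invented for the example.

```python
def active_probs(uncertainties, budget_fraction):
    """Labeling probabilities proportional to model uncertainty, capped at 1."""
    mean_u = sum(uncertainties) / len(uncertainties)
    if mean_u == 0:
        # No uncertainty signal at all: fall back to uniform sampling.
        return [budget_fraction] * len(uncertainties)
    return [min(1.0, budget_fraction * u / mean_u) for u in uncertainties]


def robust_probs(uncertainties, budget_fraction, lam):
    """Mix uniform sampling (lam = 0) with active sampling (lam = 1)."""
    active = active_probs(uncertainties, budget_fraction)
    return [(1 - lam) * budget_fraction + lam * a for a in active]


def ipw_mean(predictions, labels, labeled, probs):
    """Mean estimate: model prediction plus a reweighted correction where labeled."""
    total = 0.0
    for f, y, is_labeled, pi in zip(predictions, labels, labeled, probs):
        correction = (y - f) / pi if is_labeled else 0.0
        total += f + correction
    return total / len(predictions)


uncertainties = [0.1, 0.2, 0.3, 0.4]
print(robust_probs(uncertainties, budget_fraction=0.5, lam=0.0))  # uniform
print(robust_probs(uncertainties, budget_fraction=0.5, lam=0.5))  # halfway mix
```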

Machine Learning Operations

OR-R1: Automating Modeling and Solving of Operations Research Optimization Problem via Test-Time Reinforcement Learning


Abstract
Optimization modeling and solving are fundamental to the application of Operations Research (OR) in real-world decision making, yet the process of translating natural language problem descriptions into formal models and solver code remains highly expertise intensive. While recent advances in large language models (LLMs) have opened new opportunities for automation, the generalization ability and data efficiency of existing LLM-based methods are still limited, as most require vast amounts of annotated or synthetic data, resulting in high costs and scalability barriers. In this work, we present OR-R1, a data-efficient training framework for automated optimization modeling and solving. OR-R1 first employs supervised fine-tuning (SFT) to help the model acquire the essential reasoning patterns for problem formulation and code generation from limited labeled data. In addition, it improves the capability and consistency through Test-Time Group Relative Policy Optimization (TGRPO). This two-stage design enables OR-R1 to leverage both scarce labeled and abundant unlabeled data for effective learning. Experiments show that OR-R1 achieves state-of-the-art performance with an average solving accuracy of 67.7%, using only 1/10 the synthetic data required by prior methods such as ORLM, exceeding ORLM's solving accuracy by up to 4.2%. Remarkably, OR-R1 outperforms ORLM by over 2.4% with just 100 synthetic samples. Furthermore, TGRPO contributes an additional 3.1%-6.4% improvement in accuracy, significantly narrowing the gap between single-attempt (Pass@1) and multi-attempt (Pass@8) performance from 13% to 7%. Extensive evaluations across diverse real-world benchmarks demonstrate that OR-R1 provides a robust, scalable, and cost-effective solution for automated OR optimization problem modeling and solving, lowering the expertise and data barriers for industrial OR applications.

Low-Discrepancy Set Post-Processing via Gradient Descent


Abstract
The construction of low-discrepancy sets, used for uniform sampling and numerical integration, has recently seen great improvements based on optimization and machine learning techniques. However, these methods are computationally expensive, often requiring days of computation or access to GPU clusters. We show that simple gradient descent-based techniques allow for comparable results when starting with a reasonably uniform point set. Not only is this method much more efficient and accessible, but it can be applied as post-processing to any low-discrepancy set generation method for a variety of standard discrepancy measures.

Machine Learning Validation

Beyond Uniform Deletion: A Data Value-Weighted Framework for Certified Machine Unlearning


Abstract
As the right to be forgotten becomes legislated worldwide, machine unlearning mechanisms have emerged to efficiently update models for data deletion and enhance user privacy protection. However, existing machine unlearning algorithms frequently neglect the fact that different data points may contribute unequally to model performance (i.e., heterogeneous data values). Treating them equally in the machine unlearning procedure can potentially degrade the performance of updated models. To address this limitation, we propose Data Value-Weighted Unlearning (DVWU), a general unlearning framework that incorporates data value heterogeneity into the unlearning process. Specifically, we design a weighting strategy based on data values, which are then integrated into the unlearning procedure to enable differentiated unlearning for data points with varying utility to the model. The DVWU framework can be broadly adapted to various existing machine unlearning methods. We use the one-step Newton update as an example for implementation, developing both output and objective perturbation algorithms to achieve certified unlearning. Experiments on both synthetic and real-world datasets demonstrate that our methods achieve superior predictive performance and robustness compared to conventional unlearning approaches. We further show the extensibility of our framework to the gradient ascent method by incorporating the proposed weighting strategy into the gradient terms, highlighting the adaptability of DVWU for broader gradient-based deep unlearning methods.

Model Monitoring

A Theoretical Analysis of Detecting Large Model-Generated Time Series


Abstract
Motivated by the increasing risks of data misuse and fabrication, we investigate the problem of identifying synthetic time series generated by Time-Series Large Models (TSLMs) in this work. While there is extensive research on detecting model-generated text, we find that these existing methods are not applicable to time series data due to the fundamental modality difference, as time series usually have lower information density and smoother probability distributions than text data, which limits the discriminative power of token-based detectors. To address this issue, we examine the subtle distributional differences between real and model-generated time series and propose the contraction hypothesis, which states that model-generated time series, unlike real ones, exhibit progressively decreasing uncertainty under recursive forecasting. We formally prove this hypothesis under theoretical assumptions on model behavior and time series structure. Model-generated time series exhibit progressively concentrated distributions under recursive forecasting, leading to uncertainty contraction. We provide empirical validation of the hypothesis across diverse datasets. Building on this insight, we introduce the Uncertainty Contraction Estimator (UCE), a white-box detector that aggregates uncertainty metrics over successive prefixes to identify TSLM-generated time series. Extensive experiments on 32 datasets show that UCE consistently outperforms state-of-the-art baselines, offering a reliable and generalizable solution for detecting model-generated time series.

Machine Learning Lifecycle

An Adaptive Machine Learning Triage Framework for Predicting Alzheimer's Disease Progression

Emory University


Abstract
Accurate predictions of conversion from mild cognitive impairment (MCI) to Alzheimer's disease (AD) can enable effective personalized therapy. While cognitive tests and clinical data are routinely collected, they lack the predictive power of PET scans and CSF biomarker analysis, which are prohibitively expensive to obtain for every patient. To address this cost-accuracy dilemma, we design a two-stage machine learning framework that selectively obtains advanced, costly features based on their predicted "value of information". We apply our framework to predict AD progression for MCI patients using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Our framework reduces the need for advanced testing by 20% while achieving a test AUROC of 0.929, comparable to the model that uses both basic and advanced features (AUROC=0.915, p=0.1010). We also provide an example interpretability analysis showing how one may explain the triage decision. Our work presents an interpretable, data-driven framework that optimizes AD diagnostic pathways and balances accuracy with cost, representing a step towards making early, reliable AD prediction more accessible in real-world practice. Future work should consider multiple categories of advanced features and larger-scale validation.
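
The two-stage idea of requesting costly features only when they are likely to matter can be sketched as a simple triage rule. The version below uses an uncertainty band on the basic model's probability as a stand-in for the "value of information" the abstract mentions; that proxy, the band limits, and the function names are assumptions made for illustration, not the paper's framework.

```python
def needs_advanced_testing(basic_prob, lower=0.3, upper=0.7):
    """Request advanced features only when the basic prediction is uncertain."""
    return lower < basic_prob < upper


def triage_predict(basic_prob, advanced_model, lower=0.3, upper=0.7):
    """Return (final probability, whether advanced testing was used).

    advanced_model is a zero-argument callable that stands in for running
    the costly tests and scoring the patient with the full model.
    """
    if needs_advanced_testing(basic_prob, lower, upper):
        return advanced_model(), True
    return basic_prob, False


# Hypothetical patients: one confident basic prediction, one uncertain.
print(triage_predict(0.9, advanced_model=lambda: 0.5))
print(triage_predict(0.5, advanced_model=lambda: 0.8))
```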

Interests not found

We did not find any papers that match the interests below. Try other terms, and consider whether the content exists on arxiv.org.

MLOps
Data Science Development Tools
Machine Learning Deployment

You can edit or add more interests any time.

Help us improve your experience!

This project is in its early stages, and your feedback can be pivotal to its future. Let us know what you think about this week's papers and suggestions!

Give Feedback