Hi!

Your personalized paper recommendations for 02 to 06 February, 2026.

TIDE: Temporal Incremental Draft Engine for Self-Improving LLM Inference

Moreh

Rate paper: 👍 👎 ♥ Save

AI Insights

Training overhead can outweigh benefits when the performance gap between high-end and low-end GPUs is small and speculative decoding speedup is modest. (ML: 0.95)👍👎
Incremental training: a method where the model is trained incrementally, with each iteration building on the previous one. (ML: 0.95)👍👎
The benefits increase with both higher GPU ratios and larger speculative speedup, but training overhead slightly outweighs benefits when the performance gap between high-end and low-end GPUs is small and speculative decoding speedup is modest. (ML: 0.95)👍👎
Temporal parallelism: a technique that involves running multiple iterations of a model in parallel to improve training efficiency. (ML: 0.95)👍👎
Further research is needed to fully explore the potential of TIDE and its variants, as well as to develop more efficient and scalable methods for large language models. (ML: 0.93)👍👎
TIDE (Temporal Incremental Draft Engine) is a system that combines temporal parallelism with incremental training to improve the performance of large language models. (ML: 0.91)👍👎
TIDE's heterogeneous strategy can achieve significant performance improvements in large language models, making it a promising approach for real-world applications. (ML: 0.89)👍👎
Speculative decoding: a technique used by TIDE to generate predictions for a sequence of tokens without waiting for the entire sequence to be processed. (ML: 0.87)👍👎
The choice of GPU configuration and speculative speedup value is crucial in determining the benefits of TIDE's heterogeneous strategy. (ML: 0.85)👍👎
TIDE's heterogeneous strategy achieves up to 1.26× relative throughput improvement for H100:MI250 (4:1) configuration with speculative speedup of s= 1.3. (ML: 0.62)👍👎

Abstract
Speculative decoding can substantially accelerate LLM inference, but realizing its benefits in practice is challenging due to evolving workloads and system-level constraints. We present TIDE (Temporal Incremental Draft Engine), a serving-engine-native framework that integrates online draft adaptation directly into high-performance LLM inference systems. TIDE reuses target model hidden states generated during inference as training signals, enabling zero-overhead draft adaptation without reloading the target model, and employs adaptive runtime control to activate speculation and training only when beneficial. TIDE exploits heterogeneous clusters by mapping decoupled inference and training to appropriate GPU classes. Across diverse real-world workloads, TIDE achieves up to 1.15x throughput improvement over static speculative decoding while reducing draft training time by 1.67x compared to approaches that recompute training signals.

Why we are recommending this paper?
Due to your Interest in Online inference

This paper directly addresses the user's interest in online inference and LLM infrastructure, proposing a framework for adapting LLM serving engines to evolving workloads. The focus on speculative decoding and temporal adaptation aligns strongly with the need for resilient and efficient ML deployment.

Team, Then Trim: An Assembly-Line LLM Framework for High-Quality Tabular Data Generation

University of Washington

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

The paper does not provide a comprehensive evaluation of the proposed framework's performance on real-world datasets. (ML: 0.97)👍👎
However, there are still challenges to be addressed, such as ensuring the quality and diversity of synthetic data, as well as evaluating its robustness in real-world applications. (ML: 0.95)👍👎
LLM (Large Language Model): A type of artificial intelligence model that is trained on vast amounts of text data to generate human-like language and perform various NLP tasks. (ML: 0.95)👍👎
The authors do not discuss potential limitations or challenges in implementing the LLM-based framework in practice. (ML: 0.94)👍👎
Synthetic data: Data generated artificially using algorithms or machine learning models, rather than being collected from real-world sources. (ML: 0.94)👍👎
Previous research has shown that synthetic data can be used to augment real-world datasets, improving model performance and robustness. (ML: 0.93)👍👎
The paper discusses the challenges of generating high-quality synthetic data for various applications, including natural language processing (NLP), computer vision, and tabular data. (ML: 0.91)👍👎
The authors propose an LLM-based framework for synthetic data generation, which leverages the strengths of large language models to generate realistic and diverse synthetic data. (ML: 0.90)👍👎
The paper highlights the importance of evaluating the quality of synthetic data using metrics such as fidelity, diversity, and robustness. (ML: 0.88)👍👎
The LLM-based framework for synthetic data generation has the potential to revolutionize the field of data augmentation, enabling researchers and practitioners to generate high-quality synthetic data with ease. (ML: 0.88)👍👎

Abstract
While tabular data is fundamental to many real-world machine learning (ML) applications, acquiring high-quality tabular data is usually labor-intensive and expensive. Limited by the scarcity of observations, tabular datasets often exhibit critical deficiencies, such as class imbalance, selection bias, and low fidelity. To address these challenges, building on recent advances in Large Language Models (LLMs), this paper introduces Team-then-Trim (T$^2$), a framework that synthesizes high-quality tabular data through a collaborative team of LLMs, followed by a rigorous three-stage plug-in data quality control (QC) pipeline. In T$^2$, tabular data generation is conceptualized as a manufacturing process: specialized LLMs, guided by domain knowledge, are tasked with generating different data components sequentially, and the resulting products, i.e., the synthetic data, are systematically evaluated across multiple dimensions of QC. Empirical results on both simulated and real-world datasets demonstrate that T$^2$ outperforms state-of-the-art methods in producing high-quality tabular data, highlighting its potential to support downstream models when direct data collection is practically infeasible.

Why we are recommending this paper?
Due to your Interest in Data Science Development Tools

Given the user's interest in data science development tools and MLOps, this paper’s approach to generating high-quality tabular data is highly relevant. The assembly-line framework tackles the challenges of data scarcity and deficiencies, a critical area for robust ML systems.

Cascading Robustness Verification: Toward Efficient Model-Agnostic Certification

Toronto Metropolitan University

Rate paper: 👍 👎 ♥ Save

AI Insights

The proposed Certification with Relaxation and Verification (CRV) method effectively improves robust accuracy (RA) and scalability across models with different training objectives. (ML: 0.90)👍👎
FSR typically reduces runtime by 50-60% while altering the certified robustness bounds by less than 0.5%, confirming that the proposed strategy achieves substantial speedups with negligible loss in accuracy. (ML: 0.79)👍👎
CRV achieves the same RA as SDP-cert on Grad-NN while reducing runtime by 42% and improves RA from 82% to 88% on LP-NN, reducing runtime by 82% compared to SDP-cert. (ML: 0.74)👍👎
CRV: Certification with Relaxation and Verification RA: Robust Accuracy SDP-cert: SDP-based certification method LP-cert: LP-based certification method PGD Success: PGD (Projected Gradient Descent) attack success rate SR: Selective Relaxation approach (ML: 0.73)👍👎
The Selective Relaxation (SR) approach maintains the accuracy of the tightest relaxation (V13) while reducing runtime by 29.65%, and the Fast Selective Relaxation (FSR) further improves efficiency with a 41.65% speedup. (ML: 0.67)👍👎

Abstract
Certifying neural network robustness against adversarial examples is challenging, as formal guarantees often require solving non-convex problems. Hence, incomplete verifiers are widely used because they scale efficiently and substantially reduce the cost of robustness verification compared to complete methods. However, relying on a single verifier can underestimate robustness because of loose approximations or misalignment with training methods. In this work, we propose Cascading Robustness Verification (CRV), which goes beyond an engineering improvement by exposing fundamental limitations of existing robustness metric and introducing a framework that enhances both reliability and efficiency. CRV is a model-agnostic verifier, meaning that its robustness guarantees are independent of the model's training process. The key insight behind the CRV framework is that, when using multiple verification methods, an input is certifiably robust if at least one method certifies it as robust. Rather than relying solely on a single verifier with a fixed constraint set, CRV progressively applies multiple verifiers to balance the tightness of the bound and computational cost. Starting with the least expensive method, CRV halts as soon as an input is certified as robust; otherwise, it proceeds to more expensive methods. For computationally expensive methods, we introduce a Stepwise Relaxation Algorithm (SR) that incrementally adds constraints and checks for certification at each step, thereby avoiding unnecessary computation. Our theoretical analysis demonstrates that CRV achieves equal or higher verified accuracy compared to powerful but computationally expensive incomplete verifiers in the cascade, while significantly reducing verification overhead. Empirical results confirm that CRV certifies at least as many inputs as benchmark approaches, while improving runtime efficiency by up to ~90%.

Why we are recommending this paper?
Due to your Interest in Model Monitoring

This paper’s focus on model-agnostic certification directly addresses the user’s interest in model validation and testing, particularly concerning robustness. The work on efficient verification methods is crucial for ensuring reliable and trustworthy ML deployments.

Quality Model for Machine Learning Components

Carnegie Mellon University

Rate paper: 👍 👎 ♥ Save

AI Insights

Next steps for this work include larger-scale evaluation of the quality model, user study of the quality model as part of the MLTE tool, and development of a data quality model using the same methodology. (ML: 0.98)👍👎
Next steps for this work include larger-scale evaluation of the quality model, user study of the quality model as part of the MLTE tool, and development of a data quality model using the same methodology. (ML: 0.98)👍👎
The proposed quality model for ML components is the first empirically developed and validated quality model that focuses on testable quality attributes for ML components. (ML: 0.95)👍👎
Small sample size of survey participants (ML: 0.95)👍👎
The proposed quality model for ML components is a valuable resource for developers and testers to ensure the quality of ML components. (ML: 0.94)👍👎
ML components: Machine learning components are software components that contain machine learning models. (ML: 0.94)👍👎
The model has been successfully integrated into MLTE, an open source tool for test and evaluation of ML components as a resource to guide elicitation and negotiation of system-derived requirements. (ML: 0.91)👍👎
The model has been successfully integrated into MLTE, an open source tool for test and evaluation of ML components. (ML: 0.90)👍👎
Quality attributes: Quality attributes are characteristics or properties of a system that describe its behavior, performance, or other aspects. (ML: 0.90)👍👎
System-derived requirements: System-derived requirements are requirements that are derived from the system's architecture and design. (ML: 0.80)👍👎

Abstract
Despite increased adoption and advances in machine learning (ML), there are studies showing that many ML prototypes do not reach the production stage and that testing is still largely limited to testing model properties, such as model performance, without considering requirements derived from the system it will be a part of, such as throughput, resource consumption, or robustness. This limited view of testing leads to failures in model integration, deployment, and operations. In traditional software development, quality models such as ISO 25010 provide a widely used structured framework to assess software quality, define quality requirements, and provide a common language for communication with stakeholders. A newer standard, ISO 25059, defines a more specific quality model for AI systems. However, a problem with this standard is that it combines system attributes with ML component attributes, which is not helpful for a model developer, as many system attributes cannot be assessed at the component level. In this paper, we present a quality model for ML components that serves as a guide for requirements elicitation and negotiation and provides a common vocabulary for ML component developers and system stakeholders to agree on and define system-derived requirements and focus their testing efforts accordingly. The quality model was validated through a survey in which the participants agreed with its relevance and value. The quality model has been successfully integrated into an open-source tool for ML component testing and evaluation demonstrating its practical application.

Why we are recommending this paper?
Due to your Interest in Machine Learning Lifecycle

The paper’s exploration of testing ML components aligns with the user's interests in ML validation and MLOps. It addresses the critical issue of moving ML prototypes from research to production, a key concern for operationalizing ML models.

Comparative Insights on Adversarial Machine Learning from Industry and Academia: A User-Study Approach

Carnegie Mellon University

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

CTF challenges can be an effective way to learn about Machine Learning (ML) concepts. (ML: 0.97)👍👎
However, more targeted integration of ML vulnerabilities in CTF-style learning is needed to engage participants more effectively. (ML: 0.97)👍👎
Participants reported increased recognition of the connection between ML and security after attempting CTF challenges. (ML: 0.97)👍👎
Most participants found the first challenge between neutral and easy, while the second challenge was considered difficult. (ML: 0.95)👍👎
The study's results suggest that CTF challenges can be a useful tool for AML education, but further research is necessary to improve their effectiveness. (ML: 0.94)👍👎
CTF challenges can be an effective way to learn about ML concepts and their connection to security. (ML: 0.92)👍👎
Only 12 out of 15 participants were able to solve the first challenge and find the flag, but none solved the second challenge. (ML: 0.90)👍👎
Likert scale: A rating scale used in surveys to measure opinions or attitudes, with options ranging from strongly disagree to strongly agree. (ML: 0.88)👍👎
AML: Adversarial Machine Learning, a subfield of ML that focuses on developing techniques to defend against attacks on ML models. (ML: 0.88)👍👎
CTF: Capture The Flag, a type of cybersecurity competition where participants must solve challenges to capture flags. (ML: 0.59)👍👎

Abstract
An exponential growth of Machine Learning and its Generative AI applications brings with it significant security challenges, often referred to as Adversarial Machine Learning (AML). In this paper, we conducted two comprehensive studies to explore the perspectives of industry professionals and students on different AML vulnerabilities and their educational strategies. In our first study, we conducted an online survey with professionals revealing a notable correlation between cybersecurity education and concern for AML threats. For our second study, we developed two CTF challenges that implement Natural Language Processing and Generative AI concepts and demonstrate a poisoning attack on the training data set. The effectiveness of these challenges was evaluated by surveying undergraduate and graduate students at Carnegie Mellon University, finding that a CTF-based approach effectively engages interest in AML threats. Based on the responses of the participants in our research, we provide detailed recommendations emphasizing the critical need for integrated security education within the ML curriculum.

Why we are recommending this paper?
Due to your Interest in Machine Learning Infrastructure

Given the user's interest in fault tolerance, resilience, and adversarial ML, this paper offers valuable insights from industry and academic perspectives. Understanding the challenges and approaches to adversarial attacks is essential for building robust ML systems.

Turning mechanistic models into forecasters by using machine learning

University of Alberta

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

Sparse Regression (STRR): A method for identifying the most relevant features in a dataset by selecting a subset of the original features. (ML: 0.95)👍👎
The paper presents experimental results on two datasets: SIR (Susceptible-Infected-Recovered) and CR (Consumer-Retailer). (ML: 0.95)👍👎
Mean Absolute Error (MAE): A measure of the average difference between predicted and actual values, used to evaluate the performance of a model. (ML: 0.92)👍👎
Random Forests (RF): An ensemble learning method that combines multiple decision trees to improve predictive accuracy and robustness. (ML: 0.91)👍👎
Fixed-Parameter Models: Models with constant parameters, which can be less accurate than time-varying parameter models when the system dynamics are non-stationary. (ML: 0.88)👍👎
The research paper discusses a new approach to modeling complex systems using a combination of sparse regression and random forests. (ML: 0.87)👍👎
The authors propose a method called STRR+RF, which uses sparse regression to identify the most relevant features in the data and then applies random forests to model the system dynamics. (ML: 0.85)👍👎
The time-varying parameter model outperformed the fixed-parameter model in both datasets, achieving lower mean absolute errors (MAEs) for the susceptible and infected populations. (ML: 0.85)👍👎
Time-Varying Parameters: Model parameters that change over time, allowing for more flexible and accurate modeling of complex systems. (ML: 0.80)👍👎
The authors also provide theoretical results on the advantages of using time-varying parameters over fixed parameters, showing that the former can represent coefficient trajectories more accurately and admit tighter worst-case bounds on finite-horizon forecast errors. (ML: 0.67)👍👎

Abstract
The equations of complex dynamical systems may not be identified by expert knowledge, especially if the underlying mechanisms are unknown. Data-driven discovery methods address this challenge by inferring governing equations from time-series data using a library of functions constructed from the measured variables. However, these methods typically assume time-invariant coefficients, which limits their ability to capture evolving system dynamics. To overcome this limitation, we allow some of the parameters to vary over time, learn their temporal evolution directly from data, and infer a system of equations that incorporates both constant and time-varying parameters. We then transform this framework into a forecasting model by predicting the time-varying parameters and substituting these predictions into the learned equations. The model is validated using datasets for Susceptible-Infected-Recovered, Consumer--Resource, greenhouse gas concentration, and Cyanobacteria cell count. By dynamically adapting to temporal shifts, our proposed model achieved a mean absolute error below 3\% for learning a time series and below 6\% for forecasting up to a month ahead. We additionally compare forecasting performance against CNN-LSTM and Gradient Boosting Machine (GBM), and show that our model outperforms these methods across most datasets. Our findings demonstrate that integrating time-varying parameters into data-driven discovery of differential equations improves both modeling accuracy and forecasting performance.

Why we are recommending this paper?
Due to your Interest in Machine Learning Lifecycle

A Probabilistic Model-Checking Framework for Cognitive Assessment and Training

Universit Cte dAzur

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

In this context, it is applied to medical serious games to ensure that the game adapts correctly to user behavior and provides accurate assessments of cognitive function. (ML: 0.97)👍👎
This approach allows for personalized training and assessment of cognitive function. (ML: 0.97)👍👎
The paper discusses the use of model checking in medical serious games to assess cognitive function and provide personalized training for individuals with neurodegenerative diseases. (ML: 0.96)👍👎
The proposed framework combines model checking with probabilistic activity recognition to create a robust system for assessing cognitive function and providing personalized training in medical serious games. (ML: 0.93)👍👎
The discussion on AI and IoT in clinical medicine could be expanded to provide a more comprehensive overview of their potential benefits and challenges. (ML: 0.93)👍👎
The authors also discuss the application of artificial intelligence (AI) and Internet of Things (IoT) in clinical medicine, highlighting challenges and future directions. (ML: 0.93)👍👎
The paper may benefit from more detailed explanations and examples of the proposed framework's application in real-world scenarios. (ML: 0.92)👍👎
The use of AI and IoT in clinical medicine is discussed, highlighting the potential benefits and challenges associated with their integration into healthcare systems. (ML: 0.89)👍👎
A probabilistic activity recognition framework is proposed to analyze user behavior in serious games, enabling real-time adaptation of game complexity based on prior performance. (ML: 0.89)👍👎
Serious games: Interactive computer-based applications used for educational or therapeutic purposes, often incorporating elements of gameplay to engage users and promote learning or improvement. (ML: 0.86)👍👎
Probabilistic activity recognition: A framework for analyzing user behavior in real-time, enabling the adaptation of game complexity based on prior performance. (ML: 0.85)👍👎
Model checking: A formal verification technique used to check whether a system satisfies certain properties or not. (ML: 0.83)👍👎

Abstract
Serious games have proven to be effective tools for screening cognitive impairments and supporting diagnosis in patients with neurodegenerative diseases like Alzheimer's and Parkinson's. They also offer cognitive training benefits. According to the DSM-5 classification, cognitive disorders are categorized as Mild Neurocognitive Disorders (mild NCDs) and Major Neurocognitive Disorders (Major NCDs). In this study, we focus on three patient groups: healthy, mild NCD, and Major NCD. We employ Discrete Time Markov Chains to model the behavior exhibited by each group while interacting with serious games. By applying model-checking techniques, we can identify discrepancies between expected and actual gameplay behavior. The primary contribution of this work is a novel theoretical framework designed to assess how a practitioner's confidence level in diagnosing a patient's Alzheimer's stage evolves with each game session (diagnosis support). Additionally, we propose an experimental protocol where the difficulty of subsequent game sessions is dynamically adjusted based on the patient's observed behavior in previous sessions (training support).

Why we are recommending this paper?
Due to your Interest in Model Monitoring

Applying a Requirements-Focused Agile Management Approach for Machine Learning-Enabled Systems

PUCRio

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

Limitations include difficulties in operationalizing PerSpecML into ML backlog and reliance on experienced facilitation, as well as the stochastic nature of ML rendering accurate estimation a remaining 'unsolved pain'. (ML: 0.99)👍👎
The approach can enhance shared understanding and alignment between customer stakeholders and practitioners by making requirements, success criteria, and constraints explicit. (ML: 0.97)👍👎
LoDs: Levels of Detail, a framework for describing model maturity and progress. (ML: 0.97)👍👎
PerSpecML: A requirements-focused approach to ML project management. (ML: 0.95)👍👎
The approach can be applied not only to greenfield projects but also to the evolution of ongoing systems. (ML: 0.95)👍👎
RefineML was considered useful for managing ML-enabled systems within an agile context. (ML: 0.95)👍👎
RefineML: An agile methodology for managing machine learning (ML) enabled systems. (ML: 0.94)👍👎
Agile4MLS: An agile methodology for managing machine learning (ML) enabled systems. (ML: 0.93)👍👎
MVM: Model Versioning Management, a practice for tracking and managing model versions. (ML: 0.90)👍👎
RefineML's dual-track structure orchestrates coordination between model research and product development, integrating Agile4MLS principles with practices derived from systematic mapping. (ML: 0.88)👍👎

Abstract
Machine Learning (ML)-enabled systems challenge traditional Requirements Engineering (RE) and agile management due to data dependence, experimentation, and uncertain model behavior. Existing RE and agile practices remain poorly integrated and insufficiently tailored to these characteristics. This paper reports on the practical experience of applying RefineML, a requirements-focused approach for the continuous and agile refinement of ML-enabled systems, which integrates ML-tailored specification and agile management approaches with best practices derived from a systematic mapping study. The application context concerns an industry-academia collaboration project between PUC-Rio and EXA, a Brazilian cybersecurity company. For evaluation purposes, we applied questionnaires assessing RefineML's suitability and overall acceptance and semi-structured interviews. We applied thematic analysis to the collected qualitative data. Regarding suitability and acceptance, the results of the questionnaires indicated high perceived usefulness and intention to use. Based on the interviews, stakeholders perceived RefineML as improving communication and facilitating early feasibility assessments, as well as enabling dual-track governance of ML and software work, allowing continuous refinement of the model while evolving the overall software project. However, some limitations remain, particularly related to difficulties in operationalizing ML concerns into agile requirements and in estimating ML effort.

Why we are recommending this paper?
Due to your Interest in Machine Learning Deployment

On Computation and Reinforcement Learning

Princeton University

Rate paper: 👍 👎 ♥ Save

AI Insights

However, there are still challenges to be addressed, such as improving sample efficiency, generalizing to new environments, and scaling up to complex tasks. (ML: 0.98)👍👎
Generalization: RL models may not generalize well to new environments or tasks, requiring retraining from scratch. (ML: 0.98)👍👎
Sample efficiency: RL methods often require large amounts of data to learn effective policies, which can be time-consuming and expensive. (ML: 0.97)👍👎
Imagine you're trying to learn how to play a video game. (ML: 0.96)👍👎
Reinforcement learning is a subfield of machine learning that involves training agents to take actions in an environment to maximize a reward signal. (ML: 0.96)👍👎
That's basically what reinforcement learning is – it's a way for computers to learn from experience and get better at doing things. (ML: 0.96)👍👎
Reinforcement learning (RL) is a subfield of machine learning that involves training agents to take actions in an environment to maximize a reward signal. (ML: 0.96)👍👎
Scalability: As tasks become more complex, RL methods can struggle to scale up, leading to increased computational costs. (ML: 0.95)👍👎
As you play more, you start to notice patterns and figure out which actions lead to the best rewards (like getting extra lives or points). (ML: 0.94)👍👎
RL has been successful in various applications, including robotics, game playing, and autonomous driving. (ML: 0.91)👍👎
Recent papers have explored various aspects of RL, including model-based and model-free methods, offline RL, and transfer learning. (ML: 0.91)👍👎
Some notable results include the development of new algorithms for improving sample efficiency and generalization, as well as applications in robotics, game playing, and autonomous driving. (ML: 0.89)👍👎
Model-based RL: A type of RL where the agent learns a model of the environment and uses it to plan its actions. (ML: 0.89)👍👎
Model-free RL: A type of RL where the agent learns directly from experience without learning a model of the environment. (ML: 0.87)👍👎
You start by making random moves and seeing what happens. (ML: 0.83)👍👎
RL has made significant progress in recent years, with many state-of-the-art results achieved through model-based and model-free methods. (ML: 0.65)👍👎

Abstract
How does the amount of compute available to a reinforcement learning (RL) policy affect its learning? Can policies using a fixed amount of parameters, still benefit from additional compute? The standard RL framework does not provide a language to answer these questions formally. Empirically, deep RL policies are often parameterized as neural networks with static architectures, conflating the amount of compute and the number of parameters. In this paper, we formalize compute bounded policies and prove that policies which use more compute can solve problems and generalize to longer-horizon tasks that are outside the scope of policies with less compute. Building on prior work in algorithmic learning and model-free planning, we propose a minimal architecture that can use a variable amount of compute. Our experiments complement our theory. On a set 31 different tasks spanning online and offline RL, we show that $(1)$ this architecture achieves stronger performance simply by using more compute, and $(2)$ stronger generalization on longer-horizon test tasks compared to standard feedforward networks or deep residual network using up to 5 times more parameters.

Why we are recommending this paper?
Due to your Interest in Machine Learning Operations

Mining Generalizable Activation Functions

Google DeepMind

Rate paper: 👍 👎 ♥ Save

AI Insights

They have the potential to enhance feature learning, robustness, and generalization capabilities in neural networks. (ML: 0.94)👍👎
Lack of clear documentation and explanations for each activation function. (ML: 0.92)👍👎
Activation Functions: Mathematical operations applied element-wise to the output of each neuron in a neural network, introducing non-linearity and enabling the network to learn complex patterns. (ML: 0.91)👍👎
Out-of-Distribution (OOD) Regularization: Techniques used to improve a model's performance on unseen data by detecting and suppressing anomalous or noisy inputs. (ML: 0.90)👍👎
Other activation functions, like the Quaternion-Inspired Activation Function, introduce complex oscillations and semi-orthogonal components to enhance feature learning and robustness. (ML: 0.81)👍👎
The provided code snippet appears to be a collection of various activation functions for neural networks, each with its own unique characteristics and purposes. (ML: 0.79)👍👎
Quaternion-Inspired Activation Function: A novel activation function that introduces complex oscillations and semi-orthogonal components, enhancing feature learning and robustness. (ML: 0.71)👍👎
The provided code snippet showcases various innovative activation functions designed for specific tasks, such as OOD regularization and anomaly detection. (ML: 0.62)👍👎
Some of the activation functions, such as Fourier-Informed Spectral Gating (FISG) and Phase-Locked Entropic Repulsion (PLER), are designed specifically for Out-of-Distribution (OOD) regularization and anomaly detection. (ML: 0.60)👍👎
These activation functions leverage mathematical concepts like Fourier analysis, chaotic systems, and quaternion algebra to introduce complex oscillations and semi-orthogonal components. (ML: 0.52)👍👎

Abstract
The choice of activation function is an active area of research, with different proposals aimed at improving optimization, while maintaining expressivity. Additionally, the activation function can significantly alter the implicit inductive bias of the architecture, controlling its non-linear behavior. In this paper, in line with previous work, we argue that evolutionary search provides a useful framework for finding new activation functions, while we also make two novel observations. The first is that modern pipelines, such as AlphaEvolve, which relies on frontier LLMs as a mutator operator, allows for a much wider and flexible search space; e.g., over all possible python functions within a certain FLOP budget, eliminating the need for manually constructed search spaces. In addition, these pipelines will be biased towards meaningful activation functions, given their ability to represent common knowledge, leading to a potentially more efficient search of the space. The second observation is that, through this framework, one can target not only performance improvements but also activation functions that encode particular inductive biases. This can be done by using performance on out-of-distribution data as a fitness function, reflecting the degree to which the architecture respects the inherent structure in the data in a manner independent of distribution shifts. We carry an empirical exploration of this proposal and show that relatively small scale synthetic datasets can be sufficient for AlphaEvolve to discover meaningful activations.

Why we are recommending this paper?
Due to your Interest in Machine Learning Operations

CSLib: The Lean Computer Science Library

Amazon

Rate paper: 👍 👎 ♥ Save

AI Insights

One potential weakness of this approach is that it may be limited by the availability of high-quality training data and the complexity of the problems being addressed. (ML: 0.98)👍👎
With AI, this process becomes even more efficient and accurate. (ML: 0.98)👍👎
Imagine you're trying to solve a complex math problem, but you're not sure where to start. (ML: 0.95)👍👎
The use of AI in formal mathematical reasoning has opened up new possibilities for solving complex problems and making accurate predictions. (ML: 0.93)👍👎
Formal mathematical reasoning has been increasingly used in various fields, including computer science, mathematics, and artificial intelligence. (ML: 0.93)👍👎
Formal mathematical reasoning is like having a super-smart assistant that can help you break down the problem into smaller parts and find the solution. (ML: 0.92)👍👎
The main idea is to explore the intersection of formal mathematical reasoning and AI, highlighting its potential applications and benefits. (ML: 0.88)👍👎
The paper discusses the development and application of formal mathematical reasoning using AI. (ML: 0.88)👍👎

Abstract
We introduce CSLib, an open-source framework for proving computer-science-related theorems and writing formally verified code in the Lean proof assistant. CSLib aims to be for computer science what Lean's Mathlib is for mathematics. Mathlib has been tremendously impactful: it is a key reason for Lean's popularity within the mathematics research community, and it has also played a critical role in the training of AI systems for mathematical reasoning. However, the base of computer science knowledge in Lean is currently quite limited. CSLib will vastly enhance this knowledge base and provide infrastructure for using this knowledge in real-world verification projects. By doing so, CSLib will (1) enable the broad use of Lean in computer science education and research, and (2) facilitate the manual and AI-aided engineering of large-scale formally verified systems.

Why we are recommending this paper?
Due to your Interest in Data Science Development Tools

Unified Inference Framework for Single and Multi-Player Performative Prediction: Method and Asymptotic Optimality

Rutgers University

Rate paper: 👍 👎 ♥ Save

AI Insights

Asymptotic bias: A potential issue in naive plug-ins where the resulting estimator may have an asymptotically biased variance. (ML: 0.97)👍👎
The authors demonstrate that their proposed method can achieve semiparametric efficiency among all comparable algorithms, leveraging the results from surrogate outcomes literatures. (ML: 0.96)👍👎
The paper presents several key insights, including the use of imputed loss functions for estimating distributional parameters, the importance of consistent estimation of conditional expectations, and the potential for asymptotic bias in naive plug-ins. (ML: 0.96)👍👎
The authors propose a two-stage analysis to highlight the layered structure of plug-in performative optimization and clarify the dependencies between the two estimations. (ML: 0.94)👍👎
Conditional expectation: The expected value of a random variable given another random variable. (ML: 0.93)👍👎
Recalibrated plug-in estimation: A two-stage analysis that integrates the plug-in optimization framework with the construction idea of RePPI. (ML: 0.92)👍👎
Performative optimality: The goal is to find an estimator that minimizes the risk function under a given distributional parameter. (ML: 0.91)👍👎
Imputed loss functions: Used for estimating distributional parameters, these functions are closely related to the efficient influence function of the target distributional parameter. (ML: 0.86)👍👎
The paper also discusses the asymptotic properties of both estimators separately and demonstrates how they are interlinked in the resulting asymptotic guarantees. (ML: 0.84)👍👎
The paper discusses a novel estimation procedure for performative optimality in multi-player settings. (ML: 0.83)👍👎
The method, called recalibrated plug-in estimation, integrates the plug-in optimization framework with the construction idea of RePPI. (ML: 0.81)👍👎

Abstract
Performative prediction characterizes environments where predictive models alter the very data distributions they aim to forecast, triggering complex feedback loops. While prior research treats single-agent and multi-agent performativity as distinct phenomena, this paper introduces a unified statistical inference framework that bridges these contexts, treating the former as a special case of the latter. Our contribution is two-fold. First, we put forward the Repeated Risk Minimization (RRM) procedure for estimating the performative stability, and establish a rigorous inferential theory for admitting its asymptotic normality and confirming its asymptotic efficiency. Second, for the performative optimality, we introduce a novel two-step plug-in estimator that integrates the idea of Recalibrated Prediction Powered Inference (RePPI) with Importance Sampling, and further provide formal derivations for the Central Limit Theorems of both the underlying distributional parameters and the plug-in results. The theoretical analysis demonstrates that our estimator achieves the semiparametric efficiency bound and maintains robustness under mild distributional misspecification. This work provides a principled toolkit for reliable estimation and decision-making in dynamic, performative environments.

Why we are recommending this paper?
Due to your Interest in Online inference

Beyond the Commit: Developer Perspectives on Productivity with AI Coding Assistants

Carnegie Mellon University

Rate paper: 👍 👎 ♥ Save

AI Insights

The six factors that capture both immediate and long-term impacts of AI coding assistants are: self-sufficiency, reduced cognitive load, time savings, job satisfaction, long-term expertise, and ownership. (ML: 0.99)👍👎
The study was limited to developers who used GitHub Copilot, which may not be representative of all AI coding assistants. (ML: 0.98)👍👎
Established measures and frameworks for understanding productivity need to be rethought in the age of AI coding assistants. (ML: 0.98)👍👎
The study found high satisfaction but only modest time savings among developers using AI coding assistants. (ML: 0.98)👍👎
Understanding developer productivity in the age of AI coding assistants requires rethinking established measures and frameworks. (ML: 0.97)👍👎
AI coding assistants: tools that use artificial intelligence to assist developers with their work Developer productivity: a measure of how efficiently and effectively developers complete their tasks The study provides a foundation for more holistic evaluations of AI coding assistants for future research and industry deployments. (ML: 0.97)👍👎
A mixed-methods approach was used to understand developer productivity using AI coding assistants, including a survey and semi-structured interviews. (ML: 0.97)👍👎
Developer productivity using AI coding assistants is a complex issue that cannot be captured by single metrics alone. (ML: 0.97)👍👎

Abstract
Measuring developer productivity is a topic that has attracted attention from both academic research and industrial practice. In the age of AI coding assistants, it has become even more important for both academia and industry to understand how to measure their impact on developer productivity, and to reconsider whether earlier measures and frameworks still apply. This study analyzes the validity of different approaches to evaluating the productivity impacts of AI coding assistants by leveraging mixed-method research. At BNY Mellon, we conduct a survey with 2989 developer responses and 11 in-depth interviews. Our findings demonstrate that a multifaceted approach is needed to measure AI productivity impacts: survey results expose conflicting perspectives on AI tool usefulness, while interviews elicit six distinct factors that capture both short-term and long-term dimensions of productivity. In contrast to prior work, our factors highlight the importance of long-term metrics like technical expertise and ownership of work. We hope this work encourages future research to incorporate a broader range of human-centered factors, and supports industry in adopting more holistic approaches to evaluating developer productivity.

Why we are recommending this paper?
Due to your Interest in Data Science Development Environment and Productivity

Emergence-as-Code for Self-Governing Reliable Systems

Innopolis University

Rate paper: 👍 👎 ♥ Save

AI Insights

Intent represents the desired system behavior, Evidence captures the actual system performance, and Governance ensures that the system meets its intended behavior. (ML: 0.98)👍👎
Emergence-as-Code for Self-Governing Reliable Systems Intent: The desired system behavior. (ML: 0.94)👍👎
The EaC framework consists of three main components: Intent, Evidence, and Governance. (ML: 0.92)👍👎
Emerge-as-Code (EaC) proposes a novel approach to make end-to-end journey reliability computable from intent plus evidence. (ML: 0.90)👍👎
EaC uses a compiler-controller interface to produce governance artifacts such as alerts, rollout gates, and constrained actions from intent and evidence. (ML: 0.88)👍👎
EaC has several benefits including improved reliability, reduced downtime, and increased efficiency. (ML: 0.86)👍👎
Emergence-as-Code for Self-Governing Reliable Systems Reliability in microservices is emergent from interactions, yet current SLO practice remains mostly local. (ML: 0.81)👍👎
The EaC framework is designed to be extensible and adaptable to different use cases and environments. (ML: 0.77)👍👎

Abstract
SLO-as-code has made per-service} reliability declarative, but user experience is defined by journeys whose reliability is an emergent property of microservice topology, routing, redundancy, timeouts/fallbacks, shared failure domains, and tail amplification. As a result, journey objectives (e.g., "checkout p99 < 400 ms") are often maintained outside code and drift as the system evolves, forcing teams to either miss user expectations or over-provision and gate releases with ad-hoc heuristics. We propose Emergence-as-Code (EmaC), a vision for making journey reliability computable and governable via intent plus evidence. An EmaC spec declares journey intent (objective, control-flow operators, allowed actions) and binds it to atomic SLOs and telemetry. A runtime inference component consumes operational artifacts (e.g., tracing and traffic configuration) to synthesize a candidate journey model with provenance and confidence. From the last accepted model, the EmaC compiler/controller derives bounded journey SLOs and budgets under explicit correlation assumptions (optimistic independence vs. pessimistic shared fate), and emits control-plane artifacts (burn-rate alerts, rollout gates, action guards) that are reviewable in a Git workflow. An anonymized artifact repository provides a runnable example specification and generated outputs.

Why we are recommending this paper?
Due to your Interest in Fault tolerance

Accurate Failure Prediction in Agents Does Not Imply Effective Failure Prevention

Writer, Inc

Rate paper: 👍 👎 ♥ Save

AI Insights

Limited scope: The study focuses on a specific type of intervention (LLM critic) and a particular set of tasks (HotPotQA). (ML: 0.98)👍👎
However, interventions can also correct factual errors and retrieve missing knowledge, improving the agent's performance. (ML: 0.98)👍👎
The study highlights that accurate failure prediction does not necessarily imply effective failure prevention using interventions. (ML: 0.98)👍👎
The study explores the effectiveness of interventions in preventing failures in large language model (LLM) agents. (ML: 0.98)👍👎
Interventions can have both positive and negative effects on the agent's performance, depending on the specific task and context. (ML: 0.98)👍👎
Interventions can sometimes disrupt the agent's reasoning, leading to incorrect answers or failure to produce output. (ML: 0.97)👍👎
Success rate: the proportion of tasks completed successfully by the agent, either with or without interventions. (ML: 0.97)👍👎
Further research is needed to develop more effective intervention strategies and improve the robustness of LLM agents. (ML: 0.96)👍👎
Interventions: corrective actions taken by the LLM critic to prevent or correct failures in the agent's reasoning. (ML: 0.96)👍👎
LLM critic: a module that evaluates the agent's actions and provides feedback for improvement. (ML: 0.89)👍👎

Abstract
Proactive interventions by LLM critic models are often assumed to improve reliability, yet their effects at deployment time are poorly understood. We show that a binary LLM critic with strong offline accuracy (AUROC 0.94) can nevertheless cause severe performance degradation, inducing a 26 percentage point (pp) collapse on one model while affecting another by near zero pp. This variability demonstrates that LLM critic accuracy alone is insufficient to determine whether intervention is safe. We identify a disruption-recovery tradeoff: interventions may recover failing trajectories but also disrupt trajectories that would have succeeded. Based on this insight, we propose a pre-deployment test that uses a small pilot of 50 tasks to estimate whether intervention is likely to help or harm, without requiring full deployment. Across benchmarks, the test correctly anticipates outcomes: intervention degrades performance on high-success tasks (0 to -26 pp), while yielding a modest improvement on the high-failure ALFWorld benchmark (+2.8 pp, p=0.014). The primary value of our framework is therefore identifying when not to intervene, preventing severe regressions before deployment.

Why we are recommending this paper?
Due to your Interest in Fault tolerance

Improving Deep Learning Library Testing with Machine Learning

North Carolina State University

Rate paper: 👍 👎 ♥ Save

AI Insights

Cohen's d: A statistical measure used to calculate the effect size between two distributions. (ML: 0.99)👍👎
The ML models may not always be able to accurately predict the validity of inputs, especially when the training data contains a low ratio of valid inputs. (ML: 0.99)👍👎
ACETest+ML: An ML-based pre-filtering approach that uses machine learning models to filter out invalid inputs before calling the target API. (ML: 0.93)👍👎
ACETest+ML also showed a medium effect size (Cohen's d = 0.76) in terms of the difference between the pass rates achieved by ACETest and ACETest+ML. (ML: 0.93)👍👎
The ML-based pre-filtering approach, ACETest+ML, was able to reduce the average time taken to generate inputs by 61% compared to the traditional approach, ACETest. (ML: 0.92)👍👎
The ML models were able to correctly predict as valid 72% of bug-triggering inputs, indicating that ACETest+ML is also able to detect bugs. (ML: 0.91)👍👎
The inclusion of ML models in testing deep learning libraries can significantly improve the efficiency and effectiveness of testing. (ML: 0.89)👍👎
The use of machine learning (ML) models in testing deep learning libraries can significantly improve the efficiency and effectiveness of testing. (ML: 0.87)👍👎
ACETest+ML is a promising approach for improving the testing of deep learning libraries. (ML: 0.79)👍👎
ACETest: A traditional testing approach for deep learning libraries. (ML: 0.79)👍👎

Abstract
Deep Learning (DL) libraries like TensorFlow and Pytorch simplify machine learning (ML) model development but are prone to bugs due to their complex design. Bug-finding techniques exist, but without precise API specifications, they produce many false alarms. Existing methods to mine API specifications lack accuracy. We explore using ML classifiers to determine input validity. We hypothesize that tensor shapes are a precise abstraction to encode concrete inputs and capture relationships of the data. Shape abstraction severely reduces problem dimensionality, which is important to facilitate ML training. Labeled data are obtained by observing runtime outcomes on a sample of inputs and classifiers are trained on sets of labeled inputs to capture API constraints. Our evaluation, conducted over 183 APIs from TensorFlow and Pytorch, shows that the classifiers generalize well on unseen data with over 91% accuracy. Integrating these classifiers into the pipeline of ACETest, a SoTA bug-finding technique, improves its pass rate from ~29% to ~61%. Our findings suggest that ML-enhanced input classification is an important aid to scale DL library testing.

Why we are recommending this paper?
Due to your Interest in Machine Learning Testing

Can We Classify Flaky Tests Using Only Test Code? An LLM-Based Empirical Study

Heidelberg University

Rate paper: 👍 👎 ♥ Save

AI Insights

The study highlights the importance of considering the limitations and potential biases of LLMs when applying them to real-world problems. (ML: 0.99)👍👎
The study shows that even with advanced language models, these computers can struggle to get it right. (ML: 0.98)👍👎
Limited dataset: The study used a limited dataset, which may not be representative of all possible test scenarios. (ML: 0.98)👍👎
The paper discusses the limitations of using Large Language Models (LLMs) for classifying flaky tests. (ML: 0.98)👍👎
The paper explores the limitations of using Large Language Models (LLMs) for classifying flaky tests and discusses potential solutions to improve their accuracy. (ML: 0.97)👍👎
So, more research is needed to make them better at identifying flaky tests. (ML: 0.96)👍👎
Previous studies have shown that LLMs can be effective in certain NLP tasks, but their performance may degrade when dealing with complex or ambiguous text. (ML: 0.96)👍👎
Imagine you're trying to teach a computer to identify whether a test is flaky (i.e., it fails sometimes but not always). (ML: 0.95)👍👎
Large Language Model (LLM): A type of artificial intelligence model designed to process and generate human-like language. (ML: 0.95)👍👎
The authors suggest that more research is needed to improve the accuracy of LLM-based flaky test classification. (ML: 0.95)👍👎
The study found that LLMs are not effective in classifying flaky tests, especially when the test code is complex or has multiple dependencies. (ML: 0.94)👍👎
You'd want this computer to be super accurate, right? (ML: 0.93)👍👎
But what if the test code is really complex or has many dependencies? (ML: 0.92)👍👎
Complexity of test code: LLMs struggled with complex or multi-dependent test code, highlighting the need for more advanced techniques to handle such cases. (ML: 0.91)👍👎
Flaky test: A test that fails intermittently or unexpectedly, often due to external factors such as network issues or timing dependencies. (ML: 0.81)👍👎

Abstract
Flaky tests yield inconsistent results when they are repeatedly executed on the same code revision. They interfere with automated quality assurance of code changes and hinder efficient software testing. Previous work evaluated approaches to train machine learning models to classify flaky tests based on identifiers in the test code. However, the resulting classifiers have been shown to lack generalizability, hindering their applicability in practical environments. Recently, pre-trained Large Language Models (LLMs) have shown the capability to generalize across various tasks. Thus, they represent a promising approach to address the generalizability problem of previous approaches. In this study, we evaluated three LLMs (two general-purpose models, one code-specific model) using three prompting techniques on two benchmark datasets from prior studies on flaky test classification. Furthermore, we manually investigated 50 samples from the given datasets to determine whether classifying flaky tests based only on test code is feasible for humans. Our findings indicate that LLMs struggle to classify flaky tests given only the test code. The results of our best prompt-model combination were only marginally better than random guessing. In our manual analysis, we found that the test code does not necessarily contain sufficient information for a flakiness classification. Our findings motivate future work to evaluate LLMs for flakiness classification with additional context, for example, using retrieval-augmented generation or agentic AI.

Why we are recommending this paper?
Due to your Interest in Machine Learning Testing

Perfect Network Resilience in Polynomial Time

TU Berlin

Rate paper: 👍 👎 ♥ Save

AI Insights

The definition of perfect resilience is not explicitly stated in the problem statement. (ML: 0.87)👍👎
Skipping priority list: A function πv that takes as input a link in Ev (or ⊥ modeling that the package starts inv) and outputs a permutation of Ev. (ML: 0.86)👍👎
The problem of perfect resilience in rooted graphs involves finding a forwarding pattern that can handle link failures and still deliver packages efficiently. (ML: 0.77)👍👎
Otherwise, it is called a trap. (ML: 0.76)👍👎
The authors also proposed a new forwarding pattern called the 'right-hand rule' which they showed to be perfectly resilient for certain types of graphs. (ML: 0.73)👍👎
Perfectly resilient: A rooted graph for which a perfectly resilient forwarding pattern exists. (ML: 0.73)👍👎
The concept of perfect resilience was first introduced by [1] as a way to handle link failures in packet networks. (ML: 0.72)👍👎
The updated right-hand rule λevv for each node v and some chosen link ev for v will be perfectly resilient in both cases. (ML: 0.71)👍👎
Minimal trap: A trap that does not contain another trap as a rooted minor. (ML: 0.70)👍👎
To show that dipole outerplanar graphs and rings of outerplanar graphs are perfectly resilient, we compute an outerplanar embedding for each induced subgraph, then 'stack' these graphs on top of each other, and argue that traversing an outer face of any graph yields a solution. (ML: 0.70)👍👎
Dipole outerplanar graphs and rings of outerplanar graphs are perfectly resilient. (ML: 0.69)👍👎

Abstract
Modern communication networks support local fast rerouting mechanisms to quickly react to link failures: nodes store a set of conditional rerouting rules which define how to forward an incoming packet in case of incident link failures. The rerouting decisions at any node $v$ must rely solely on local information available at $v$: the link from which a packet arrived at $v$, the target of the packet, and the incident link failures at $v$. Ideally, such rerouting mechanisms provide perfect resilience: any packet is routed from its source to its target as long as the two are connected in the underlying graph after the link failures. Already in their seminal paper at ACM PODC '12, Feigenbaum, Godfrey, Panda, Schapira, Shenker, and Singla showed that perfect resilience cannot always be achieved. While the design of local rerouting algorithms has received much attention since then, we still lack a detailed understanding of when perfect resilience is achievable. This paper closes this gap and presents a complete characterization of when perfect resilience can be achieved. This characterization also allows us to design an $O(n)$-time algorithm to decide whether a given instance is perfectly resilient and an $O(nm)$-time algorithm to compute perfectly resilient rerouting rules whenever it is. Our algorithm is also attractive for the simple structure of the rerouting rules it uses, known as skipping in the literature: alternative links are chosen according to an ordered priority list (per in-port), where failed links are simply skipped. Intriguingly, our result also implies that in the context of perfect resilience, skipping rerouting rules are as powerful as more general rerouting rules. This partially answers a long-standing open question by Chiesa, Nikolaevskiy, Mitrovic, Gurtov, Madry, Schapira, and Shenker [IEEE/ACM Transactions on Networking, 2017] in the affirmative.

Why we are recommending this paper?
Due to your Interest in Machine Learning Resilience

Interests not found

We did not find any papers that match the below interests. Try other terms also consider if the content exists in arxiv.org.

Machine Learning Validation
MLOps

You can edit or add more interests any time.

💬 Help Shape Our Pricing

We're exploring pricing options to make this project sustainable. Take 3 minutes to share what you'd be willing to pay (if anything). Your input guides our future investment.

Share Your Feedback

Help us improve your experience!

This project is on its early stages your feedback can be pivotal on the future of the project. Let us know what you think about this week's papers and suggestions!

Give Feedback