Hi!

Your personalized paper recommendations for 26 to 30 January, 2026.

Developers in the Age of AI: Adoption, Policy, and Diffusion of AI Software Engineering Tools

University of Virginia

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

Future use intent is associated with high-confidence factors: perceived productivity in both coding and testing, and perceived quality. (ML: 0.99)👍👎
The study found a statistically significant positive correlation between perceived productivity and perceived quality, challenging the Quality Paradox. (ML: 0.98)👍👎
The study identified three distinct developer archetypes based on survey responses: Enthusiasts, Pragmatists, and Cautious. (ML: 0.97)👍👎
AI4SE: Artificial Intelligence for Software Engineering TAM: Technology Acceptance Model PQI: Perceived Quality Index HP: Hypothesis The study provides a comprehensive view of the current state of AI adoption in professional software development from the perspective of Developers. (ML: 0.97)👍👎
Nearly 75% of developers reported a net benefit to code quality from using AI tools, with the strongest impact on PQI through ubiquitous use of AI in coding and testing activities. (ML: 0.97)👍👎
Developers are extensively using AI tools and features in their development work and are perceiving productivity gains from that use. (ML: 0.97)👍👎

Abstract
The rapid advance of Generative AI into software development prompts this empirical investigation of perceptual effects on practice. We study the usage patterns of 147 professional developers, examining perceived correlates of AI tools use, the resulting productivity and quality outcomes, and developer readiness for emerging AI-enhanced development. We describe a virtuous adoption cycle where frequent and broad AI tools use are the strongest correlates of both Perceived Productivity (PP) and quality, with frequency strongest. The study finds no perceptual support for the Quality Paradox and shows that PP is positively correlated with Perceived Code Quality (PQ) improvement. Developers thus report both productivity and quality gains. High current usage, breadth of application, frequent use of AI tools for testing, and ease of use correlate strongly with future intended adoption, though security concerns remain a moderate and statistically significant barrier to adoption. Moreover, AI testing tools' adoption lags that of coding tools, opening a Testing Gap. We identify three developer archetypes (Enthusiasts, Pragmatists, Cautious) that align with an innovation diffusion process wherein the virtuous adoption cycle serves as the individual engine of progression. Our findings reveal that organizational adoption of AI tools follows such a process: Enthusiasts push ahead with tools, creating organizational success that converts Pragmatists. The Cautious are held in organizational stasis: without early adopter examples, they don't enter the virtuous adoption cycle, never accumulate the usage frequency that drives intent, and never attain high efficacy. Policy itself does not predict individuals' intent to increase usage but functions as a marker of maturity, formalizing the successful diffusion of adoption by Enthusiasts while acting as a gateway that the Cautious group has yet to reach.

Why we are recommending this paper?
Due to your Interest in Data Science Development Environment and Productivity

This paper directly addresses the user's interest in Data Science Development Tools and MLOps, exploring the adoption and impact of AI-powered tools within software development. Understanding developer behavior in this evolving landscape is crucial for effective MLOps strategies.

Achieving Productivity Gains with AI-based IDE features: A Journey at Google

Google

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

There are inter-dependencies between the items above, and to successfully and quickly land an improvement, there is often the need to make changes across multiple layers of the stack, as well as the need for an effective collaboration between people or teams involved. (ML: 0.98)👍👎
IDE: Integrated Development Environment SDLC: Software Development Lifecycle AI-assisted Code Authoring at Scale: Fine-tuning, deploying, and mixed methods evaluation ML-Enhanced Code Completion Improves Developer Productivity The authors believe that the discussion will help applied ML teams in the industry working on AI coding products with a holistic approach towards productivity improvements of software engineers. (ML: 0.97)👍👎
AI-powered software engineering features can significantly enhance developer productivity. (ML: 0.94)👍👎
The article does not provide a clear roadmap for implementing these features in other companies. (ML: 0.91)👍👎
The features discussed in this article are part of milestone 1, where AI acts as a pair programmer accelerating software engineers in some tasks. (ML: 0.90)👍👎
Long-running asynchronous agents pose novel challenges on IDE UX. (ML: 0.81)👍👎

Abstract
We discuss Google's journey in developing and refining two internal AI-based IDE features: code completion and natural-language-driven code transformation (Transform Code). We address challenges in latency, user experience and suggestion quality, all backed by rigorous experimentation. The article serves as an example of how to refine AI developer tools across the user interface, backend, and model layers, to deliver tangible productivity improvements in an enterprise setting.

Why we are recommending this paper?
Due to your Interest in Data Science Development Environment and Productivity

Coming from Google, this paper offers insights into practical applications of AI in IDEs, aligning with the user's interest in improving Data Science Development Environment and Productivity. The focus on experimentation and challenges is highly relevant to MLOps.

LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning

Ant Group

Rate paper: 👍 👎 ♥ Save

AI Insights

The method relies on the availability of high-quality LLMs as agents, which may not be feasible for all users. (ML: 0.98)👍👎
The proposed method, LLM-AutoDP, uses large language models (LLMs) as agents to iteratively generate and refine data processing (DP) strategies for fine-tuning LLMs without exposing raw training data. (ML: 0.95)👍👎
Strategy Pruning: A novel technique that eliminates redundant or low-performing strategies from consideration, further accelerating the evaluation phase. (ML: 0.94)👍👎
Parallelization: A technique used to evaluate multiple strategies in parallel, reducing the overall computation time. (ML: 0.94)👍👎
Strategy Sampling: A technique used to reduce the number of strategies evaluated during each iteration, thereby accelerating the evaluation phase. (ML: 0.90)👍👎
LLM-AutoDP: A method that uses large language models as agents to iteratively generate and refine DP strategies for fine-tuning LLMs without exposing raw training data. (ML: 0.90)👍👎
LLM-AutoDP is an effective method for automating data processing for LLM fine-tuning without exposing raw training data. (ML: 0.89)👍👎
Experiments across various datasets and models show that LLMs fine-tuned on data processed by our framework achieve over 80% win rate. (ML: 0.86)👍👎
Three key techniques are introduced to accelerate the computationally expensive evaluation phase: strategy sampling, parallelization, and a novel technique called 'strategy pruning'. (ML: 0.82)👍👎
The proposed techniques significantly reduce time consumption while preserving the effectiveness of the discovered strategies. (ML: 0.78)👍👎

Abstract
Large Language Models (LLMs) can be fine-tuned on domain-specific data to enhance their performance in specialized fields. However, such data often contains numerous low-quality samples, necessitating effective data processing (DP). In practice, DP strategies are typically developed through iterative manual analysis and trial-and-error adjustment. These processes inevitably incur high labor costs and may lead to privacy issues in high-privacy domains like healthcare due to direct human access to sensitive data. Thus, achieving automated data processing without exposing the raw data has become a critical challenge. To address this challenge, we propose LLM-AutoDP, a novel framework that leverages LLMs as agents to automatically generate and optimize data processing strategies. Our method generates multiple candidate strategies and iteratively refines them using feedback signals and comparative evaluations. This iterative in-context learning mechanism enables the agent to converge toward high-quality processing pipelines without requiring direct human intervention or access to the underlying data. To further accelerate strategy search, we introduce three key techniques: Distribution Preserving Sampling, which reduces data volume while maintaining distributional integrity; Processing Target Selection, which uses a binary classifier to identify low-quality samples for focused processing; Cache-and-Reuse Mechanism}, which minimizes redundant computations by reusing prior processing results. Results show that models trained on data processed by our framework achieve over 80% win rates against models trained on unprocessed data. Compared to AutoML baselines based on LLM agents, LLM-AutoDP achieves approximately a 65% win rate. Moreover, our acceleration techniques reduce the total searching time by up to 10 times, demonstrating both effectiveness and efficiency.

Why we are recommending this paper?
Due to your Interest in Model Monitoring

This paper tackles the critical issue of data processing within the Machine Learning Lifecycle, a key area of interest for the user. The use of LLM agents for DP directly addresses the need for efficient and scalable model fine-tuning.

Provably Reliable Classifier Guidance through Cross-entropy Error Control

Cornell University

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

Classifier guidance can fail to guarantee effective guidance if the classification error is not controlled. (ML: 0.98)👍👎
Fisher divergence: A measure of the difference between two probability distributions. (ML: 0.95)👍👎
Subsequent work has shown that score matching corresponds to minimizing the Fisher divergence over some function classes and in certain statistical models is closely related to maximum likelihood estimation. (ML: 0.94)👍👎
Maximum likelihood estimation: An approach to parameter estimation that maximizes the likelihood of observing the data given the model parameters. (ML: 0.94)👍👎
The development of score-based generative models has been a significant advancement in the field of machine learning, enabling the creation of powerful generative models that can be used for various applications. (ML: 0.92)👍👎
Further research is needed to fully understand the properties and limitations of these models, as well as their potential applications. (ML: 0.91)👍👎
Their development traces back to the work of Hyvärinen (2005) on score matching, which seeks to learn a target distribution p0(x) via estimating its score function ∇logp 0(x). (ML: 0.88)👍👎
Score-based generative models are a powerful class of generative models that leverage the principle of score matching. (ML: 0.87)👍👎
Score matching: A method for learning a target distribution p0(x) by estimating its score function ∇logp 0(x). (ML: 0.87)👍👎
Leveraging score matching, score-based generative models have emerged as a powerful class of generative models, attaining state-of-the-art results in various applications. (ML: 0.85)👍👎

Abstract
Classifier-guided diffusion models generate conditional samples by augmenting the reverse-time score with the gradient of a learned classifier, yet it remains unclear whether standard classifier training procedures yield effective diffusion guidance. We address this gap by showing that, under mild smoothness assumptions on the classifiers, controlling the cross-entropy error at each diffusion step also controls the error of the resulting guidance vectors: classifiers achieving conditional KL divergence $\varepsilon^2$ from the ground-truth conditional label probabilities induce guidance vectors with mean squared error $\widetilde{O}(d \varepsilon )$. Our result yields an upper bound on the sampling error under classifier guidance and bears resemblance to a reverse log-Sobolev-type inequality. Moreover, we show that the classifier smoothness assumption is essential, by constructing simple counterexamples demonstrating that, without it, control of the guidance vector can fail for almost all distributions. To our knowledge, our work establishes the first quantitative link between classifier training and guidance alignment, yielding both a theoretical foundation for classifier guidance and principled guidelines for classifier selection.

Why we are recommending this paper?
Due to your Interest in Machine Learning Validation

The paper's focus on classifier guidance and error control is directly relevant to the user's interest in Machine Learning Validation and Model Monitoring. It addresses a fundamental challenge in ensuring reliable inference.

Pay for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference

University of Victoria

Rate paper: 👍 👎 ♥ Save

AI Insights

SLM (Small Language Model): A smaller language model that can be queried quickly but may not have sufficient capacity to solve complex tasks. (ML: 0.96)👍👎
Outlier analysis is performed using a consensus-based approach that leverages the agreement patterns across different hint sizes, and samples with high outlier counts are filtered out to refine the dataset quality. (ML: 0.95)👍👎
LLM (Large Language Model): A larger language model that has more capacity and can solve complex tasks but requires more time and resources to query. (ML: 0.95)👍👎
Shepherding outperforms both routing and cascading in terms of cost-effectiveness, with a significant reduction in LLM calls while maintaining high accuracy. (ML: 0.92)👍👎
Shepherding: A novel query routing strategy for large language models (LLMs) that combines the strengths of both routing and cascading strategies. (ML: 0.92)👍👎
The Shepherding system is a novel approach for efficient query routing in large language models (LLMs), which combines the strengths of both routing and cascading strategies. (ML: 0.89)👍👎
The Shepherding system is trained using a discretized token budget approach, where candidate hints are evaluated at multiple granularities via token-budgeted LLM calls to obtain supervision for the hint sizes. (ML: 0.87)👍👎
Routing: A query routing strategy where a router chooses between querying an SLM or an LLM based on a binary classification decision. (ML: 0.85)👍👎
Cascading: A query routing strategy where a router queries an LLM to obtain a completion, which is then used as input for the SLM. (ML: 0.83)👍👎
The system uses an exponential moving average (EMA) of parameters for evaluation and checkpoint selection, and employs WeightedRandomSampler during training to balance the minibatch distribution. (ML: 0.83)👍👎

Abstract
Large Language Models (LLMs) deliver state-of-the-art performance on complex reasoning tasks, but their inference costs limit deployment at scale. Small Language Models (SLMs) offer dramatic cost savings yet lag substantially in accuracy. Existing approaches - routing and cascading - treat the LLM as an all-or-nothing resource: either the query bypasses the LLM entirely, or the LLM generates a complete response at full cost. We introduce LLM Shepherding, a framework that requests only a short prefix (a hint) from the LLM and provides it to SLM. This simple mechanism is surprisingly effective for math and coding tasks: even hints comprising 10-30% of the full LLM response improve SLM accuracy significantly. Shepherding generalizes both routing and cascading, and it achieves lower cost under oracle decision-making. We develop a two-stage predictor that jointly determines whether a hint is needed and how many tokens to request. On the widely-used mathematical reasoning (GSM8K, CNK12) and code generation (HumanEval, MBPP) benchmarks, Shepherding reduces costs by 42-94% relative to LLM-only inference. Compared to state-of-the-art routing and cascading baselines, shepherding delivers up to 2.8x cost reduction while matching accuracy. To our knowledge, this is the first work to exploit token-level budget control for SLM-LLM collaboration.

Why we are recommending this paper?
Due to your Interest in Online inference

This paper directly addresses the user’s interest in Online inference and Machine Learning Infrastructure by proposing a cost-efficient approach to LLM deployment. The focus on SLMs and routing aligns with optimizing inference costs.

Goal-Driven Adaptive Sampling Strategies for Machine Learning Models Predicting Fields

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

Infill criterion: a measure used to determine which samples to select for refining the surrogate model. (ML: 0.99)👍👎
Active learning: a strategy that involves selecting samples to refine a surrogate model through sequential sampling. (ML: 0.98)👍👎
The study presents a framework for uncertainty quantification in engineering applications using surrogate models. (ML: 0.95)👍👎
The framework employs an active learning strategy that refines the surrogate model through sequential sampling. (ML: 0.95)👍👎
Uncertainty quantification: the process of estimating the uncertainty associated with predictions or outcomes in a system or process. (ML: 0.93)👍👎
Surrogate model: an approximation of a complex system or process that can be used for prediction and analysis. (ML: 0.90)👍👎
Three infill strategies are compared: variance-only, coupled variance and misfit, and JSD-based coupling. (ML: 0.90)👍👎
The results show that the adaptive sampling strategies outperform non-adaptive techniques with respect to all metrics consistently. (ML: 0.86)👍👎
The SEwMisfit and JSD schemes yield similar results, while the SE approach performs worse due to not accounting for both models simultaneously. (ML: 0.85)👍👎
Quasi-Monte Carlo (QMC) methods: a class of numerical integration techniques used to estimate statistical properties of a quantity of interest. (ML: 0.76)👍👎

Abstract
Machine learning models are widely regarded as a way forward to tackle multi-query challenges that arise once expensive black-box simulations such as computational fluid dynamics are investigated. However, ensuring the desired level of accuracy for a certain task at minimal computational cost, e.g. as few black-box samples as possible, remains a challenges. Active learning strategies are used for scalar quantities to overcome this challenges and different so-called infill criteria exists and are commonly employed in several scenarios. Even though needed in various field an extension of active learning strategies towards field predictions is still lacking or limited to very specific scenarios and/or model types. In this paper we propose an active learning strategy for machine learning models that are capable if predicting field which is agnostic to the model architecture itself. For doing so, we combine a well-established Gaussian process model for a scalar reference value and simultaneously aim at reducing the epistemic model error and the difference between scalar and field predictions. Different specific forms of the above-mentioned approach are introduced and compared to each other as well as only scalar-valued based infill. Results are presented for the NASA common research model for an uncertainty propagation task showcasing high level of accuracy at significantly smaller cost compared to an approach without active learning.

Why we are recommending this paper?
Due to your Interest in Machine Learning Lifecycle

An efficient, accurate, and interpretable machine learning method for computing probability of failure

Simon Fraser University

Rate paper: 👍 👎 ♥ Save

AI Insights

PPSVMG achieves robust accuracy comparable to other state-of-the-art classification methods and has significantly lower variance than direct Monte Carlo estimates. (ML: 0.92)👍👎
POF-Darts: Geometric adaptive sampling for probability of failure Gabriel edited set: A method for editing nearest neighbor decision rules Support Vector Machine (SVM): A machine learning algorithm for classification and regression tasks PPSVMG is a powerful tool for computing the probability of failure of complex systems, with robust accuracy and significantly lower variance than direct Monte Carlo estimates. (ML: 0.92)👍👎
ThePenalized Profile Support Vector Machine (PPSVMG) is a novel machine learning methodology for computing the probability of failure of complex systems. (ML: 0.91)👍👎
The method's ability to preserve geometric integrity and increase interpretability makes it an attractive choice for applications where understanding the decision boundary is crucial. (ML: 0.86)👍👎
The method builds an approximate decision boundary consisting of linear SVMs based on clusters of Gabriel neighbors, with a penalty term designed to preserve the geometry of the true boundary. (ML: 0.85)👍👎
PPSVMG uses POF-Darts sampling to strategically allocate sample points near the decision boundary, preserving geometric integrity and increasing interpretability. (ML: 0.84)👍👎
The method requires careful tuning of hyperparameters to achieve optimal results. (ML: 0.83)👍👎

Abstract
We introduce a novel machine learning method called the Penalized Profile Support Vector Machine based on the Gabriel edited set for the computation of the probability of failure for a complex system as determined by a threshold condition on a computer model of system behavior. The method is designed to minimize the number of evaluations of the computer model while preserving the geometry of the decision boundary that determines the probability. It employs an adaptive sampling strategy designed to strategically allocate points near the boundary determining failure and builds a locally linear surrogate boundary that remains consistent with its geometry by strategic clustering of training points. We prove two convergence results and we compare the performance of the method against a number of state of the art classification methods on four test problems. We also apply the method to determine the probability of survival using the Lotka--Volterra model for competing species.

Why we are recommending this paper?
Due to your Interest in Machine Learning Lifecycle

MoCo: A One-Stop Shop for Model Collaboration Research

University of Washington

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

Transfer learning: A technique where a pre-trained model is fine-tuned on a new task or dataset, rather than training from scratch. (ML: 0.97)👍👎
Model collaboration: The process of combining multiple language models to improve their performance and efficiency. (ML: 0.97)👍👎
The use of large-scale datasets and benchmarks will remain crucial for evaluating and comparing the performance of different models. (ML: 0.96)👍👎
Multi-task learning: A technique where a single model is trained on multiple tasks simultaneously, allowing it to learn shared representations and improve performance on each task. (ML: 0.96)👍👎
The use of large-scale datasets and benchmarks is becoming increasingly important for evaluating and comparing the performance of different models. (ML: 0.96)👍👎
Researchers are continuing to explore innovative approaches to improve the performance and efficiency of language models. (ML: 0.95)👍👎
Researchers are exploring various approaches to improve the performance and efficiency of language models, including ensemble methods, transfer learning, and multi-task learning. (ML: 0.94)👍👎
Ensemble methods: Techniques used to combine the predictions or outputs of multiple models to produce a more accurate result. (ML: 0.94)👍👎
The field of model collaboration is rapidly evolving with the development of new techniques and tools. (ML: 0.91)👍👎
The field of model collaboration has made significant progress in recent years, with the development of new techniques and tools. (ML: 0.89)👍👎

Abstract
Advancing beyond single monolithic language models (LMs), recent research increasingly recognizes the importance of model collaboration, where multiple LMs collaborate, compose, and complement each other. Existing research on this topic has mostly been disparate and disconnected, from different research communities, and lacks rigorous comparison. To consolidate existing research and establish model collaboration as a school of thought, we present MoCo: a one-stop Python library of executing, benchmarking, and comparing model collaboration algorithms at scale. MoCo features 26 model collaboration methods, spanning diverse levels of cross-model information exchange such as routing, text, logit, and model parameters. MoCo integrates 25 evaluation datasets spanning reasoning, QA, code, safety, and more, while users could flexibly bring their own data. Extensive experiments with MoCo demonstrate that most collaboration strategies outperform models without collaboration in 61.0% of (model, data) settings on average, with the most effective methods outperforming by up to 25.8%. We further analyze the scaling of model collaboration strategies, the training/inference efficiency of diverse methods, highlight that the collaborative system solves problems where single LMs struggle, and discuss future work in model collaboration, all made possible by MoCo. We envision MoCo as a valuable toolkit to facilitate and turbocharge the quest for an open, modular, decentralized, and collaborative AI future.

Why we are recommending this paper?
Due to your Interest in Model Monitoring

Statsformer: Validated Ensemble Learning with LLM-Derived Semantic Priors

Stanford University

Rate paper: 👍 👎 ♥ Save

AI Insights

LLM-derived semantic priors can be used for validated ensemble learning The use of LLMs in automl has several challenges and opportunities Automl-agent: a multi-agent framework for full-pipeline automl Privileged information: additional information that is not available at test time Stacked generalization: a method for combining the predictions of multiple models Super learner: an ensemble method for combining the predictions of multiple models LLM-derived semantic priors can be used to improve the performance of automl systems The use of LLMs in automl has several challenges and opportunities, including the need for more research on their limitations and potential applications LLMs are weak learners The use of LLMs in automl has several challenges, including the need for more research on their limitations and potential applications (ML: 0.86)👍👎

Abstract
We introduce Statsformer, a principled framework for integrating large language model (LLM)-derived knowledge into supervised statistical learning. Existing approaches are limited in adaptability and scope: they either inject LLM guidance as an unvalidated heuristic, which is sensitive to LLM hallucination, or embed semantic information within a single fixed learner. Statsformer overcomes both limitations through a guardrailed ensemble architecture. We embed LLM-derived feature priors within an ensemble of linear and nonlinear learners, adaptively calibrating their influence via cross-validation. This design yields a flexible system with an oracle-style guarantee that it performs no worse than any convex combination of its in-library base learners, up to statistical error. Empirically, informative priors yield consistent performance improvements, while uninformative or misspecified LLM guidance is automatically downweighted, mitigating the impact of hallucinations across a diverse range of prediction tasks.

Why we are recommending this paper?
Due to your Interest in Machine Learning Validation

TinyTorch: Building Machine Learning Systems from First Principles

Harvard University

Rate paper: 👍 👎 ♥ Save

AI Insights

Milestones serve dual pedagogical and validation purposes, providing motivation through historical framing and demonstrating implementation correctness through real-world task performance. (ML: 0.98)👍👎
Each module concludes with systems reasoning prompts measuring conceptual understanding beyond syntactic correctness. (ML: 0.97)👍👎
Milestones are designed to be challenging but achievable, allowing students to demonstrate their understanding of complex concepts through real-world tasks. (ML: 0.96)👍👎
Assessment validates both isolated correctness and cross-module integration. (ML: 0.96)👍👎
The TinyTorch framework is designed for teaching machine learning concepts through hands-on implementation and analysis. (ML: 0.95)👍👎
Reflect: Systems Analysis Questions. (ML: 0.94)👍👎
TinyTorch follows a consistent Build-Use-Reflect cycle, integrating implementation, application, and systems reasoning to address multiple learning objectives. (ML: 0.94)👍👎
It's a pedagogical tool aimed at bridging the gap between theoretical understanding and practical application. (ML: 0.94)👍👎
Students implement components in Jupyter notebooks with scaffolded guidance. (ML: 0.91)👍👎
TinyTorch's design emphasizes systems thinking, encouraging students to analyze and understand the relationships between components, rather than just focusing on individual functions. (ML: 0.87)👍👎
The framework includes six historical milestones that recreate actual breakthroughs using exclusively student code, validating success through task-appropriate performance. (ML: 0.85)👍👎
The framework is built with a focus on explicit dependencies, making it easier for students to understand where each module fits in the larger architecture. (ML: 0.83)👍👎
Use: Integration Testing Beyond Unit Tests. (ML: 0.77)👍👎
Build: Implementation with Explicit Dependencies. (ML: 0.66)👍👎

Abstract
Machine learning education faces a fundamental gap: students learn algorithms without understanding the systems that execute them. They study gradient descent without measuring memory, attention mechanisms without analyzing O(N^2) scaling, optimizer theory without knowing why Adam requires 3x the memory of SGD. This "algorithm-systems divide" produces practitioners who can train models but cannot debug memory failures, optimize inference latency, or reason about deployment trade-offs--the very skills industry demands as "ML systems engineering." We present TinyTorch, a 20-module curriculum that closes this gap through "implementation-based systems pedagogy": students construct PyTorch's core components (tensors, autograd, optimizers, CNNs, transformers) in pure Python, building a complete framework where every operation they invoke is code they wrote. The design employs three patterns: "progressive disclosure" of complexity, "systems-first integration" of profiling from the first module, and "build-to-validate milestones" recreating 67 years of ML breakthroughs--from Perceptron (1958) through Transformers (2017) to MLPerf-style benchmarking. Requiring only 4GB RAM and no GPU, TinyTorch demonstrates that deep ML systems understanding is achievable without specialized hardware. The curriculum is available open-source at mlsysbook.ai/tinytorch.

Why we are recommending this paper?
Due to your Interest in Machine Learning Deployment

On the Computational Complexity of Performative Prediction

Carnegie Mellon University

Rate paper: 👍 👎 ♥ Save

AI Insights

Imagine you're trying to make a decision, but your actions affect the data that's available to you. (ML: 0.98)👍👎
The problem discusses the concept of performative prediction, which is a framework for analyzing the behavior of decision-makers in environments where their actions affect the underlying distribution of data. (ML: 0.97)👍👎
Performative prediction is a framework for analyzing decision-makers' behavior in environments where their actions affect the underlying distribution of data. (ML: 0.97)👍👎
The concept of performative prediction was introduced by Perdomo et al. (ML: 0.96)👍👎
This is called performative prediction. (ML: 0.95)👍👎
The problem discusses how hard it is to find a good solution in this situation and shows that it's related to another concept called Nash equilibria. (ML: 0.93)👍👎
The problem discusses the computational complexity of finding performative stable points and provides a PPAD-hardness proof for general convex sets. (ML: 0.88)👍👎
[2020] in a single-agent setting, but has since been extended to multi-player settings [Narang et al., 2023, G´ois et al., 2025]. (ML: 0.84)👍👎
The results imply a polynomial-time equivalence between the complexity of local performative optimality and pure Nash equilibria in potential games. (ML: 0.83)👍👎
The PPAD complexity class was introduced by Papadimitriou [1994] and characterizes the complexity of Nash equilibria in two-player general-sum games. (ML: 0.81)👍👎
Performative stability: A point x∗ ∈ X is (first-order) performatively stable if for all x ∈ X, it holds that ⟨x−x∗, Ez∼D(x∗)[∇xℓ(x∗;z)]⟩ ≥ 0. (ML: 0.81)👍👎
The problem does not provide a clear explanation of how to compute a performatively stable point, and the PPAD-hardness proof is complex. (ML: 0.79)👍👎

Abstract
Performative prediction captures the phenomenon where deploying a predictive model shifts the underlying data distribution. While simple retraining dynamics are known to converge linearly when the performative effects are weak ($ρ< 1$), the complexity in the regime $ρ> 1$ was hitherto open. In this paper, we establish a sharp phase transition: computing an $ε$-performatively stable point is PPAD-complete -- and thus polynomial-time equivalent to Nash equilibria in general-sum games -- even when $ρ= 1 + O(ε)$. This intractability persists even in the ostensibly simple setting with a quadratic loss function and linear distribution shifts. One of our key technical contributions is to extend this PPAD-hardness result to general convex domains, which is of broader interest in the complexity of variational inequalities. Finally, we address the special case of strategic classification, showing that computing a strategic local optimum is PLS-hard.

Why we are recommending this paper?
Due to your Interest in Machine Learning Operations

"ENERGY STAR" LLM-Enabled Software Engineering Tools

University of Colorado Colorado Springs UCCS

Rate paper: 👍 👎 ♥ Save

AI Insights

CodeCarbon: A library used for estimating and tracking carbon emissions from machine learning computing. (ML: 0.93)👍👎
The study's experimental results show that the impacts of RAG pipelines varied across the studied LLMs, with CodeLlama experiencing 25% faster inference times and substantial quality improvements. (ML: 0.92)👍👎
CodeLlama achieved 25% faster inference times and substantial quality improvements with RAG, while smaller models like GPT-2 showed mixed efficiency results despite modest energy savings. (ML: 0.91)👍👎
Prompt Engineering Techniques (PETs): The process of designing and optimizing prompts to improve the performance and energy efficiency of LLMs. (ML: 0.90)👍👎
Large Language Models (LLMs): Deep learning models that can understand, generate, and translate human language. (ML: 0.90)👍👎
The use of Retrieval-Augmented Generation (RAG) pipelines can reduce energy consumption in Large Language Model (LLM)-based code generation, but the impact varies across different LLM architectures. (ML: 0.88)👍👎
The study highlights the importance of well-designed prompts in reducing LLMs' energy consumption, with Rubei et al.'s findings confirmed that optimal prompt configurations can reduce energy usage by up to 99%. (ML: 0.87)👍👎
RAG can help smaller, more efficient models achieve competitive code generation quality, as demonstrated by GPT-2 on the Kaggle dataset matching DeepSeek Coder's performance while using approximately 3.5x less energy. (ML: 0.86)👍👎
Retrieval-Augmented Generation (RAG): A pipeline that combines retrieval and generation mechanisms to enhance the quality and efficiency of LLM-based code generation. (ML: 0.85)👍👎
There is no clear relationship between model size and achieving any RAG-based energy efficiency benefits, as only GPT-2 (the smallest in size) and CodeLlama showed energy reduction with RAG. (ML: 0.82)👍👎

Abstract
The discussion around AI-Engineering, that is, Software Engineering (SE) for AI-enabled Systems, cannot ignore a crucial class of software systems that are increasingly becoming AI-enhanced: Those used to enable or support the SE process, such as Computer-Aided SE (CASE) tools and Integrated Development Environments (IDEs). In this paper, we study the energy efficiency of these systems. As AI becomes seamlessly available in these tools and, in many cases, is active by default, we are entering a new era with significant implications for energy consumption patterns throughout the Software Development Lifecycle (SDLC). We focus on advanced Machine Learning (ML) capabilities provided by Large Language Models (LLMs). Our proposed approach combines Retrieval-Augmented Generation (RAG) with Prompt Engineering Techniques (PETs) to enhance both the quality and energy efficiency of LLM-based code generation. We present a comprehensive framework that measures real-time energy consumption and inference time across diverse model architectures ranging from 125M to 7B parameters, including GPT-2, CodeLlama, Qwen 2.5, and DeepSeek Coder. These LLMs, chosen for practical reasons, are sufficient to validate the core ideas and provide a proof of concept for more in-depth future analysis.

Why we are recommending this paper?
Due to your Interest in Data Science Development Tools

Educational Database Prototype: the Simplest of All

University of WisconsinMadison

Rate paper: 👍 👎 ♥ Save

AI Insights

A generic workflow for EduDB's benchmark has been introduced to evaluate and rank students' optimizations. (ML: 0.93)👍👎
EduDB: A simple database prototype designed for educational purposes. (ML: 0.92)👍👎
Parser: The component responsible for parsing SQL queries into an abstract syntax tree (AST). (ML: 0.92)👍👎
EduDB is a simple database prototype designed for educational purposes. (ML: 0.91)👍👎
The system consists of several components, including the parser, query executor, buffer manager, file manager, concurrency manager, and transaction manager. (ML: 0.89)👍👎
Query Executor: The component that executes the parsed query on the database. (ML: 0.84)👍👎
File Manager: Responsible for managing files and directories on disk. (ML: 0.81)👍👎
Buffer Manager: Manages the buffer cache, which stores frequently accessed data in memory. (ML: 0.80)👍👎
Concurrency Manager: Ensures that multiple transactions can be executed concurrently without conflicts. (ML: 0.79)👍👎
Transaction Manager: Manages the execution of transactions, including commit and rollback operations. (ML: 0.76)👍👎

Abstract
Database Management System (DBMS) is designed to help store and process large collections of data, and is incredibly flexible to perform various kinds of optimizations as long as it achieves serializability with a high-level interface available. The current undergraduate level DBMS course in UW-Madison (i.e., CS564) involves implementing specific modules of DB architecture, including B+ tree, but students may end up spending numerous amounts of effort on corner cases and not gaining a more comprehensive understanding of the internal design. Thus, we present EduDB, a simple database prototype for educational purposes that provides students a clean, concise, and comprehensive overview of the database system. We also attempt to develop an integrative series of course projects based on EduDB, which offers a platform for students to perform any optimization learned during the semester.

Why we are recommending this paper?
Due to your Interest in Data Science Development Tools

Distributional Active Inference

University of Southern Denmark

Rate paper: 👍 👎 ♥ Save

AI Insights

Active Inference: A framework for understanding the brain's decision-making process that involves updating internal models based on sensory input and prior knowledge. (ML: 0.98)👍👎
Active inference is a framework for understanding the brain's decision-making process. (ML: 0.97)👍👎
Active inference is a powerful tool for understanding decision-making in complex environments, and its application to DRL has shown promising results. (ML: 0.97)👍👎
Active inference can be used to learn policies in complex environments, such as robotics or autonomous driving. (ML: 0.95)👍👎
The free-energy principle has been applied to various fields, including neuroscience, psychology, and artificial intelligence. (ML: 0.95)👍👎
Free-Energy Principle: A mathematical framework that describes how the brain updates its internal model of the world based on sensory input and prior knowledge. (ML: 0.93)👍👎
The free-energy principle provides a unifying framework for understanding how the brain updates its internal model of the world based on sensory input and prior knowledge. (ML: 0.93)👍👎
Distributional Reinforcement Learning (DRL): An extension of traditional reinforcement learning that takes into account the distribution of rewards rather than just their expected value. (ML: 0.92)👍👎
Distributional reinforcement learning (DRL) is an extension of traditional reinforcement learning that takes into account the distribution of rewards rather than just their expected value. (ML: 0.89)👍👎
The free-energy principle is a mathematical framework that describes how the brain updates its internal model of the world based on sensory input and prior knowledge. (ML: 0.89)👍👎

Abstract
Optimal control of complex environments with robotic systems faces two complementary and intertwined challenges: efficient organization of sensory state information and far-sighted action planning. Because the reinforcement learning framework addresses only the latter, it tends to deliver sample-inefficient solutions. Active inference is the state-of-the-art process theory that explains how biological brains handle this dual problem. However, its applications to artificial intelligence have thus far been limited to extensions of existing model-based approaches. We present a formal abstraction of reinforcement learning algorithms that spans model-based, distributional, and model-free approaches. This abstraction seamlessly integrates active inference into the distributional reinforcement learning framework, making its performance advantages accessible without transition dynamics modeling.

Why we are recommending this paper?
Due to your Interest in Online inference

Demystifying Prediction Powered Inference

Columbia Mailman School of Public Health

Rate paper: 👍 👎 ♥ Save

AI Insights

Double-dipping (using overlapping data for both training and inference) can induce bias and overconfidence in PPI-based methods. (ML: 0.97)👍👎
Violating these assumptions can lead to biased or invalid estimates. (ML: 0.97)👍👎
The performance of PPI-based methods can be sensitive to the quality of the prediction model and the underlying data distribution. (ML: 0.97)👍👎
PPI (Prediction-based Inference) is a method that leverages prediction models to improve the efficiency of statistical inference. (ML: 0.94)👍👎
PPI-type estimators are not robust to MNAR (Missing Not At Random) mechanisms, where the probability of being labeled depends on the outcome itself. (ML: 0.93)👍👎
PPI requires three key assumptions: MCAR, MAR, and SUTVA. (ML: 0.93)👍👎
MCAR: Missing Completely At Random MAR: Missing At Random SUTVA: Stable Unit Treatment Value Assumption MNAR: Missing Not At Random PPI is a powerful tool for improving the efficiency of statistical inference, but its performance can be sensitive to the quality of the prediction model and the underlying data distribution. (ML: 0.92)👍👎

Abstract
Machine learning predictions are increasingly used to supplement incomplete or costly-to-measure outcomes in fields such as biomedical research, environmental science, and social science. However, treating predictions as ground truth introduces bias while ignoring them wastes valuable information. Prediction-Powered Inference (PPI) offers a principled framework that leverages predictions from large unlabeled datasets to improve statistical efficiency while maintaining valid inference through explicit bias correction using a smaller labeled subset. Despite its potential, the growing PPI variants and the subtle distinctions between them have made it challenging for practitioners to determine when and how to apply these methods responsibly. This paper demystifies PPI by synthesizing its theoretical foundations, methodological extensions, connections to existing statistics literature, and diagnostic tools into a unified practical workflow. Using the Mosaiks housing price data, we show that PPI variants produce tighter confidence intervals than complete-case analysis, but that double-dipping, i.e. reusing training data for inference, leads to anti-conservative confidence intervals and coverages. Under missing-not-at-random mechanisms, all methods, including classical inference using only labeled data, yield biased estimates. We provide a decision flowchart linking assumption violations to appropriate PPI variants, a summary table of selective methods, and practical diagnostic strategies for evaluating core assumptions. By framing PPI as a general recipe rather than a single estimator, this work bridges methodological innovation and applied practice, helping researchers responsibly integrate predictions into valid inference.

Why we are recommending this paper?
Due to your Interest in Machine Learning Infrastructure

Position: Certifiable State Integrity in Cyber-Physical Systems -- Why Modular Sovereignty Solves the Plasticity-Stability Paradox

UFSC Universidade Federal de Santa Catarina

Rate paper: 👍 👎 ♥ Save

AI Insights

Modular Sovereignty: The ability of an AI system to understand its own limitations and adapt to changing situations. (ML: 0.93)👍👎
The paper discusses the concept of Modular Sovereignty in AI systems, which refers to the ability of a system to understand its own limitations and adapt to changing situations. (ML: 0.90)👍👎
Kolmogorov-Arnold Networks (KANs): A type of neural network that uses interpretable univariate splines on edges rather than fixed activation functions on nodes. (ML: 0.88)👍👎
The paper highlights the importance of auditability and proposes a new approach using KANs to enhance auditability in AI systems. (ML: 0.88)👍👎
The paper highlights the importance of auditability in AI systems, particularly in safety-critical applications, and proposes a new approach using Kolmogorov-Arnold Networks (KANs) to enhance auditability. (ML: 0.86)👍👎
Neural Operators: A type of neural network that can be used to model complex physical systems and simulate their behavior. (ML: 0.83)👍👎
The authors conclude that HYDRA provides a promising approach to achieving modular sovereignty in AI systems, particularly in safety-critical applications. (ML: 0.79)👍👎
HYDRA is designed to handle regime shifts, where the underlying physics changes suddenly, and the system must adapt quickly to maintain stability. (ML: 0.69)👍👎
The authors propose a new architecture called HYDRA that combines polytopic Linear Parameter-Varying (LPV) theory with neural operators to achieve modular sovereignty. (ML: 0.68)👍👎
Polytopic Linear Parameter-Varying (LPV) theory: A mathematical framework for modeling systems with multiple operating modes or regimes. (ML: 0.58)👍👎

Abstract
The machine learning community has achieved remarkable success with universal foundation models for time-series and physical dynamics, largely overcoming earlier approximation barriers in smooth or slowly varying regimes through scale and specialized architectures. However, deploying these monolithic models in safety-critical Cyber-Physical Systems (CPS), governed by non-stationary lifecycle dynamics and strict reliability requirements, reveals persistent challenges. Recent evidence shows that fine-tuning time-series foundation models induces catastrophic forgetting, degrading performance on prior regimes. Standard models continue to exhibit residual spectral bias, smoothing high-frequency discontinuities characteristic of incipient faults, while their opacity hinders formal verification and traceability demanded by safety standards (e.g., ISO 26262, IEC 61508). This position paper argues that the plasticity-stability paradox cannot be fully resolved by global parameter updates (whether via offline fine-tuning or online adaptation). Instead, we advocate a Modular Sovereignty paradigm: a library of compact, frozen regime-specific specialists combined via uncertainty-aware blending, which we term "HYDRA" (Hierarchical uncertaintY-aware Dynamics for Rapidly-Adapting systems). This paradigm ensures regime-conditional validity, rigorous disentanglement of aleatoric and epistemic uncertainties, and modular auditability, offering a certifiable path for robust state integrity across the CPS lifecycle.

Why we are recommending this paper?
Due to your Interest in Fault tolerance

Computer Science Challenges in Quantum Computing: Early Fault-Tolerance and Beyond

UCLA

Rate paper: 👍 👎 ♥ Save

AI Insights

Benchmarks and metrics play a central role in evaluating progress in QEC. (ML: 0.96)👍👎
Logical error rate: the probability of an error occurring in a logical qubit, which can affect the accuracy of quantum computations. (ML: 0.89)👍👎
The field requires collaboration between computer scientists, mathematicians, physicists, and device engineers. (ML: 0.81)👍👎
Distributed system: a system composed of multiple interconnected nodes or processors, allowing for parallel processing and scalability. (ML: 0.81)👍👎
Quantum error correction: the process of detecting and correcting errors that occur during quantum computations. (ML: 0.78)👍👎
Modular system: a system designed to be composed of interchangeable modules or components, enabling flexibility and adaptability. (ML: 0.78)👍👎
Quantum error correction (QEC) is a crucial component for the development of large-scale quantum computers. (ML: 0.70)👍👎
Distributed and modular systems are essential for scaling up QEC due to constraints associated with power, cooling, and I/O. (ML: 0.69)👍👎
Quantum design automation (QDA) is necessary for optimizing programs for complex parameters associated with applications and physical systems. (ML: 0.58)👍👎
Quantum design automation (QDA): the use of software tools and techniques to optimize and automate the design process for quantum systems. (ML: 0.57)👍👎

Abstract
Quantum computing is entering a period in which progress will be shaped as much by advances in computer science as by improvements in hardware. The central thesis of this report is that early fault-tolerant quantum computing shifts many of the primary bottlenecks from device physics alone to computer-science-driven system design, integration, and evaluation. While large-scale, fully fault-tolerant quantum computers remain a long-term objective, near- and medium-term systems will support early fault-tolerant computation with small numbers of logical qubits and tight constraints on error rates, connectivity, latency, and classical control. How effectively such systems can be used will depend on advances across algorithms, error correction, software, and architecture. This report identifies key research challenges for computer scientists and organizes them around these four areas, each centered on a fundamental question.

Why we are recommending this paper?
Due to your Interest in Fault tolerance

Test-Time Compute Games

Max Planck Institute for Software Systems

Rate paper: 👍 👎 ♥ Save

AI Insights

The current pricing approaches in markets of LLMs-as-a-service incentivize providers to set their level of test-time compute in a way that is not aligned with social welfare. (ML: 0.96)👍👎
LLM (Large Language Model): a type of artificial intelligence model that can process and generate human-like language. (ML: 0.94)👍👎
test-time compute games game-theoretic model LLM providers strategically select test-time compute level social welfare not aligned with provider incentives forward-looking market based on reverse second-price auction aligns provider incentives with social welfare test-time compute game: a game-theoretic model in which LLM providers compete to serve user queries and maximize their profit by strategically selecting the level of test-time compute used by their model. (ML: 0.92)👍👎
A forward-looking market based on a reverse second-price auction can align provider incentives with social welfare. (ML: 0.88)👍👎

Abstract
Test-time compute has emerged as a promising strategy to enhance the reasoning abilities of large language models (LLMs). However, this strategy has in turn increased how much users pay cloud-based providers offering LLM-as-a-service, since providers charge users for the amount of test-time compute they use to generate an output. In our work, we show that the market of LLM-as-a-service is socially inefficient: providers have a financial incentive to increase the amount of test-time compute, even if this increase contributes little to the quality of the outputs. To address this inefficiency, we introduce a reverse second-price auction mechanism where providers bid their offered price and (expected) quality for the opportunity to serve a user, and users pay proportionally to the marginal value generated by the winning provider relative to the second-highest bidder. To illustrate and complement our theoretical results, we conduct experiments with multiple instruct models from the $\texttt{Llama}$ and $\texttt{Qwen}$ families, as well as reasoning models distilled from $\texttt{DeepSeek-R1}$, on math and science benchmark datasets.

Why we are recommending this paper?
Due to your Interest in Machine Learning Testing

An Empirical Evaluation of Modern MLOps Frameworks

University of the Basque Country UPVEHU

Rate paper: 👍 👎 ♥ Save

AI Insights

MLOps: Machine Learning Operations, an emerging field that focuses on the operationalization of machine learning models in production environments. (ML: 0.95)👍👎
MLflow: An open-source platform for managing the end-to-end machine learning lifecycle, from model development to deployment. (ML: 0.93)👍👎
Kubeflow Pipelines: An open-source platform for building and deploying machine learning pipelines, based on the Kubeflow framework. (ML: 0.91)👍👎
MLflow is a strong contender for MLOps due to its comprehensive documentation, ease of installation, and flexibility in configuration. (ML: 0.88)👍👎
Metaflow: A Python library for building and managing data science workflows, with a focus on reproducibility and collaboration. (ML: 0.88)👍👎
Metaflow offers a natural instrumentation approach that minimizes intrusiveness and facilitates porting of existing models. (ML: 0.82)👍👎
Apache Airflow: A popular open-source workflow management system that can be used in MLOps scenarios. (ML: 0.82)👍👎
MLflow provides a comprehensive and well-structured official documentation, covering installation and basic use of the tracking server as well as model management in production. (ML: 0.80)👍👎
Kubeflow Pipelines requires defining each component as a Docker container or as a DSL-decorated function, which can be intrusive and add significant overhead. (ML: 0.66)👍👎
Metaflow offers a very natural instrumentation centered on the experiment's logical flow, with each step defined with the @step decorator. (ML: 0.62)👍👎

Abstract
Given the increasing adoption of AI solutions in professional environments, it is necessary for developers to be able to make informed decisions about the current tool landscape. This work empirically evaluates various MLOps (Machine Learning Operations) tools to facilitate the management of the ML model lifecycle: MLflow, Metaflow, Apache Airflow, and Kubeflow Pipelines. The tools are evaluated by assessing the criteria of Ease of installation, Configuration flexibility, Interoperability, Code instrumentation complexity, result interpretability, and Documentation when implementing two common ML scenarios: Digit classifier with MNIST and Sentiment classifier with IMDB and BERT. The evaluation is completed by providing weighted results that lead to practical conclusions on which tools are best suited for different scenarios.

Why we are recommending this paper?
Due to your Interest in MLOps

Interests not found

We did not find any papers that match the below interests. Try other terms also consider if the content exists in arxiv.org.

Machine Learning Resilience

You can edit or add more interests any time.

💬 Help Shape Our Pricing

We're exploring pricing options to make this project sustainable. Take 3 minutes to share what you'd be willing to pay (if anything). Your input guides our future investment.

Share Your Feedback

Help us improve your experience!

This project is on its early stages your feedback can be pivotal on the future of the project. Let us know what you think about this week's papers and suggestions!

Give Feedback