Hi!
Your personalized paper recommendations for 05 to 09 January 2026.
QpiAI
Abstract
Multi-step reasoning remains a key challenge for Large Language Models (LLMs), particularly in complex domains such as mathematics and creative writing. While recent approaches including ReAct, Reflexion, and Self-Refine improve reasoning through iterative refinement and reflection, they often lack structured exploration of alternative solution paths and persistent learning across problems. We propose ReTreVal (Reasoning Tree with Validation), a hybrid framework that integrates Tree-of-Thoughts exploration, self-refinement, LLM-based critique scoring, and reflexion memory to enable bounded and validated multi-step reasoning. ReTreVal constructs a structured reasoning tree with adaptive depth based on problem complexity, where each node undergoes iterative self-critique and refinement guided by explicit LLM-generated feedback. A dual validation mechanism evaluates reasoning quality, coherence, and correctness at each node while persistently storing insights from successful reasoning paths and failure patterns in a reflexion memory buffer, enabling cross-problem learning. Critique-based pruning retains only the top-k highest-scoring nodes at each level, controlling computational cost while preserving high-quality solution paths. We evaluate ReTreVal against ReAct, Reflexion, and Self-Refine across 500 mathematical problems and creative writing tasks using Qwen 2.5 7B as the underlying LLM, and demonstrate that ReTreVal consistently outperforms existing methods through its combination of structured exploration, critique-driven refinement, and cross-problem memory, making it particularly effective for tasks requiring exploratory reasoning, rigorous verification, and knowledge transfer.
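To make the pruning step concrete, here is a minimal Python sketch of a critique-pruned reasoning tree in the spirit of ReTreVal; the `expand` and `critique_score` callables stand in for LLM calls (candidate-step generation and 0-1 critique scoring) and are hypothetical, as are the depth and top-k defaults.

```python
import heapq

def retreval_search(problem, expand, critique_score, max_depth=3, top_k=3):
    """Sketch of critique-pruned tree search in the spirit of ReTreVal.

    `expand(problem, node)` and `critique_score(problem, node)` are
    hypothetical stand-ins for LLM calls.
    """
    frontier = [""]        # root: empty reasoning trace
    best = ("", 0.0)       # (trace, score)
    for _ in range(max_depth):
        children = [c for node in frontier for c in expand(problem, node)]
        if not children:
            break
        scored = [(critique_score(problem, c), c) for c in children]
        # Critique-based pruning: keep only the top-k nodes at this level.
        kept = heapq.nlargest(top_k, scored, key=lambda sc: sc[0])
        frontier = [c for _, c in kept]
        if kept[0][0] > best[1]:
            best = (kept[0][1], kept[0][0])
    return best
```

The self-refinement and reflexion-memory components would wrap this loop (refining each child before scoring, and seeding `expand` with stored insights), but the top-k pruning above is what bounds the per-level cost.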
Why are we recommending this paper?
Due to your interest in Machine Learning Validation
This paper directly addresses the challenges of multi-step reasoning in Large Language Models, a key area of interest for you. The proposed ReTreVal framework offers a potentially valuable approach to improving LLM performance, aligning with your focus on MLOps and LLM capabilities.
MIT
Abstract
While probabilistic graphical models can discover latent structure in data, their effectiveness hinges on choosing well-specified models. Identifying such models is challenging in practice, often requiring iterative checking and revision through trial and error. To this end, we propose meta-probabilistic modeling (MPM), a meta-learning algorithm that learns generative model structure directly from multiple related datasets. MPM uses a hierarchical architecture where global model specifications are shared across datasets while local parameters remain dataset-specific. For learning and inference, we propose a tractable VAE-inspired surrogate objective, and optimize it through bi-level optimization: local variables are updated analytically via coordinate ascent, while global parameters are trained with gradient-based methods. We evaluate MPM on object-centric image modeling and sequential text modeling, demonstrating that it adapts generative models to data while recovering meaningful latent representations.
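A schematic of the bi-level optimization described above, assuming PyTorch; `local_update` (the analytic coordinate-ascent step for dataset-specific variables) and `surrogate_loss` (the VAE-inspired objective) are hypothetical placeholders, not the paper's implementation.

```python
import torch

def bilevel_step(global_params, datasets, local_update, surrogate_loss, lr=1e-3):
    """One outer step of the bi-level scheme sketched in the abstract.

    `local_update(theta, data)` stands in for the analytic coordinate-ascent
    update of dataset-specific variables; `surrogate_loss` plays the role of
    the tractable VAE-inspired objective. Both are hypothetical stand-ins.
    """
    opt = torch.optim.SGD(global_params, lr=lr)
    opt.zero_grad()
    total = 0.0
    for data in datasets:
        with torch.no_grad():   # inner problem: closed-form local update
            local = local_update(global_params, data)
        total = total + surrogate_loss(global_params, local, data)
    total.backward()            # outer problem: gradient step on shared globals
    opt.step()
    return float(total)
```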
Why are we recommending this paper?
Due to your interest in Model Monitoring
Given your interest in Machine Learning Validation and Model Monitoring, this paper's focus on iterative model checking and revision is highly relevant. Its approach to identifying well-specified probabilistic models directly supports your need for robust model validation techniques.
Université Côte d'Azur
Abstract
Background: Extracting the stages that structure Machine Learning (ML) pipelines from source code is key for gaining a deeper understanding of data science practices. However, the diversity caused by the constant evolution of the ML ecosystem (e.g., algorithms, libraries, datasets) makes this task challenging. Existing approaches either depend on non-scalable manual labeling or on ML classifiers that do not properly support the diversity of the domain. These limitations highlight the need for more flexible and reliable solutions.
Objective: We evaluate whether Small Language Models (SLMs) can leverage their code understanding and classification abilities to address these limitations, and subsequently how they can advance our understanding of data science practices.
Method: We conduct a confirmatory study based on two reference works selected for their relevance to the limitations of the current state of the art. First, we compare several SLMs using Cochran's Q test. The best-performing model is then evaluated against the reference studies using two distinct McNemar's tests. We further analyze how variations in taxonomy definitions affect performance through an additional Cochran's Q test. Finally, a goodness-of-fit analysis is conducted using Pearson's chi-squared tests to compare our insights on data science practices with those from prior studies.
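For readers unfamiliar with these tests, the sketch below shows how each can be run in Python with statsmodels and SciPy; all counts are invented toy numbers, not the paper's data.

```python
import numpy as np
from scipy.stats import chisquare
from statsmodels.stats.contingency_tables import mcnemar, cochrans_q

# Per-snippet binary correctness of three candidate SLMs (toy data).
correct = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
])
print(cochrans_q(correct))   # do the SLMs differ in accuracy?

# McNemar's test: best SLM vs. a reference classifier on the same items.
#                   ref correct  ref wrong
table = np.array([[30, 9],       # SLM correct
                  [4, 7]])       # SLM wrong
print(mcnemar(table, exact=True))

# Goodness of fit: observed stage distribution vs. a prior study's.
observed = [42, 31, 17, 10]
expected = [40, 30, 20, 10]
print(chisquare(observed, f_exp=expected))
```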
Why are we recommending this paper?
Due to your interest in Machine Learning Deployment
This research aligns with your interest in Data Science Development Tools and MLOps, specifically concerning the understanding and management of ML pipelines. The ability to extract pipeline structures from source code is a critical component of effective MLOps practice.
University of California
Abstract
Bayesian inference is a popular approach to calibrating uncertainties, but it can underpredict such uncertainties when model misspecification is present, impacting its reliability to inform decision making. Recently, the statistics and machine learning communities have developed prediction-oriented inference approaches that provide better calibrated uncertainties and adapt to the level of misspecification present. However, these approaches have yet to be demonstrated in the context of complex scientific applications where phenomena of interest are governed by physics-based models. Such settings often involve single realizations of high-dimensional spatio-temporal data and nonlinear, computationally expensive parameter-to-observable maps. This work investigates variational prediction-oriented inference in problems exhibiting these relevant features; namely, we consider a polynomial model and a contaminant transport problem governed by advection-diffusion equations. The prediction-oriented loss is formulated as the log-predictive probability of the calibration data. We study the effects of increasing misspecification and noise, and we assess approximations of the predictive density using Monte Carlo sampling and component-wise kernel density estimation. A novel aspect of this work is applying prediction-oriented inference to the calibration of model-form uncertainty (MFU) representations, which are embedded physics-based modifications to the governing equations that aim to reduce (but rarely eliminate) model misspecification. The computational results demonstrate that prediction-oriented frameworks can provide better uncertainty characterizations in comparison to standard inference while also being amenable to the calibration of MFU representations.
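As a rough illustration of the prediction-oriented loss, here is a Monte Carlo estimate of the log predictive probability of the calibration data under a sampled posterior; `sample_theta` and `loglik` are hypothetical stand-ins for the parameter-to-observable map plus a noise model.

```python
import numpy as np

def log_predictive(y_cal, sample_theta, loglik, n_samples=256):
    """Monte Carlo estimate of log E_theta[p(y_cal | theta)], with theta
    drawn from the (variational) posterior. `sample_theta` and `loglik`
    are problem-specific stand-ins (e.g., an expensive parameter-to-
    observable map plus a Gaussian noise model).
    """
    logp = np.array([loglik(y_cal, sample_theta()) for _ in range(n_samples)])
    # log-mean-exp for numerical stability
    m = logp.max()
    return m + np.log(np.mean(np.exp(logp - m)))
```

For the high-dimensional settings in the paper, the abstract notes that this density is instead approximated with component-wise kernel density estimation; the log-mean-exp trick above is just the generic Monte Carlo variant.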
Why are we recommending this paper?
Due to your interest in Machine Learning Validation
The paper's focus on uncertainty characterization and calibration is directly relevant to your interest in Machine Learning Validation and Fault Tolerance. Addressing model misspecification is a crucial aspect of ensuring reliable inference and decision making.
Shanghai Jiao Tong University
Abstract
Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining when a reasoning step requires the capacity of a large model or the efficiency of a small model. Existing routing strategies either rely on local token probabilities or post-hoc verification, introducing significant inference overhead. In this work, we propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the "Aha Moment" phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold. Experiments on multiple benchmarks demonstrate that our approach significantly reduces inference latency while preserving accuracy. For instance, GlimpRouter attains a substantial 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to a standalone large model on AIME25. These results suggest a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation.
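The routing rule itself is simple enough to sketch in a few lines; the entropy threshold below is an arbitrary placeholder, not the paper's tuned setting.

```python
import math

def first_token_entropy(probs):
    """Shannon entropy (nats) of the small model's first-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def route_step(small_first_token_probs, threshold=1.0):
    """Training-free routing in the spirit of GlimpRouter: glimpse only the
    first token of the next reasoning step with the small model, and escalate
    to the large model when the initial-token entropy exceeds a threshold.
    """
    h = first_token_entropy(small_first_token_probs)
    return "large" if h > threshold else "small"

# A confident first token (low entropy) stays on the small model;
# an uncertain one (high entropy) is routed to the large model.
print(route_step([0.9, 0.05, 0.03, 0.02]))   # -> "small" (H ~ 0.43 nats)
print(route_step([0.3, 0.3, 0.2, 0.2]))      # -> "large" (H ~ 1.37 nats)
```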
Why are we recommending this paper?
Due to your interest in Online inference
This paper tackles the performance challenges of Large Reasoning Models, a key area for your interest in Machine Learning Infrastructure and MLOps. Its collaborative inference approach offers a promising route to reducing latency and computational cost, aligning with your focus on efficient inference.
Massey University
Abstract
Machine learning models are increasingly used in high-stakes domains where their predictions can actively shape the environments in which they operate, a phenomenon known as performative prediction. This dynamic, in which the deployment of the model influences the very outcome it seeks to predict, can lead to unintended consequences, including feedback loops, performance issues, and significant societal risks. While the literature in the field has grown rapidly in recent years, a socio-technical synthesis that systematises the concepts of the phenomenon and provides practical guidance has been lacking. This Systematisation of Knowledge (SoK) addresses this gap by providing a comprehensive review of the literature on performative prediction. We provide an overview of the primary mechanisms through which performativity manifests, present a typology of associated risks, and survey the solutions proposed in the literature. Our primary contribution is the "Performative Strength vs. Impact Matrix" assessment framework. This practical tool is designed to help practitioners assess the potential influence and severity of performativity on their deployed predictive models and select the appropriate level of algorithmic or human intervention.
Why are we recommending this paper?
Due to your interest in Machine Learning Lifecycle
Massachusetts Institute of Technology (MIT)
Abstract
Liquid metals are central to energy-storage and nuclear technologies, yet quantitative knowledge of their thermophysical properties remains limited. While atomistic simulations offer a route to computing liquid properties directly from atomic motion, the most accurate approach, ab initio molecular dynamics (AIMD), is computationally costly and restricted to short time and length scales. Machine learning interatomic potentials (MLPs) offer AIMD accuracy at far lower cost, but their application to liquids is limited by training datasets that inadequately sample atomic configurations, leading to unphysical force predictions and unstable trajectories. Here we introduce a physically motivated dataset-engineering strategy that constructs liquidlike training data synthetically rather than relying on AIMD configurations. The method exploits the established icosahedral short-range order of metallic liquids (twelvefold, near-close-packed local coordination) and generates "synthetic-liquid" structures by systematic perturbation of crystalline references. MLPs trained on these datasets close the sampling gaps that lead to unphysical predictions, remain numerically stable across temperatures, and reproduce experimental liquid densities, diffusivities, and melting temperatures for multiple elemental metals. The framework links atomic-scale sampling to long-term MD stability and provides a practical route to predictive modeling of liquid-phase thermophysical behavior beyond the limits of direct AIMD.
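A toy version of the synthetic-liquid idea, assuming NumPy only: perturb an fcc reference (which already has the twelvefold, near-close-packed local coordination) with Gaussian displacements and a small random cell strain. The displacement and strain magnitudes below are illustrative, not the paper's settings.

```python
import numpy as np

def synthetic_liquid(fcc_positions, cell, stdev=0.25, strain=0.02, seed=0):
    """Generate a 'synthetic-liquid' configuration by perturbing a
    crystalline (fcc) reference, loosely following the dataset-engineering
    idea above. Magnitudes are illustrative placeholders.
    """
    rng = np.random.default_rng(seed)
    # Random Gaussian displacements break long-range crystalline order...
    positions = fcc_positions + rng.normal(0.0, stdev, fcc_positions.shape)
    # ...while a small isotropic strain samples density fluctuations.
    f = 1.0 + rng.uniform(-strain, strain)
    return positions * f, cell * f

# Toy fcc reference: conventional cell with 4 atoms, lattice constant a.
a = 3.6
frac = np.array([[0, 0, 0], [0.5, 0.5, 0], [0.5, 0, 0.5], [0, 0.5, 0.5]])
pos, cell = synthetic_liquid(frac * a, np.eye(3) * a)
```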
Why are we recommending this paper?
Due to your interest in Machine Learning Lifecycle
NTT, Inc.
Abstract
One of the most important queries in knowledge compilation is weighted model counting (WMC), which has been applied to probabilistic inference on various models, such as Bayesian networks. In practical situations on inference tasks, the model's parameters have uncertainty because they are often learned from data, and thus we want to compute the degree of uncertainty in the inference outcome. One possible approach is to regard the inference outcome as a random variable by introducing distributions for the parameters and evaluate the variance of the outcome. Unfortunately, the tractability of computing such a variance is hardly known. Motivated by this, we consider the problem of computing the variance of WMC and investigate this problem's tractability. First, we derive a polynomial time algorithm to evaluate the WMC variance when the input is given as a structured d-DNNF. Second, we prove the hardness of this problem for structured DNNFs, d-DNNFs, and FBDDs, which is intriguing because the latter two allow polynomial time WMC algorithms. Finally, we show an application that measures the uncertainty in the inference of Bayesian networks. We empirically show that our algorithm can evaluate the variance of the marginal probability on real-world Bayesian networks and analyze the impact of the variances of parameters on the variance of the marginal.
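The exact polynomial-time algorithm operates on structured d-DNNFs; as a rough illustration of the quantity being computed, the sketch below estimates the variance of a tiny WMC by Monte Carlo sampling of one uncertain parameter. The brute-force counter and the Beta prior are arbitrary illustrative choices, not the paper's method.

```python
import numpy as np

def wmc(weights, models):
    """Weighted model count: sum over models of the product of literal
    weights. `weights[v]` = (w_false, w_true); `models` enumerates the
    satisfying assignments. Brute force, for illustration only.
    """
    return sum(np.prod([weights[v][int(val)] for v, val in m.items()])
               for m in models)

# Tiny formula (x OR y): three models over {x, y}.
models = [{"x": 1, "y": 0}, {"x": 0, "y": 1}, {"x": 1, "y": 1}]

# Parameters learned from data carry uncertainty: put a distribution on
# the weight of x and estimate Var[WMC] by sampling (a Monte Carlo
# stand-in for the paper's exact algorithm on structured d-DNNFs).
rng = np.random.default_rng(0)
samples = []
for _ in range(10_000):
    px = rng.beta(8, 2)   # uncertain Pr(x); Beta(8, 2) is arbitrary
    w = {"x": (1 - px, px), "y": (0.4, 0.6)}
    samples.append(wmc(w, models))
print(np.mean(samples), np.var(samples))
```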
Why are we recommending this paper?
Due to your interest in Model Monitoring
DP Technology
Abstract
Open-source scientific software is abundant, yet most tools remain difficult to compile, configure, and reuse, sustaining a small-workshop mode of scientific computing. This deployment bottleneck limits reproducibility, large-scale evaluation, and the practical integration of scientific tools into modern AI-for-Science (AI4S) and agentic workflows.
We present Deploy-Master, a one-stop agentic workflow for large-scale tool discovery, build specification inference, execution-based validation, and publication. Guided by a taxonomy spanning 90+ scientific and engineering domains, our discovery stage starts from a recall-oriented pool of over 500,000 public repositories and progressively filters it to 52,550 executable tool candidates under license- and quality-aware criteria. Deploy-Master transforms heterogeneous open-source repositories into runnable, containerized capabilities grounded in execution rather than documentation claims. In a single day, we performed 52,550 build attempts and constructed reproducible runtime environments for 50,112 scientific tools. Each successful tool is validated by a minimal executable command and registered in SciencePedia for search and reuse, enabling direct human use and optional agent-based invocation.
Beyond delivering runnable tools, we report a deployment trace at the scale of 50,000 tools, characterizing throughput, cost profiles, failure surfaces, and specification uncertainty that become visible only at scale. These results explain why scientific software remains difficult to operationalize and motivate shared, observable execution substrates as a foundation for scalable AI4S and agentic science.
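The core of execution-based validation can be sketched as "run the minimal command in its container and trust only the exit code." The image and command names below are hypothetical, and a local Docker daemon is assumed.

```python
import subprocess

def validate_tool(image, command, timeout=120):
    """Execution-based validation in the spirit of Deploy-Master: a tool
    counts as runnable only if its minimal command exits cleanly inside
    its container, regardless of what the documentation claims.
    """
    try:
        proc = subprocess.run(
            ["docker", "run", "--rm", image, *command],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "", "validation timed out"
    return proc.returncode == 0, proc.stdout, proc.stderr

# Hypothetical registry entry and minimal executable command.
ok, out, err = validate_tool("sciencepedia/example-tool:latest",
                             ["tool", "--version"])
print("validated" if ok else f"build/runtime failure:\n{err}")
```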
Why are we recommending this paper?
Due to your interest in Machine Learning Deployment
University of California San Diego
Abstract
We study when geometric simplicity of decision boundaries, used here as a notion of interpretability, can conflict with accurate approximation of axis-aligned decision trees by shallow neural networks. Decision trees induce rule-based, axis-aligned decision regions (finite unions of boxes), whereas shallow ReLU networks are typically trained as score models whose predictions are obtained by thresholding. We analyze the infinite-width, bounded-norm, single-hidden-layer ReLU class through the Radon total variation ($\mathrm{RTV}$) seminorm, which controls the geometric complexity of level sets.
We first show that the hard tree indicator $1_A$ has infinite $\mathrm{RTV}$. Moreover, two natural split-wise continuous surrogates, piecewise-linear ramp smoothing and sigmoidal (logistic) smoothing, also have infinite $\mathrm{RTV}$ in dimensions $d>1$, while Gaussian convolution yields finite $\mathrm{RTV}$ but with an explicit exponential dependence on $d$.
We then separate two goals that are often conflated: classification after thresholding (recovering the decision set) versus score learning (learning a calibrated score close to $1_A$). For classification, we construct a smooth barrier score $S_A$ with finite $\mathrm{RTV}$ whose fixed threshold $\tau=1$ exactly recovers the box. Under a mild tube-mass condition near $\partial A$, we prove an $L_1(P)$ calibration bound that decays polynomially in a sharpness parameter, along with an explicit $\mathrm{RTV}$ upper bound in terms of face measures. Experiments on synthetic unions of rectangles illustrate the resulting accuracy-complexity tradeoff and how threshold selection shifts where training lands along it.
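To see why thresholding can succeed where score matching is hard, here is a toy barrier-style score for a single box whose fixed threshold recovers the set exactly; it is an illustrative stand-in, not the paper's exact $S_A$ construction.

```python
import numpy as np

def box_indicator(x, lo, hi):
    """Hard indicator 1_A of an axis-aligned box A = prod_i [lo_i, hi_i]."""
    return np.all((x >= lo) & (x <= hi), axis=-1).astype(float)

def barrier_score(x, lo, hi, sharpness=10.0):
    """A smooth barrier-style score: S(x) >= 1 exactly when x is in the
    (closed) box, so thresholding at 1 recovers the decision set even
    though S itself is nowhere close to the hard indicator 1_A.
    """
    # Signed slack to the nearest face along each axis; positive inside.
    slack = np.minimum(x - lo, hi - x).min(axis=-1)
    return 1.0 + np.tanh(sharpness * slack)

lo, hi = np.array([0.0, 0.0]), np.array([1.0, 1.0])
x = np.array([[0.5, 0.5], [1.2, 0.5]])
print(box_indicator(x, lo, hi))           # [1. 0.]
print(barrier_score(x, lo, hi) >= 1.0)    # [ True False]
```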
Why are we recommending this paper?
Due to your interest in Machine Learning Operations
Peking University
Abstract
Chain-of-Thought (CoT) prompting improves reasoning but often produces long and redundant traces that substantially increase inference cost. We present SyncThink, a training-free and plug-and-play decoding method that reduces CoT overhead without modifying model weights. We find that answer tokens attend weakly to early reasoning and instead focus on the special token "</think>", indicating an information bottleneck. Building on this observation, SyncThink monitors the model's own reasoning-transition signal and terminates reasoning early once it appears. Experiments on GSM8K, MMLU, GPQA, and BBH across three DeepSeek-R1 distilled models show that SyncThink achieves 62.00 percent average Top-1 accuracy using 656 generated tokens and 28.68 s of latency, compared to 61.22 percent, 2141 tokens, and 92.01 s for full CoT decoding. On long-horizon tasks such as GPQA, SyncThink can further yield up to +8.1 points of absolute accuracy by preventing over-thinking.
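A schematic of the early-exit decoding loop, with the caveat that SyncThink's actual trigger is the model's internal reasoning-transition signal, not the simple marker-or-budget check used here; `generate_step` is a hypothetical callable returning the next decoded token string.

```python
def truncated_cot_decode(generate_step, max_reasoning_tokens=512,
                         end_of_think="</think>"):
    """Schematic early-exit decoding loop: generate reasoning tokens until
    the transition to answering is detected or a budget is exhausted, then
    hand off to answer generation.
    """
    trace = []
    for _ in range(max_reasoning_tokens):
        tok = generate_step(trace)
        trace.append(tok)
        if end_of_think in tok:      # transition detected: stop reasoning
            break
    else:
        trace.append(end_of_think)   # budget hit: force the transition
    return "".join(trace)
```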
Why are we recommending this paper?
Due to your interest in Online inference
Technical University of Munich
Abstract
Cat states are an important resource for fault-tolerant quantum computing, where they serve as building blocks for a variety of fault-tolerant primitives. Consequently, the ability to prepare high-quality cat states at large fault distances is essential. While optimizations for low fault distances or small numbers of qubits exist, higher fault distances can be achieved via generalized constructions with potentially suboptimal circuit sizes. In this work, we propose a cat state preparation scheme based on preparing two cat states with low-depth circuits, followed by a transversal CNOT and measurement of one of the states. This scheme prepares $w$-qubit cat states fault-tolerantly up to fault distances of $9$ using $\lceil\log_2 w\rceil+1$ depth and at most $3w-2$ CNOTs and $2w$ qubits. We discuss that the combinatorially challenging aspect of this construction is the precise wiring of the transversal CNOT and propose three methods for finding these: two based on Satisfiability Modulo Theory solving and one heuristic search based on a local repair strategy. Numerical evaluations show that our circuits achieve high fault distances while requiring fewer resources than generalized constructions.
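For orientation, the standard (non-fault-tolerant) log-depth cat-state circuit that such schemes build on can be written in a few lines of Qiskit; the paper's fault-tolerance machinery (second cat state, transversal CNOT wiring, verification measurement) is omitted here.

```python
from qiskit import QuantumCircuit

def cat_state_circuit(w):
    """Log-depth |0...0> + |1...1> preparation via CNOT fan-out doubling,
    matching the ceil(log2 w) + 1 depth quoted in the abstract. This is
    the plain, non-fault-tolerant building block only.
    """
    qc = QuantumCircuit(w)
    qc.h(0)
    spread = 1
    while spread < w:
        # Each round doubles the number of qubits carrying the cat state.
        for src in range(min(spread, w - spread)):
            qc.cx(src, src + spread)
        spread *= 2
    return qc

print(cat_state_circuit(8))   # depth 4 = log2(8) + 1, using 7 CNOTs
```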
Why are we recommending this paper?
Due to your interest in Fault tolerance
Colorado State University
AI Insights: Reliability and resiliency are intertwined system characteristics that can feed into and/or off of each other. [2]
Abstract
Resiliency has garnered attention in the management of critical infrastructure as a metric of system performance, but there are significant roadblocks to its implementation in a realistic decision-making framework. In contrast to risk and reliability, which have robust quantification approaches and undergird many regulatory approaches to system safety (e.g., "risk-informed decision-making"), resiliency is a diffuse, qualitatively-understood characteristic, often treated differently or distinctly. However, in the emerging context of highly-complex, highly-interdependent critical systems, the idea of reliability (as the probability of non-failure) may not be an appropriate metric of system health. As a result, focus is shifting towards resiliency-centered approaches that value the response to failure as much as the avoidance of failure. Supporting this approach requires a robustly-defined, quantitative understanding of resiliency. In this paper, we explore the foundations of reliability and resiliency engineering, and propose an approach to resiliency-informed decision-making bolstered by a quantitative understanding of resiliency.
Why are we recommending this paper?
Due to your interest in Fault tolerance
Stanford
Abstract
As frontier AI systems are pretrained on web-scale data, test set contamination has become a critical concern for accurately assessing their capabilities. While research has thoroughly investigated the impact of test set contamination on discriminative evaluations like multiple-choice question-answering, comparatively little research has studied the impact of test set contamination on generative evaluations. In this work, we quantitatively assess the effect of test set contamination on generative evaluations through the language model lifecycle. We pretrain language models on mixtures of web data and the MATH benchmark, sweeping model sizes and number of test set replicas contaminating the pretraining corpus; performance improves with contamination and model size. Using scaling laws, we make a surprising discovery: including even a single test set replica enables models to achieve lower loss than the irreducible error of training on the uncontaminated corpus. We then study further training: overtraining with fresh data reduces the effects of contamination, whereas supervised finetuning on the training set can either increase or decrease performance on test data, depending on the amount of pretraining contamination. Finally, at inference, we identify factors that modulate memorization: high sampling temperatures mitigate contamination effects, and longer solutions are exponentially more difficult to memorize than shorter ones, presenting a contrast with discriminative evaluations, where solutions are only a few tokens in length. By characterizing how generation and memorization interact, we highlight a new layer of complexity for trustworthy evaluation of AI systems.
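The scaling-law methodology can be illustrated by fitting a power law with an irreducible-error floor to loss curves; all numbers below are invented for illustration, and the functional form is the generic one, not necessarily the paper's exact parameterization.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, a, alpha, e_irr):
    """Loss vs. model size: power-law decay toward an irreducible error
    floor, L(N) = a * N**(-alpha) + E_irr."""
    return a * n ** (-alpha) + e_irr

# Toy losses for two pretraining corpora (invented values; the paper's
# point is that even one test-set replica can push the fitted floor
# below the clean corpus's irreducible error).
n = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
loss_clean = np.array([3.10, 2.85, 2.62, 2.48, 2.40])
loss_contam = np.array([3.05, 2.78, 2.50, 2.31, 2.18])

for name, y in [("clean", loss_clean), ("contaminated", loss_contam)]:
    (a, alpha, e_irr), _ = curve_fit(scaling_law, n, y, p0=(10.0, 0.2, 2.0))
    print(f"{name}: fitted irreducible error ~ {e_irr:.2f}")
```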
Why are we recommending this paper?
Due to your interest in Machine Learning Testing
Indian Statistical Institute, Kolkata
Abstract
Human cognition, driven by complex neurochemical processes, oscillates between imagination and reality and learns to self-correct whenever such subtle drifts lead to hallucinations or unsafe associations. In recent years, LLMs have demonstrated remarkable performance in a wide range of tasks. However, they still lack this human-like capacity to balance factuality and safety. Drawing on this resemblance, we argue that both factual and safety failures in LLMs arise from a representational misalignment in their latent activation space, rather than being entirely separate alignment issues. We hypothesize that an external network, trained to recognize these fluctuations, can selectively intervene in the model to regulate falsehood into truthfulness and unsafe output into safe output without fine-tuning the model parameters themselves. Reflecting this hypothesis, we propose ARREST (Adversarial Resilient Regulation Enhancing Safety and Truth), a unified framework that identifies and corrects drifted features, engaging both soft and hard refusals in addition to factual corrections. Our empirical results show that ARREST not only regulates misalignment but is also more versatile than RLHF-aligned models in generating soft refusals, thanks to adversarial training. We make our codebase available at https://github.com/sharanya-dasgupta001/ARREST.
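A minimal sketch of the kind of latent-space intervention ARREST describes, assuming PyTorch forward hooks on a frozen model; the regulator architecture and layer path are hypothetical stand-ins (see the authors' codebase linked above for the real implementation).

```python
import torch

def make_intervention_hook(regulator, alpha=1.0):
    """Latent-space intervention in the spirit of ARREST: an external
    'regulator' network reads a layer's hidden states and adds a
    corrective direction, leaving the base model's weights untouched.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        corrected = hidden + alpha * regulator(hidden)  # steer drifted features
        return ((corrected, *output[1:]) if isinstance(output, tuple)
                else corrected)
    return hook

# Usage sketch: attach to one transformer block of a frozen model.
# layer = model.transformer.h[12]                   # hypothetical layer path
# regulator = torch.nn.Linear(d_model, d_model)     # hypothetical regulator
# handle = layer.register_forward_hook(make_intervention_hook(regulator))
```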
Why are we recommending this paper?
Due to your interest in Machine Learning Resilience
Interests not found
We did not find any papers matching the interests below.
Try other terms, and check whether such content exists on arxiv.org.
- Data Science Development Tools
- Machine Learning Infrastructure
- Data Science Development Environment and Productivity
- MLOps
💬 Help Shape Our Pricing
We're exploring pricing options to make this project sustainable. Take 3 minutes to share what you'd be willing to pay (if anything). Your input guides our future investment.
Share Your Feedback
Help us improve your experience!
This project is in its early stages; your feedback can be pivotal to its future.
Let us know what you think about this week's papers and suggestions!
Give Feedback