Large Language Models

Towards a Unified View of Large Language Model Post-Training

Tsinghua University,2Shan

Abstract
Two major sources of training data exist for post-training modern language models: online (model-generated rollouts) data, and offline (human or other-model demonstrations) data. These two types of data are typically used by approaches like Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), respectively. In this paper, we show that these approaches are not in contradiction, but are instances of a single optimization process. We derive a Unified Policy Gradient Estimator, and present the calculations of a wide spectrum of post-training approaches as the gradient of a common objective under different data distribution assumptions and various bias-variance tradeoffs. The gradient estimator is constructed with four interchangeable parts: stabilization mask, reference policy denominator, advantage estimate, and likelihood gradient. Motivated by our theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects different training signals. HPT is designed to yield both effective exploitation of demonstration and stable exploration without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify the effectiveness of our unified theoretical framework and HPT. Across six mathematical reasoning benchmarks and two out-of-distribution suites, HPT consistently surpasses strong baselines across models of varying scales and families.

AI Insights

Unified policy gradient estimator decomposes into stabilization mask, reference policy denominator, advantage estimate, and likelihood gradient.
A trust‑region regularizer penalizes KL divergence from a fixed reference policy to enforce conservative updates.
Hybrid Post‑Training (HPT) dynamically switches between demonstration and exploration signals, preserving reasoning while boosting exploration.
Closed‑form gradients for PPO, GRPO, and others are derived with and without the trust‑region term, and ablation studies confirm HPT’s superiority on six math‑reasoning benchmarks and two out‑of‑distribution suites.
Recommended reading: Sutton & Barto, DeepMind’s Deep RL, and Schulman et al.’s PPO and TRPO papers.
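
The four-part decomposition listed above can be illustrated with a small sketch. It assumes one plausible reading of the abstract: the per-token gradient is the likelihood gradient multiplied by a scalar built from the mask, the reference policy denominator, and the advantage. The names `TokenTerms`, `gradient_coefficient`, and `clip_mask`, and the PPO-style clipping rule, are illustrative assumptions and not code from the paper.

```python
from dataclasses import dataclass


@dataclass
class TokenTerms:
    """Hypothetical container for three of the four interchangeable parts."""
    mask: float       # stabilization mask: 1.0 keeps the token, 0.0 drops it
    ref_prob: float   # reference policy denominator, pi_ref(token | context)
    advantage: float  # advantage estimate for this token


def gradient_coefficient(t: TokenTerms) -> float:
    """Scalar that multiplies the fourth part, the likelihood gradient."""
    return t.mask * t.advantage / t.ref_prob


def clip_mask(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Assumed PPO-style mask: drop tokens whose probability ratio has already
    left the clip range in the direction the advantage would push it."""
    if advantage > 0 and ratio > 1 + eps:
        return 0.0
    if advantage < 0 and ratio < 1 - eps:
        return 0.0
    return 1.0


# SFT-like instance (assumption): no mask, advantage 1, and the current policy
# itself as denominator, so coefficient * grad(pi) equals grad(log pi).
sft = TokenTerms(mask=1.0, ref_prob=0.25, advantage=1.0)

# PPO-like instance (assumption): old-policy probability as denominator.
ratio = 0.55 / 0.5
ppo = TokenTerms(mask=clip_mask(ratio, 2.0), ref_prob=0.5, advantage=2.0)

print(gradient_coefficient(sft), gradient_coefficient(ppo))
```

Under this reading, switching between demonstration and rollout signals amounts to swapping the three scalar parts while the likelihood gradient stays the same.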

September 04, 2025

Structure-Learnable Adapter Fine-Tuning for Parameter-Efficient Large Language Models

Ming Gong University of

Abstract
This paper addresses the issues of parameter redundancy, rigid structure, and limited task adaptability in the fine-tuning of large language models. It proposes an adapter-based fine-tuning method built on a structure-learnable mechanism. By introducing differentiable gating functions and structural sparsity control variables, the method enables automatic optimization of adapter insertion points, activation paths, and module combinations. This allows the model to adjust its structure flexibly in multi-task settings to match different task characteristics. With the backbone parameters kept frozen, the method uses a structure search mechanism to guide the dynamic construction of task-specific efficient substructures during training. This significantly improves parameter utilization and representational capacity. In addition, the paper designs a set of sensitivity analysis experiments to systematically evaluate the effects of sparsity weight, noise injection ratio, and data perturbation on model performance. These experiments verify the stability and robustness of the proposed method across various multi-task natural language understanding tasks. The experimental results show that the proposed method outperforms mainstream parameter-efficient tuning techniques on multiple tasks. It achieves a better balance among accuracy, compression rate, and robustness to noise and perturbation.

AI Insights

Differentiable gating auto‑selects adapter points, making the model self‑configuring.
Sparsity variables prune unused paths, yielding a lean sub‑structure that still captures task nuances.
Dynamic routing lets each task activate a unique module combo, boosting capacity without touching the backbone.
Noise‑injection tests confirm stability even under harsh input corruption.
Learned sparsity maps reveal essential layers, enhancing interpretability.
Extending with low‑rank or cross‑modal adapters creates multimodal, resource‑efficient pipelines.
See “LongLoRA” for long‑context fine‑tuning and “Parameter‑Efficient Fine‑Tuning of Large‑Scale Pre‑Trained Language Models” for theory.
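
A minimal sketch of the gating idea described above, assuming a sigmoid gate per adapter insertion point and a penalty proportional to the sum of gate activations. The function names, the sigmoid parameterization, and the form of the penalty are assumptions for illustration; the abstract does not specify them.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_adapter_forward(hidden, adapter_delta, gate_logit):
    """Return hidden + g * adapter_delta, where g = sigmoid(gate_logit).

    `hidden` is the frozen backbone output and `adapter_delta` the adapter's
    output at one insertion point. A gate near 0 switches the adapter off.
    """
    g = sigmoid(gate_logit)
    return [h + g * d for h, d in zip(hidden, adapter_delta)]


def sparsity_penalty(gate_logits, weight):
    """Assumed regularizer: weight times the sum of gate activations, which
    pushes unneeded insertion points toward a closed gate."""
    return weight * sum(sigmoid(a) for a in gate_logits)


hidden = [1.0, 2.0]
delta = [0.5, -1.0]
print(gated_adapter_forward(hidden, delta, gate_logit=0.0))    # half-open gate
print(gated_adapter_forward(hidden, delta, gate_logit=-50.0))  # nearly closed
print(sparsity_penalty([0.0, 0.0], weight=0.1))
```

Because the gate is differentiable, the gate logits can be trained alongside the adapter weights while the backbone stays frozen.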

September 03, 2025

Deep Learning

Unveiling the Role of Data Uncertainty in Tabular Deep Learning

HSE University, Yandex

Abstract
Recent advancements in tabular deep learning have demonstrated exceptional practical performance, yet the field often lacks a clear understanding of why these techniques actually succeed. To address this gap, our paper highlights the importance of the concept of data uncertainty for explaining the effectiveness of the recent tabular DL methods. In particular, we reveal that the success of many beneficial design choices in tabular DL, such as numerical feature embeddings, retrieval-augmented models and advanced ensembling strategies, can be largely attributed to their implicit mechanisms for managing high data uncertainty. By dissecting these mechanisms, we provide a unifying understanding of the recent performance improvements. Furthermore, the insights derived from this data-uncertainty perspective directly allowed us to develop more effective numerical feature embeddings as an immediate practical outcome of our analysis. Overall, our work paves the way to foundational understanding of the benefits introduced by modern tabular methods that results in the concrete advancements of existing techniques and outlines future research directions for tabular DL.

AI Insights

Swapping Bayesian, MC‑Dropout, or ensemble uncertainty estimators leaves the MSE trend unchanged across datasets.
Figures show the performance gap between baseline and advanced tabular models is invariant to the uncertainty technique.
This invariance confirms conclusions are not artifacts of a specific uncertainty model.
Authors assume uncertainty estimators are accurate, which may fail in low‑sample or noisy regimes.
Data quality and sampling bias were not modeled, leaving room for future robust preprocessing work.
Recommended resources include “Bayesian Methods for Hackers” and a TensorFlow uncertainty tutorial.
Robustness of tabular DL hinges on design choices and fidelity of uncertainty estimates, inspiring hybrid architectures.

September 04, 2025

Comment on "Deep Regression Learning with Optimal Loss Function"

Abstract
OpenReview benefits the peer-review system by promoting transparency, openness, and collaboration. By making reviews, comments, and author responses publicly accessible, the platform encourages constructive feedback, reduces bias, and allows the research community to engage directly in the review process. This level of openness fosters higher-quality reviews, greater accountability, and continuous improvement in scholarly communication. In the statistics community, such a transparent and open review system has not traditionally existed. This lack of transparency has contributed to significant variation in the quality of published papers, even in leading journals, with some containing substantial errors in both proofs and numerical analyses. To illustrate this issue, this note examines several results from Wang, Zhou and Lin (2025) [arXiv:2309.12872; https://doi.org/10.1080/01621459.2024.2412364] and highlights potential errors in their proofs, some of which are strikingly obvious. This raises a critical question: how important are mathematical proofs in statistical journals, and how should they be rigorously verified? Addressing this question is essential not only for maintaining academic rigor but also for fostering the right attitudes toward scholarship and quality assurance in the field. A plausible approach would be for arXiv to provide an anonymous discussion section, allowing readers, whether anonymous or not, to post comments, while also giving authors the opportunity to respond.

AI Insights

Theorems 1 and 2 and Proposition 1 in Wang et al. (2025) contain algebraic errors that undermine convergence claims.
A chain‑rule misuse in Proposition 1’s gradient derivation exposes a common pitfall in high‑dimensional M‑estimation.
Minor proof mistakes can distort simulations, stressing theory‑code cross‑validation.
An anonymous arXiv discussion could serve as a live proof‑audit platform before acceptance.
Casella & Berger’s text remains essential for mastering probabilistic foundations that safeguard proofs.
Feng et al.’s score‑matching offers a robust alternative to conventional loss functions, aligning with optimality.
JASA’s reproducibility editorial echoes the push for transparent peer review.

September 03, 2025

Deep Learning Architectures

NeurStore: Efficient In-database Deep Learning Model Management System

National University of Sg

Abstract
With the prevalence of in-database AI-powered analytics, there is an increasing demand for database systems to efficiently manage the ever-expanding number and size of deep learning models. However, existing database systems typically store entire models as monolithic files or apply compression techniques that overlook the structural characteristics of deep learning models, resulting in suboptimal model storage overhead. This paper presents NeurStore, a novel in-database model management system that enables efficient storage and utilization of deep learning models. First, NeurStore employs a tensor-based model storage engine to enable fine-grained model storage within databases. In particular, we enhance the hierarchical navigable small world (HNSW) graph to index tensors, and only store additional deltas for tensors within a predefined similarity threshold to ensure tensor-level deduplication. Second, we propose a delta quantization algorithm that effectively compresses delta tensors, thus achieving a superior compression ratio with controllable model accuracy loss. Finally, we devise a compression-aware model loading mechanism, which improves model utilization performance by enabling direct computation on compressed tensors. Experimental evaluations demonstrate that NeurStore achieves superior compression ratios and competitive model loading throughput compared to state-of-the-art approaches.
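
The tensor-level deduplication and delta quantization described above can be sketched as follows. This is a simplified illustration under stated assumptions: a linear scan stands in for the enhanced HNSW index, L2 distance is the similarity measure, and deltas are quantized with a uniform step. `DeltaTensorStore` and its parameters are hypothetical names, not NeurStore's API.

```python
import math


class DeltaTensorStore:
    """Toy store: keep a full copy only for tensors far from every stored base;
    otherwise keep a quantized delta against the nearest base."""

    def __init__(self, threshold: float, step: float):
        self.threshold = threshold  # max L2 distance to reuse a base tensor
        self.step = step            # uniform quantization step for deltas
        self.bases = []             # full-precision base tensors (flat lists)
        self.entries = {}           # name -> (base index, int deltas or None)

    def _nearest(self, tensor):
        # Linear scan; the paper indexes tensors with an HNSW graph instead.
        best_idx, best_dist = None, math.inf
        for i, base in enumerate(self.bases):
            if len(base) != len(tensor):
                continue
            dist = math.dist(base, tensor)
            if dist < best_dist:
                best_idx, best_dist = i, dist
        return best_idx, best_dist

    def put(self, name, tensor):
        idx, dist = self._nearest(tensor)
        if idx is not None and dist <= self.threshold:
            base = self.bases[idx]
            deltas = [round((t - b) / self.step) for t, b in zip(tensor, base)]
            self.entries[name] = (idx, deltas)
        else:
            self.bases.append(list(tensor))
            self.entries[name] = (len(self.bases) - 1, None)

    def get(self, name):
        idx, deltas = self.entries[name]
        base = self.bases[idx]
        if deltas is None:
            return list(base)
        return [b + k * self.step for b, k in zip(base, deltas)]


store = DeltaTensorStore(threshold=0.5, step=0.01)
store.put("layer0.weight", [1.0, 2.0, 3.0])
store.put("finetuned.layer0.weight", [1.02, 1.97, 3.0])  # stored as a delta
store.put("other.weight", [10.0, 10.0, 10.0])            # stored as a new base
print(len(store.bases), store.get("finetuned.layer0.weight"))
```

In this sketch the reconstruction error per element is bounded by half the quantization step, which is one way to read "controllable model accuracy loss".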

September 03, 2025

Multimodal Learning

Robult: Leveraging Redundancy and Modality Specific Features for Robust Multimodal Learning

UIUC, US; VinUniversity

Abstract
Addressing missing modalities and limited labeled data is crucial for advancing robust multimodal learning. We propose Robult, a scalable framework designed to mitigate these challenges by preserving modality-specific information and leveraging redundancy through a novel information-theoretic approach. Robult optimizes two core objectives: (1) a soft Positive-Unlabeled (PU) contrastive loss that maximizes task-relevant feature alignment while effectively utilizing limited labeled data in semi-supervised settings, and (2) a latent reconstruction loss that ensures unique modality-specific information is retained. These strategies, embedded within a modular design, enhance performance across various downstream tasks and ensure resilience to incomplete modalities during inference. Experimental results across diverse datasets validate that Robult achieves superior performance over existing approaches in both semi-supervised learning and missing modality contexts. Furthermore, its lightweight design promotes scalability and seamless integration with existing architectures, making it suitable for real-world multimodal applications.

AI Insights

Robult’s transferability outperforms all baselines when fine‑tuned on unseen datasets.
In zero‑shot transfer, only Robult yields coherent predictions from full‑modal inputs, unlike competitors.
Pairing Robult with Geometric Contrastive Loss boosts GMC’s spatial alignment under scarce labels.
The confusion matrix shows Robult’s pseudo‑label accuracy spikes with a weighting scheme.
Mutual information between fused and unimodal embeddings is higher with Robult than with vanilla Soft PU loss.
Recommended reading: “Deep Learning” and “Pattern Recognition and Machine Learning” for theory, plus “Geometric Contrastive Loss for Multimodal Alignment” for state‑of‑the‑art methods.
Multimodal alignment here means aligning representations from different modalities so that their geometric structure is preserved.

September 03, 2025

MCIGLE: Multimodal Exemplar-Free Class-Incremental Graph Learning

Abstract
Exemplar-free class-incremental learning enables models to learn new classes over time without storing data from old ones. As multimodal graph-structured data becomes increasingly prevalent, existing methods struggle with challenges like catastrophic forgetting, distribution bias, memory limits, and weak generalization. We propose MCIGLE, a novel framework that addresses these issues by extracting and aligning multimodal graph features and applying Concatenated Recursive Least Squares for effective knowledge retention. Through multi-channel processing, MCIGLE balances accuracy and memory preservation. Experiments on public datasets validate its effectiveness and generalizability.
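
For readers unfamiliar with recursive least squares, the sketch below shows the textbook RLS update for a linear model, which processes each sample once and then discards it. It is not the paper's Concatenated Recursive Least Squares; the function names and the ridge-style initialization are assumptions chosen for illustration.

```python
def rls_init(dim, reg=1.0):
    """Zero weights and P = (reg * I)^-1, the inverse regularized
    correlation matrix before any data has been seen."""
    w = [0.0] * dim
    P = [[(1.0 / reg if i == j else 0.0) for j in range(dim)]
         for i in range(dim)]
    return w, P


def rls_update(w, P, x, y):
    """One RLS step for feature vector x and scalar target y.

    gain = P x / (1 + x^T P x)
    w   <- w + gain * (y - w^T x)
    P   <- P - gain (x^T P)
    """
    dim = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(dim)) for i in range(dim)]
    denom = 1.0 + sum(x[i] * Px[i] for i in range(dim))
    gain = [v / denom for v in Px]
    error = y - sum(w[i] * x[i] for i in range(dim))
    new_w = [w[i] + gain[i] * error for i in range(dim)]
    # P is symmetric, so the row vector x^T P equals (P x)^T.
    new_P = [[P[i][j] - gain[i] * Px[j] for j in range(dim)]
             for i in range(dim)]
    return new_w, new_P


# Stream samples from y = 2*x0 - 1*x1; no sample is stored after its update.
w, P = rls_init(dim=2, reg=1e-6)
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]):
    y = 2.0 * x[0] - 1.0 * x[1]
    w, P = rls_update(w, P, x, y)
print(w)
```

The only state carried between updates is `w` and `P`, which is why RLS-style solvers suit settings where old data cannot be kept.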

September 07, 2025

Diffusion Models

Fitting Image Diffusion Models on Video Datasets

Sungkyunkwan University

Abstract
Image diffusion models are trained on independently sampled static images. While this is the bedrock task protocol in generative modeling, capturing the temporal world through the lens of static snapshots is information-deficient by design. This limitation leads to slower convergence, limited distributional coverage, and reduced generalization. In this work, we propose a simple and effective training strategy that leverages the temporal inductive bias present in continuous video frames to improve diffusion training. Notably, the proposed method requires no architectural modification and can be seamlessly integrated into standard diffusion training pipelines. We evaluate our method on the HandCo dataset, where hand-object interactions exhibit dense temporal coherence and subtle variations in finger articulation often result in semantically distinct motions. Empirically, our method accelerates convergence by over 2× and achieves lower FID on both training and validation distributions. It also improves generative diversity by encouraging the model to capture meaningful temporal variations. We further provide an optimization analysis showing that our regularization reduces the gradient variance, which contributes to faster convergence.

September 04, 2025

Divergence-Kernel method for linear responses and diffusion models

University of California

Abstract
We derive the divergence-kernel formula for the linear response (parameter-derivative of marginal or stationary distributions) of random dynamical systems, and formally pass to the continuous-time limit. Our formula works for multiplicative and parameterized noise over any period of time; it does not require hyperbolicity. Then we derive a pathwise Monte-Carlo algorithm for linear responses. With this, we propose a forward-only diffusion generative model and test on simple problems.

AI Insights

The divergence‑kernel formula links distribution derivatives to the transfer‑operator kernel, avoiding additive‑noise assumptions.
It works even when δσ≡0, handling degenerate noise that defeats likelihood‑ratio or Malliavin methods.
The proof uses only ergodicity and a Lyapunov function, sidestepping the hyperbolicity requirement of most linear‑response results.
A pathwise Monte‑Carlo algorithm follows, enabling forward‑only sampling of diffusion generative models without backward passes.
Tests on a chaotic Lorenz‑type system and an unstable Ornstein‑Uhlenbeck process confirm the method’s accuracy.
The technique is attractive for finance, where unstable stochastic volatility models violate standard assumptions.
Future extensions to high‑dimensional stochastic PDEs could furnish engineers and physicists with a powerful sensitivity tool.

September 04, 2025

Mixture of Experts

Social Learning from Experts with Uncertain Precision

French National Research

Abstract
We study social learning from multiple experts whose precision is unknown and who care about reputation. The observer both learns a persistent state and ranks experts. In a binary baseline we characterize per-period equilibria: high types are truthful; low types distort one-sidedly with closed-form mixing around the prior. Aggregation is additive in log-likelihood ratios. Light-touch design -- evaluation windows scored by strictly proper rules or small convex deviation costs -- restores strict informativeness and delivers asymptotic efficiency under design (consistent state learning and reputation identification). A Gaussian extension yields a mimicry coefficient and linear filtering. With common shocks, GLS weights are optimal and correlation slows learning. The framework fits advisory panels, policy committees, and forecasting platforms, and yields transparent comparative statics and testable implications.

AI Insights

The authors embed a machine‑learning pipeline to quantify expert precision from noisy reports.
A public replication package and open‑source code accompany every simulation and estimation result.
Comparative‑statics show how reputation incentives reshape the mixing distribution for low‑type experts.
The Gaussian extension introduces a mimicry coefficient that captures how experts imitate each other’s signals.
Linear filtering of continuous reports yields a closed‑form estimator that is asymptotically efficient.
With common shocks, GLS weights dominate and correlation is shown to slow learning predictably.
The framework applies to advisory panels, policy committees, and forecasting platforms, offering testable predictions.
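
The statement that aggregation is additive in log-likelihood ratios can be made concrete with a simplified sketch. It assumes truthful binary reports that are conditionally independent given the state, with each expert's precision known to the observer. The paper's setting (unknown precision, reputation-driven distortion) is richer, so this only illustrates the additive structure. All function names are hypothetical.

```python
import math


def log_odds(p: float) -> float:
    return math.log(p / (1.0 - p))


def report_llr(report: int, precision: float) -> float:
    """Log-likelihood ratio (state 1 vs. state 0) of a binary report from a
    truthful expert whose report matches the state with prob. `precision`."""
    llr = math.log(precision / (1.0 - precision))
    return llr if report == 1 else -llr


def posterior(prior: float, reports) -> float:
    """Posterior probability of state 1: prior log-odds plus the sum of the
    experts' log-likelihood ratios, mapped back through the logistic function.

    `reports` is a list of (report, precision) pairs.
    """
    total = log_odds(prior) + sum(report_llr(r, q) for r, q in reports)
    return 1.0 / (1.0 + math.exp(-total))


print(posterior(0.5, [(1, 0.8)]))            # one expert says 1
print(posterior(0.5, [(1, 0.8), (0, 0.8)]))  # two equally precise experts disagree
print(posterior(0.5, [(1, 0.8), (1, 0.8)]))  # two experts agree
```

Two equally precise experts who disagree cancel exactly, and agreement compounds: both follow directly from adding log-likelihood ratios.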

September 01, 2025

MoPEQ: Mixture of Mixed Precision Quantized Experts

Argonne National Lab, and

Abstract
Large Language and Vision Models using a Mixture-of-Experts (MoE) architecture pose significant challenges for deployment due to their computational and memory demands. Mixed Precision Quantization assigns different precisions to different layers of an LLM/VLM based on layer sensitivity and importance within the model. In this work, we propose a Post-Training Quantization algorithm, MoPEQ, that assigns an optimal bit width to each expert. Our method balances accuracy and model size by analyzing each expert's sensitivity using Hessian trace approximation instead of relying on the activation frequency of the expert. This per-expert granularity approach clusters similar experts to maintain model performance while reducing memory requirements. The experimental results on VLMEvalKit benchmark datasets using the state-of-the-art VLMs DeepSeek-VL2 (-tiny, -small, and -base) and MolmoE demonstrate that our mixed precision quantized MoEs achieve competitive accuracy with substantial improvements in memory footprint compared to uniform-precision baseline methods. We perform a comprehensive study to analyze the impact of expert activation frequency and sensitivity using Hessian trace approximation at both layer-wise and model-wide expert precision allocation of 2, 3, and 4 bits to provide a thorough understanding of mixed precision quantization of VLM-MoEs.

AI Insights

SmoothQuant achieves state‑of‑the‑art accuracy in PTQ for LLMs.
ZeroQuant offers affordable PTQ for large transformers.
Knowledge distillation can complement PTQ to further reduce model size.
Hessian trace approximation provides a more accurate sensitivity metric than activation frequency.
Clustering experts by sensitivity preserves performance while cutting memory.
MoPEQ's per‑expert bit‑width allocation outperforms uniform‑precision baselines on VLMEvalKit.
The paper recommends Deep Learning by Goodfellow et al. and Large‑Scale Deep Learning by Ng for foundational knowledge.
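
The sensitivity-based allocation described above can be sketched in two steps: estimate each expert's Hessian trace, then map sensitivities to bit widths. The sketch assumes Hutchinson's estimator with Rademacher probes for the trace and a simple rank-based grouping as a stand-in for the paper's clustering. `hutchinson_trace` and `allocate_bits` are hypothetical names, and the grouping rule is an assumption.

```python
import random


def hutchinson_trace(matvec, dim, num_samples=64, seed=0):
    """Estimate trace(H) as the mean of z^T H z over random +/-1 vectors z.

    `matvec(z)` must return H @ z as a list; H is never formed explicitly,
    which is what makes the estimator usable with Hessian-vector products.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hz = matvec(z)
        total += sum(zi * hi for zi, hi in zip(z, hz))
    return total / num_samples


def allocate_bits(sensitivities, bit_choices=(2, 3, 4)):
    """Rank experts by sensitivity and split the ranking into equal groups;
    the least sensitive group gets the fewest bits (assumed rule)."""
    widths = sorted(bit_choices)
    order = sorted(range(len(sensitivities)), key=lambda i: sensitivities[i])
    bits = [0] * len(sensitivities)
    for rank, idx in enumerate(order):
        group = rank * len(widths) // len(sensitivities)
        bits[idx] = widths[group]
    return bits


# Toy experts with diagonal Hessians; each list holds the diagonal entries.
expert_diagonals = [[0.1, 0.2], [3.0, 4.0], [0.5, 0.5]]
traces = [
    hutchinson_trace(lambda z, d=d: [di * zi for di, zi in zip(d, z)], len(d))
    for d in expert_diagonals
]
print(traces, allocate_bits(traces))
```

For a diagonal Hessian every Rademacher probe returns the exact trace, so the toy example is deterministic; with a dense Hessian the estimate would vary with the number of probes.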

September 02, 2025
