🎯 Top Personalized Recommendations
University of Waterloo
Why we think this paper is great for you:
This paper directly explores Bayesian Mixture of Experts within the context of Large Language Models, which aligns perfectly with your interest in both advanced model architectures and their application to large-scale language tasks. You will find its approach to uncertainty estimation particularly insightful for improving LLM reliability.
Abstract
We present Bayesian Mixture of Experts (Bayesian-MoE), a post-hoc uncertainty estimation framework for fine-tuned large language models (LLMs) based on Mixture-of-Experts architectures. Our method applies a structured Laplace approximation to the second linear layer of each expert, enabling calibrated uncertainty estimation without modifying the original training procedure or introducing new parameters. Unlike prior approaches, which apply Bayesian inference to added adapter modules, Bayesian-MoE directly targets the expert pathways already present in MoE models, leveraging their modular design for tractable block-wise posterior estimation. We use Kronecker-factored low-rank approximations to model curvature and derive scalable estimates of predictive uncertainty and marginal likelihood. Experiments on common-sense reasoning benchmarks with Qwen1.5-MoE and DeepSeek-MoE demonstrate that Bayesian-MoE improves both expected calibration error (ECE) and negative log-likelihood (NLL) over baselines, confirming its effectiveness for reliable downstream decision-making.
AI Summary
- This method achieves calibrated uncertainty and improved predictive confidence (lower ECE and NLL) without modifying the original training procedure or introducing new parameters, unlike prior adapter-based Bayesian methods. [3]
- The approach consistently outperforms traditional baselines like MC Dropout, Checkpoint Ensembles, Deep Ensembles, and Bayesian-LoRA in calibration metrics (ECE, NLL) on common-sense reasoning benchmarks, evaluated with Qwen1.5-MoE and DeepSeek-MoE. [3]
- It demonstrates robust improvements in uncertainty quantification and maintains strong accuracy even under significant distribution shifts, indicating enhanced generalization ability. [3]
- Ablation studies reveal that earlier expert layers (Q1) contribute most significantly to calibrated uncertainty, suggesting a strategic focus for memory- or compute-constrained Bayesian deployments. [3]
- Kronecker-factored Low-Rank Approximations: A technique employed to efficiently model the curvature (Fisher Information Matrix) of expert parameters by projecting activation and gradient covariances onto lower-dimensional subspaces, enabling tractable computation of log-determinants and predictive uncertainty. [3]
- The framework offers a principled and efficient pathway to produce well-calibrated and reliable large language models by selectively Bayesianizing the most decisive parts of the MoE network. [2]
- Bayesian-MoE provides a post-hoc uncertainty estimation framework for fine-tuned LLMs by applying a structured Laplace approximation to the second linear layer of each expert. [1]
- Bayesian-MoE leverages Kronecker-factored low-rank approximations to model curvature, enabling scalable block-wise posterior estimation within the existing MoE architecture. [1]
- Bayesian Mixture of Experts (Bayesian-MoE): A post-hoc uncertainty estimation framework for fine-tuned LLMs that applies a structured Laplace approximation to the second linear layer of each expert, leveraging MoE modularity for tractable block-wise posterior estimation without adding parameters. [1]
- Structured Laplace Approximation: A method used in Bayesian-MoE to approximate the posterior distribution of expert parameters by constructing a Gaussian measure on the tangent space around the MAP estimate, with the Hessian approximated by a block-diagonal Kronecker-factored empirical Fisher information matrix. [1]
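To make the core mechanism above concrete, here is a minimal sketch of a Kronecker-factored Laplace posterior over a single expert's second linear layer, in the spirit of the method the abstract describes. The function names, shapes, and the eigenbasis sampling shortcut are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a Kronecker-factored Laplace posterior for one expert's
# second linear layer. Names and shapes are illustrative, not the paper's code.
import torch

def kfac_factors(acts, grads):
    """acts: (N, d_in) layer inputs; grads: (N, d_out) output gradients."""
    N = acts.shape[0]
    A = acts.T @ acts / N          # (d_in, d_in) activation covariance
    G = grads.T @ grads / N        # (d_out, d_out) gradient covariance
    return A, G

def sample_weight(W_map, A, G, prior_prec=1.0, n=1):
    """Draw samples from approx. N(W_map, (G kron A + prior)^-1) via eigenbases."""
    eva, Ua = torch.linalg.eigh(A)          # A = Ua diag(eva) Ua^T
    evg, Ug = torch.linalg.eigh(G)
    # Per-mode posterior precision (standard K-FAC Laplace approximation):
    prec = evg[:, None] * eva[None, :] + prior_prec     # (d_out, d_in)
    eps = torch.randn(n, *W_map.shape)
    # Colour white noise by the inverse square-root precision in the eigenbasis.
    samples = torch.einsum('oi,nij,jk->nok', Ug, eps / prec.sqrt(), Ua.T)
    return W_map + samples
```

In a full pipeline, predictions would be averaged over several such weight samples to obtain the calibrated probabilities reflected in the ECE/NLL gains.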
Shanghai Jiao Tong University
Why we think this paper is great for you:
Focusing on accelerating Mixture-of-Experts inference for language models, this research directly addresses your interest in optimizing deep learning architectures, especially MoE, for practical applications. It offers valuable insights into managing memory constraints and improving efficiency in large models.
Abstract
Mixture-of-Experts (MoE) architectures scale language models by activating only a subset of specialized expert networks for each input token, thereby reducing the number of floating-point operations. However, the growing size of modern MoE models causes their full parameter sets to exceed GPU memory capacity; for example, Mixtral-8x7B has 45 billion parameters and requires 87 GB of memory even though only 14 billion parameters are used per token. Existing systems alleviate this limitation by offloading inactive experts to CPU memory, but transferring experts across the PCIe interconnect incurs significant latency (about 10 ms). Prefetching heuristics aim to hide this latency by predicting which experts are needed, but prefetch failures introduce significant stalls and amplify inference latency. In the event of a prefetch failure, prior work offers two primary solutions: either fetch the expert on demand, which incurs a long stall due to the PCIe bottleneck, or drop the expert from the computation, which significantly degrades model accuracy. The critical challenge, therefore, is to maintain both high inference speed and model accuracy when prefetching fails.
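The dilemma the abstract describes, stall on an on-demand fetch or drop the expert and lose accuracy, can be illustrated with a toy scheduler. All names, costs, and the cache structure below are hypothetical; real systems overlap PCIe transfers with computation on separate streams.

```python
# Toy model of the prefetch-failure dilemma for offloaded MoE experts.
GPU_CACHE = {}            # expert_id -> expert module resident on GPU
PCIE_LATENCY_MS = 10.0    # approximate cost of fetching one expert over PCIe

def run_expert(expert_id, x, cpu_store, allow_drop=False):
    """Returns (output, stall_ms) for one expert invocation."""
    if expert_id in GPU_CACHE:                 # prefetch hit: no stall
        return GPU_CACHE[expert_id](x), 0.0
    if allow_drop:                             # skip the expert: fast but lossy
        return x, 0.0                          # passthrough degrades accuracy
    expert = cpu_store[expert_id]              # on-demand fetch: correct but slow
    GPU_CACHE[expert_id] = expert
    return expert(x), PCIE_LATENCY_MS
```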
Why we think this paper is great for you:
This work provides a holistic evaluation of Large Multimodal Models, offering critical insights into their capabilities and limitations. It is highly relevant to your focus on multimodal learning and the development of robust large language models.
Abstract
The ability to critique is vital for models to self-improve and serve as reliable AI assistants. While extensively studied in language-only settings, multimodal critique of Large Multimodal Models (LMMs) remains underexplored despite their growing capabilities in tasks like captioning and visual reasoning. In this work, we introduce MM-CRITIC, a holistic benchmark for evaluating the critique ability of LMMs across multiple dimensions: basic, correction, and comparison. Covering 8 main task types and over 500 tasks, MM-CRITIC collects responses from various LMMs with different model sizes and is composed of 4471 samples. To enhance evaluation reliability, we integrate expert-informed ground answers into scoring rubrics that guide GPT-4o in annotating responses and generating reference critiques, which serve as anchors for trustworthy judgments. Extensive experiments validate the effectiveness of MM-CRITIC and provide a comprehensive assessment of leading LMMs' critique capabilities under multiple dimensions. Further analysis reveals some key insights, including the correlation between response quality and critique, and varying critique difficulty across evaluation dimensions. Our code is available at https://github.com/MichealZeng0420/MM-Critic.
Why we think this paper is great for you:
This paper delves into achieving robust multimodal learning in open-world scenarios, a key area for advancing deep learning models beyond controlled environments. You will appreciate its exploration of leveraging diverse data streams for enhanced model performance.
Abstract
The rapid evolution of machine learning has propelled neural networks to unprecedented success across diverse domains. In particular, multimodal learning has emerged as a transformative paradigm, leveraging complementary information from heterogeneous data streams (e.g., text, vision, audio) to advance contextual reasoning and intelligent decision-making. Despite these advancements, current neural network-based models often fall short in open-world environments characterized by inherent unpredictability, where dynamic environmental composition, incomplete modality inputs, and spurious distributional relations critically undermine system reliability. While humans naturally adapt to such dynamic, ambiguous scenarios, artificial intelligence systems exhibit stark limitations in robustness, particularly when processing multimodal signals under real-world complexity. This study investigates the fundamental challenge of multimodal learning robustness in open-world settings, aiming to bridge the gap between controlled experimental performance and practical deployment requirements.
University of Freiburg
Why we think this paper is great for you:
This research on multi-objective hyperparameter optimization in deep learning directly caters to your interest in refining and improving the performance of deep learning models. It offers practical strategies for optimizing complex model configurations.
Abstract
While Deep Learning (DL) experts often have prior knowledge about which hyperparameter settings yield strong performance, only a few Hyperparameter Optimization (HPO) algorithms can leverage such prior knowledge, and none incorporate priors over multiple objectives. As DL practitioners often need to optimize not just one but many objectives, this is a blind spot in the algorithmic landscape of HPO. To address this shortcoming, we introduce PriMO, the first HPO algorithm that can integrate multi-objective user beliefs. We show PriMO achieves state-of-the-art performance across 8 DL benchmarks in the multi-objective and single-objective setting, clearly positioning itself as the new go-to HPO algorithm for DL practitioners.
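As a rough illustration of what "leveraging prior knowledge" means in HPO, here is a sketch of prior-weighted candidate sampling in the spirit of prior-guided methods such as PiBO. The abstract does not specify PriMO's algorithm, so everything below (the decay schedule, the log-uniform search space, the single-objective setting) is an assumption.

```python
# Hedged sketch of prior-guided candidate sampling for HPO; NOT PriMO itself.
import numpy as np

rng = np.random.default_rng(0)

def user_prior_sample():
    """User belief: good learning rates cluster around 1e-3 (log-normal)."""
    return 10 ** rng.normal(-3.0, 0.5)

def propose(n_candidates, t, decay=2.0):
    """Mix prior draws with uniform exploration; trust the prior less over time."""
    p_prior = 1.0 / (1.0 + t / decay)          # prior weight decays with round t
    return [user_prior_sample() if rng.random() < p_prior
            else 10 ** rng.uniform(-6, 0)       # uniform over log-space
            for _ in range(n_candidates)]
```

A multi-objective variant would score each candidate on all objectives and keep the Pareto-nondominated set rather than a single best configuration.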
Why we think this paper is great for you:
This paper explores aligning diffusion models with human preferences, which is a crucial aspect of developing more effective and user-centric generative models. It will be highly relevant to your interest in diffusion models and their practical applications.
Abstract
Diffusion models have achieved remarkable success in conditional image generation, yet their outputs often remain misaligned with human preferences. To address this, recent work has applied Direct Preference Optimization (DPO) to diffusion models, yielding significant improvements. However, DPO-like methods exhibit two key limitations: 1) high computational cost, due to entire-model fine-tuning; 2) sensitivity to reference model quality, due to its tendency to introduce instability and bias. To overcome these limitations, we propose a novel framework for human preference alignment in diffusion models (PC-Diffusion), using a lightweight, trainable Preference Classifier that directly models the relative preference between samples. By restricting preference learning to this classifier, PC-Diffusion decouples preference alignment from the generative model, eliminating the need for entire-model fine-tuning and reference model reliance. We further provide theoretical guarantees for PC-Diffusion: 1) PC-Diffusion ensures that the preference-guided distributions are consistently propagated across timesteps. 2) The training objective of the preference classifier is equivalent to DPO, but does not require a reference model. 3) The proposed preference-guided correction can progressively steer generation toward preference-aligned regions. Empirical results show that PC-Diffusion achieves comparable preference consistency to DPO while significantly reducing training costs and enabling efficient and stable preference-guided generation.
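The abstract's key idea, a classifier whose training objective matches DPO without a reference model, suggests a Bradley-Terry-style pairwise loss. The sketch below assumes a hypothetical score_net(x_t, t) returning a scalar preference score per sample; it illustrates the general recipe, not the paper's exact objective.

```python
# Bradley-Terry-style preference classifier loss: an illustrative sketch.
import torch
import torch.nn.functional as F

def preference_loss(score_net, x_w, x_l, t):
    """x_w: preferred samples at timestep t; x_l: dispreferred samples."""
    s_w = score_net(x_w, t)                 # (B,) scores of preferred samples
    s_l = score_net(x_l, t)                 # (B,) scores of dispreferred samples
    # Maximize P(x_w beats x_l) = sigmoid(s_w - s_l); no reference model needed.
    return -F.logsigmoid(s_w - s_l).mean()
```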
Why we think this paper is great for you:
This research offers a deeper understanding of the underlying mechanisms of diffusion models by examining the order of noise in the generation process. It provides fundamental insights into how these powerful models operate.
Abstract
In text-driven content generation (T2C) diffusion models, the semantics of the generated content are mostly attributed to the interaction between text embeddings and the attention mechanism. The initial noise of the generation process is typically treated as a random element that contributes to the diversity of the generated content. Contrary to this view, this paper reveals that beneath the random surface of the noise lie strong, analyzable patterns. Specifically, we first conduct a comprehensive analysis of the impact of random noise on the model's generation. We find that the noise not only contains rich semantic information, but also permits unwanted semantics to be erased from it in an extremely simple way grounded in information theory, and that the equivalence between the diffusion generation process and semantic injection can be used to inject semantics into the cleaned noise. We then mathematically characterize these observations and propose a simple but efficient, training-free and universal two-step "Semantic Erasure-Injection" process to modulate the initial noise in T2C diffusion models. Experimental results demonstrate that our method is consistently effective across various T2C models based on both DiT and UNet architectures, presenting a novel perspective for optimizing diffusion generation and providing a universal tool for consistent generation.
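Reading the abstract literally, the two steps are: destroy whatever structure the initial noise carries, then re-introduce desired semantics through the forward-diffusion identity $x_T=\sqrt{\bar\alpha_T}\,x_0+\sqrt{1-\bar\alpha_T}\,\epsilon$. The erasure rule (re-whitening) and injection rule below are illustrative guesses at that recipe, not the paper's method.

```python
# Hedged sketch of a two-step "erase then inject" noise modulation.
import torch

def erase_semantics(noise):
    """Whiten the noise per sample so any latent structure is destroyed."""
    flat = noise.flatten(1)
    flat = (flat - flat.mean(1, keepdim=True)) / flat.std(1, keepdim=True)
    return flat.view_as(noise)

def inject_semantics(clean_noise, ref_latent, alpha_bar_T):
    """Forward-diffuse a semantic reference latent to timestep T, reusing the
    cleaned noise as the stochastic term (the DDPM q(x_T | x_0) form)."""
    return (alpha_bar_T ** 0.5) * ref_latent + ((1 - alpha_bar_T) ** 0.5) * clean_noise
```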
Deep Learning Architectures
University of Michigan
Abstract
We propose a deep neural-operator framework for a general class of probability models. Under global Lipschitz conditions on the operator over the entire Euclidean space, and for a broad class of probabilistic models, we establish a universal approximation theorem with explicit network-size bounds for the proposed architecture. The underlying stochastic processes are required only to satisfy integrability and general tail-probability conditions. We verify these assumptions for both European and American option-pricing problems within the forward-backward SDE (FBSDE) framework, which in turn covers a broad class of operators arising from parabolic PDEs, with or without free boundaries. Finally, we present a numerical example for a basket of American options, demonstrating that the learned model produces optimal stopping boundaries for new strike prices without retraining.
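The abstract does not spell out the architecture, but a standard branch/trunk neural operator (DeepONet-style) is a reasonable mental model: one network encodes the input function (e.g., a payoff sampled at sensor points), another encodes the query point (e.g., spot price and time), and the output is their inner product. The sketch below is that generic construction, not the paper's network.

```python
# Generic DeepONet-style operator: an illustrative stand-in architecture.
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    def __init__(self, n_sensors, query_dim, width=64, p=32):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(n_sensors, width), nn.Tanh(),
                                    nn.Linear(width, p))
        self.trunk = nn.Sequential(nn.Linear(query_dim, width), nn.Tanh(),
                                   nn.Linear(width, p))

    def forward(self, u_samples, y):
        """u_samples: (B, n_sensors) input function at sensor points, e.g. a
        payoff on a strike grid; y: (B, query_dim) query, e.g. (S, t)."""
        return (self.branch(u_samples) * self.trunk(y)).sum(-1)
```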
HUST
Abstract
Determining the minimum width of fully connected neural networks has become a fundamental problem in recent theoretical studies of deep neural networks. In this paper, we study the lower and upper bounds on the minimum width required for fully connected neural networks to have universal approximation capability, which is important in network design and training. We show that $w_{min}\leq\max(2d_x+1, d_y)$ for networks with ELU and SELU, and that the upper bound of this inequality is attained when $d_y=2d_x$, where $d_x$, $d_y$ denote the input and output dimensions, respectively. Besides, we show that $d_x+1\leq w_{min}\leq d_x+d_y$ for networks with LeakyReLU, ELU, CELU, SELU, and Softplus, by proving that ReLU can be approximated by these activation functions. In addition, in the case that the activation function is injective or can be uniformly approximated by a sequence of injective functions (e.g., ReLU), we present a new proof of the inequality $w_{min}\ge d_y+\mathbf{1}_{d_x<d_y}$.
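The step "ReLU can be approximated by these activation functions" is in fact exact for LeakyReLU via a one-line identity; for ELU, CELU, SELU, and Softplus a limiting or scaling argument is needed instead. The identity below is standard material, not a quotation from the paper:

```latex
% LeakyReLU_a(x) = max(x, a x) with slope 0 < a < 1; then exactly
\mathrm{ReLU}(x) = \frac{\mathrm{LeakyReLU}_a(x) - a\,x}{1 - a},
% since x >= 0 gives (x - a x)/(1 - a) = x, and x < 0 gives (a x - a x)/(1 - a) = 0.
```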
Deep Learning
ENSTA Paris
Abstract
Deep Neural Networks (DNNs) have demonstrated remarkable performance across various domains, including computer vision and natural language processing. However, they often struggle to accurately quantify the uncertainty of their predictions, limiting their broader adoption in critical real-world applications. Uncertainty Quantification (UQ) for Deep Learning seeks to address this challenge by providing methods to improve the reliability of uncertainty estimates. Although numerous techniques have been proposed, a unified tool offering a seamless workflow to evaluate and integrate these methods remains lacking. To bridge this gap, we introduce Torch-Uncertainty, a PyTorch and Lightning-based framework designed to streamline DNN training and evaluation with UQ techniques and metrics. In this paper, we outline the foundational principles of our library and present comprehensive experimental results that benchmark a diverse set of UQ methods across classification, segmentation, and regression tasks. Our library is available at https://github.com/ENSTA-U2IS-AI/Torch-Uncertainty
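For context on the kind of metric such a benchmark reports, below is a plain-PyTorch sketch of expected calibration error (ECE), a standard UQ metric; this is a generic implementation for illustration, not Torch-Uncertainty's own API.

```python
# Generic expected calibration error (ECE) over equal-width confidence bins.
import torch

def ece(probs, labels, n_bins=15):
    """probs: (N, C) softmax outputs; labels: (N,) integer targets."""
    conf, pred = probs.max(dim=1)
    correct = pred.eq(labels).float()
    bins = torch.linspace(0, 1, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = (correct[mask].mean() - conf[mask].mean()).abs()
            total += mask.float().mean() * gap   # bin weight times |acc - conf|
    return total
```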
Deep Learning Optimization
USTC
Abstract
Embedding deep neural networks (NNs) into mixed-integer programs (MIPs) is attractive for decision making with learned constraints, yet state-of-the-art monolithic linearisations blow up in size and quickly become intractable. In this paper, we introduce a novel dual-decomposition framework that relaxes the single coupling equality u=x with an augmented Lagrange multiplier and splits the problem into a vanilla MIP and a constrained NN block. Each part is tackled by the solver that suits it best: branch and cut for the MIP subproblem, first-order optimisation for the NN subproblem. The model thus remains modular, the number of integer variables never grows with network depth, and the per-iteration cost scales only linearly with the NN size. On the public SurrogateLIB benchmark, our method proves scalable, modular, and adaptable: it runs $120\times$ faster than an exact Big-M formulation on the largest test case; the NN sub-solver can be swapped from a log-barrier interior step to a projected-gradient routine with no code changes and identical objective value; and swapping the MLP for an LSTM backbone still completes the full optimisation in 47 s without any bespoke adaptation.
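The alternating structure the abstract describes can be summarized in a few lines. The solver calls below are stubs (hypothetical callables standing in for branch-and-cut and a first-order NN subsolver), and the update rule is a generic augmented-Lagrangian scheme, not the paper's exact algorithm.

```python
# Schematic augmented-Lagrangian split for the coupling constraint u = x.
import numpy as np

def dual_decompose(mip_argmin, nn_argmin, dim, rho=1.0, iters=50):
    """mip_argmin/nn_argmin: stubs minimizing their block of the augmented
    Lagrangian L(x, u, lam) = f(x) + g(u) + lam.(u - x) + (rho/2)||u - x||^2."""
    x, u, lam = np.zeros(dim), np.zeros(dim), np.zeros(dim)
    for _ in range(iters):
        x = mip_argmin(u, lam, rho)     # MIP block via branch and cut
        u = nn_argmin(x, lam, rho)      # NN block via a first-order method
        lam += rho * (u - x)            # dual ascent on the coupling residual
    return x, u
```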
Large Language Models
Abstract
Large Language Models (LLMs) exhibit strong performance across various natural language processing (NLP) tasks but remain vulnerable to hallucinations, generating factually incorrect or misleading outputs. Uncertainty estimation, often using predictive entropy estimation, is key to addressing this issue. However, existing methods often require multiple samples or extra computation to assess semantic entropy. This paper proposes an efficient, training-free uncertainty estimation method that approximates predictive entropy using the responses' top-$K$ probabilities. Moreover, we employ an adaptive mechanism to determine $K$ to enhance flexibility and filter out low-confidence probabilities. Experimental results on three free-form question-answering datasets across several LLMs demonstrate that our method outperforms expensive state-of-the-art baselines, contributing to the broader goal of enhancing LLM trustworthiness.
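A minimal sketch of the idea, assuming access to next-token logits: keep only the top-K probabilities, adaptively discard low-confidence mass, renormalize, and compute entropy over what remains. The cutoff rule and constants are illustrative, not the paper's estimator.

```python
# Top-K approximation of predictive entropy with an adaptive cutoff.
import torch

def topk_entropy(logits, k_max=20, min_p=0.01):
    """logits: (V,) next-token logits for one prediction step."""
    probs = logits.softmax(dim=-1)
    top_p, _ = probs.topk(k_max, dim=-1)
    top_p = top_p[top_p > min_p]        # adaptive K: drop low-confidence tail
    top_p = top_p / top_p.sum()         # renormalize the kept mass
    return -(top_p * top_p.log()).sum()
```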
Abstract
In this paper, we consider military use cases and applications of natural language processing and large language models, which rose to prominence with the invention of the generative pre-trained transformer (GPT) and the extensive foundation-model pretraining done by OpenAI for ChatGPT and others. First, we interrogate a GPT-based language model (viz. Microsoft Copilot) to make it reveal its own knowledge about potential military applications, and then critically assess that information. Second, we study how commercial cloud services (viz. Microsoft Azure) could readily be used to build such applications and assess which of them are feasible. We conclude that the summarization and generative properties of language models directly facilitate many applications at large, and that other features may find particular uses.
Interests not found
We did not find any papers matching the interests below.
Try other terms, and consider whether the content exists on arxiv.org.
Help us improve your experience!
This project is in its early stages; your feedback can be pivotal to its future.
Let us know what you think about this week's papers and suggestions!
Give Feedback