🎯 Top Personalized Recommendations
Hong Kong University of
Why we think this paper is great for you:
This paper directly addresses accelerating Mixture-of-Experts Multimodal Large Language Models, which is a highly relevant combination of advanced architectures and learning paradigms for you. It offers insights into improving the computational efficiency of these complex models.
Abstract
Mixture-of-Experts (MoE) multimodal large language models (MLLMs) excel at vision-language tasks, but they incur high computational costs. To reduce inference overhead, expert skipping methods have been proposed to deactivate redundant experts based on the current input tokens. However, we find that applying these methods, originally designed for unimodal large language models (LLMs), to MLLMs results in considerable performance degradation. This is primarily because such methods fail to account for the heterogeneous contributions of experts across MoE layers and the modality-specific behaviors of tokens within these layers. Motivated by these findings, we propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. It incorporates a globally-modulated local gating (GMLG) mechanism that integrates global layer-wise importance into local routing probabilities to accurately estimate per-token expert importance. A dual-modality thresholding (DMT) method is then applied, which processes tokens from each modality separately, to derive the skipping schedule. To set the optimal thresholds, we introduce a frontier search algorithm that exploits monotonicity properties, cutting convergence time from several days to a few hours. Extensive experiments on 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches. For instance, when skipping 88% of experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67% (97.33% vs. 86.66%). Furthermore, MoDES significantly enhances inference speed, improving the prefilling time by 2.16$\times$ and the decoding time by 1.26$\times$.
AI Summary
- The Globally-Modulated Local Gating (GMLG) mechanism integrates offline-calibrated global layer-wise importance factors ($\alpha^{(l)}$) with local routing probabilities ($\pi_i^{(l)}$) to accurately estimate per-token expert importance scores ($s_i^{(l)}$). [3]
- MoDES achieves substantial performance enhancements (e.g., up to a 10.67% average performance boost for Qwen3-VL-MoE-30B) at high expert skipping ratios (>80%), while consistently retaining >95% of the original models' accuracy. [3]
- The framework significantly accelerates inference, demonstrating a ~2.16x speedup in prefilling time and a ~1.26x speedup in decoding time for large MoE MLLMs. [3]
- Globally-Modulated Local Gating (GMLG): A mechanism within MoDES that combines a global layer-specific importance factor ($\alpha^{(l)}$) with local routing probabilities ($\pi_i^{(l)}$) to compute refined importance scores ($s_i^{(l)}$) for top-k experts. [3]
- Existing expert skipping methods, designed for unimodal LLMs, cause significant performance degradation when applied to MoE MLLMs due to their failure to account for heterogeneous expert contributions across layers and modality-specific token behaviors. [2]
- MoDES introduces a training-free framework that adaptively skips experts in MoE MLLMs, achieving efficient and accurate inference by explicitly addressing the aforementioned limitations. [2]
- The Dual-Modality Thresholding (DMT) method applies distinct, modality-specific thresholds ($\tau_t$ for text, $\tau_v$ for vision) to expert importance scores, enabling a tailored and effective skipping strategy for multimodal inputs. [2]
- A novel frontier search algorithm efficiently determines optimal modality-specific thresholds by leveraging monotonicity properties of performance loss and efficiency, reducing search time from days to hours. [2]
- MoDES (Multimodal Dynamic Expert Skipping): The first training-free framework designed for MoE MLLMs that adaptively skips redundant experts to enable efficient and accurate inference. [2]
- Dual-Modality Thresholding (DMT): A method within MoDES that applies separate, modality-specific thresholds ($\tau_t$ for text tokens and $\tau_v$ for visual tokens) to expert importance scores to determine which experts to skip (see the sketch following this summary). [2]
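For intuition, here is a minimal sketch of how the GMLG scoring and DMT skipping described above could fit together in a single MoE layer. The function names, tensor shapes, and the calibrated `alpha_l` factor are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of globally-modulated local gating (GMLG) plus
# dual-modality thresholding (DMT); names and shapes are assumptions.
import torch

def gmlg_scores(router_logits: torch.Tensor, alpha_l: float, top_k: int):
    """Combine local routing probabilities with a global layer-wise factor.

    router_logits: [num_tokens, num_experts] raw router outputs for one layer.
    alpha_l: offline-calibrated importance factor for this layer.
    Returns the top-k expert indices and their modulated importance scores.
    """
    local_probs = torch.softmax(router_logits, dim=-1)       # pi_i^(l)
    topk_probs, topk_idx = local_probs.topk(top_k, dim=-1)
    scores = alpha_l * topk_probs                             # s_i^(l)
    return topk_idx, scores

def dmt_skip_mask(scores: torch.Tensor, is_text: torch.Tensor,
                  tau_t: float, tau_v: float) -> torch.Tensor:
    """Keep an expert only if its score clears the threshold for the token's modality."""
    thresholds = torch.where(is_text.unsqueeze(-1),
                             torch.full_like(scores, tau_t),
                             torch.full_like(scores, tau_v))
    return scores >= thresholds      # True = execute the expert, False = skip it
```

Because text and vision tokens are compared against separate thresholds, the two modalities can be pruned at different rates, which is the behavior the DMT bullet above describes.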
University of Connecticut
Why we think this paper is great for you:
You will find this paper highly relevant as it explores dynamic quantization for Mixture-of-Experts models, specifically addressing scalability for Large Language Model inference. This directly aligns with optimizing advanced deep learning architectures.
Abstract
Mixture-of-Experts (MoE) models scale LLM capacity efficiently, but deployment on consumer GPUs is limited by the large memory footprint of inactive experts. Static post-training quantization reduces storage costs but cannot adapt to shifting activation patterns, causing accuracy loss under aggressive compression. We therefore present DynaExq, a runtime system that treats expert precision as a first-class, dynamically managed resource. DynaExq combines (1) a hotness-aware precision controller that continuously aligns expert bit-widths with long-term activation statistics, (2) a fully asynchronous precision-switching pipeline that overlaps promotion and demotion with MoE computation, and (3) a fragmentation-free memory pooling mechanism that supports hybrid-precision experts with deterministic allocation. Together, these components enable stable, non-blocking precision transitions under strict HBM budgets.
Across Qwen3-30B and Qwen3-80B MoE models and six representative benchmarks, DynaExq deploys large LLMs on single RTX 5090 and A6000 GPUs and improves accuracy by up to 4.03 points over static low-precision baselines. The results show that adaptive, workload-aware quantization is an effective strategy for memory-constrained MoE serving.
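As a rough illustration of the hotness-aware idea, the sketch below tracks long-term expert activation frequencies with an exponential moving average and maps hot experts to a higher bit-width. The class name, the two-level int8/int4 ladder, and the thresholds are assumptions for illustration, not DynaExq's actual controller.

```python
# Illustrative hotness-aware precision controller; names and policy are assumptions.
from collections import defaultdict

class PrecisionController:
    def __init__(self, decay: float = 0.99, hot_threshold: float = 0.05):
        self.decay = decay                  # EMA decay for long-term activation statistics
        self.hot_threshold = hot_threshold  # hotness level that earns higher precision
        self.hotness = defaultdict(float)   # expert_id -> EMA of activation frequency

    def observe(self, activated_experts, num_experts: int) -> None:
        """Update per-expert hotness after one routing step."""
        activated = set(activated_experts)
        for expert_id in range(num_experts):
            hit = 1.0 if expert_id in activated else 0.0
            self.hotness[expert_id] = (
                self.decay * self.hotness[expert_id] + (1.0 - self.decay) * hit
            )

    def target_precision(self, expert_id: int) -> str:
        """Hot experts are promoted to higher precision; cold ones stay aggressively quantized."""
        return "int8" if self.hotness[expert_id] >= self.hot_threshold else "int4"
```

In a real system the promotions and demotions implied by `target_precision` would run asynchronously so they never block the MoE forward pass, which is the role of the paper's precision-switching pipeline.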
Korea University
Why we think this paper is great for you:
This work on uncertainty-resilient multimodal learning is a strong match, as it tackles critical challenges in integrating diverse data types effectively. It provides valuable strategies for building robust multimodal systems.
Abstract
Multimodal learning systems often face substantial uncertainty due to noisy data, low-quality labels, and heterogeneous modality characteristics. These issues become especially critical in human-computer interaction settings, where data quality, semantic reliability, and annotation consistency vary across users and recording conditions. This thesis tackles these challenges by exploring uncertainty-resilient multimodal learning through consistency-guided cross-modal transfer. The central idea is to use cross-modal semantic consistency as a basis for robust representation learning. By projecting heterogeneous modalities into a shared latent space, the proposed framework mitigates modality gaps and uncovers structural relations that support uncertainty estimation and stable feature learning. Building on this foundation, the thesis investigates strategies to enhance semantic robustness, improve data efficiency, and reduce the impact of noise and imperfect supervision without relying on large, high-quality annotations. Experiments on multimodal affect-recognition benchmarks demonstrate that consistency-guided cross-modal transfer significantly improves model stability, discriminative ability, and robustness to noisy or incomplete supervision. Latent space analyses further show that the framework captures reliable cross-modal structure even under challenging conditions. Overall, this thesis offers a unified perspective on resilient multimodal learning by integrating uncertainty modeling, semantic alignment, and data-efficient supervision, providing practical insights for developing reliable and adaptive brain-computer interface systems.
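To make the consistency-guided idea concrete, the snippet below sketches one generic way to encourage agreement between two modality encoders in a shared latent space via a cosine-similarity consistency loss. The projector architecture, dimensions, and loss form are assumptions, not the thesis's specific method.

```python
# Generic cross-modal consistency sketch; architecture and loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceProjector(nn.Module):
    """Projects one modality's features into a shared, unit-normalized latent space."""
    def __init__(self, in_dim: int, latent_dim: int = 128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(x), dim=-1)

def consistency_loss(z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
    """Penalize semantic disagreement between paired samples from two modalities."""
    return (1.0 - F.cosine_similarity(z_a, z_b, dim=-1)).mean()

# Paired features from two modalities (random stand-ins for real recordings).
proj_a, proj_b = SharedSpaceProjector(in_dim=64), SharedSpaceProjector(in_dim=32)
x_a, x_b = torch.randn(16, 64), torch.randn(16, 32)
loss = consistency_loss(proj_a(x_a), proj_b(x_b))
```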
Peking University
Why we think this paper is great for you:
This paper offers a novel perspective on Latent Diffusion Models by unifying their architecture for end-to-end training, which could significantly streamline your work with generative models. It presents an efficient approach to diffusion model design.
Abstract
Standard Latent Diffusion Models rely on a complex, three-part architecture consisting of a separate encoder, decoder, and diffusion network, which are trained in multiple stages. This modular design is computationally inefficient, leads to suboptimal performance, and prevents the unification of diffusion with the single-network architectures common in vision foundation models. Our goal is to unify these three components into a single, end-to-end trainable network. We first demonstrate that a naive joint training approach fails catastrophically due to "latent collapse", where the diffusion training objective interferes with the network's ability to learn a good latent representation. We identify the root causes of this instability by drawing a novel analogy between diffusion and self-distillation-based unsupervised learning methods. Based on this insight, we propose Diffusion as Self-Distillation (DSD), a new framework with key modifications to the training objective that stabilize the latent space. This approach enables, for the first time, the stable end-to-end training of a single network that simultaneously learns to encode, decode, and perform diffusion. DSD achieves outstanding performance on the ImageNet $256\times 256$ conditional generation task: FID=13.44/6.38/4.25 with only 42M/118M/205M parameters and 50 training epochs on ImageNet, without using classifier-free guidance.
Sony
Why we think this paper is great for you:
You will appreciate this paper's exploration of MeanFlow Transformers, a diffusion-motivated generative model that learns efficient few-step generation. It offers insights into advanced deep learning architectures for generative tasks.
Abstract
MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data. In practice, it is often used as a latent MF by leveraging the pre-trained Stable Diffusion variational autoencoder (SD-VAE) for high-dimensional data modeling. However, MF training remains computationally demanding and is often unstable. During inference, the SD-VAE decoder dominates the generation cost, and MF depends on complex guidance hyperparameters for class-conditional generation. In this work, we develop an efficient training and sampling scheme for MF in the latent space of a Representation Autoencoder (RAE), where a pre-trained vision encoder (e.g., DINO) provides semantically rich latents paired with a lightweight decoder. We observe that naive MF training in the RAE latent space suffers from severe gradient explosion. To stabilize and accelerate training, we adopt Consistency Mid-Training for trajectory-aware initialization and use a two-stage scheme: distillation from a pre-trained flow matching teacher to speed convergence and reduce variance, followed by an optional bootstrapping stage with a one-point velocity estimator to further reduce deviation from the oracle mean flow. This design removes the need for guidance, simplifies training configurations, and reduces computation in both training and sampling. Empirically, our method achieves a 1-step FID of 2.03, outperforming vanilla MF's 3.43, while reducing sampling GFLOPS by 38% and total training cost by 83% on ImageNet 256. We further scale our approach to ImageNet 512, achieving a competitive 1-step FID of 3.23 with the lowest GFLOPS among all baselines. Code is available at https://github.com/sony/mf-rae.
University College London
Why we think this paper is great for you:
This paper on multimodal representation learning, even in a specific domain, directly aligns with your interest in integrating and learning from diverse data sources. It showcases practical applications of multimodal techniques.
Abstract
Paediatric kidney disease varies widely in its presentation and progression, which calls for continuous monitoring of renal function. Using electronic health records collected between 2019 and 2025 at Great Ormond Street Hospital, a leading UK paediatric hospital, we explored a temporal modelling approach that integrates longitudinal laboratory sequences with demographic information. A recurrent neural model trained on these data was used to predict whether a child would record an abnormal serum creatinine value within the following thirty days. Framed as a pilot study, this work provides an initial demonstration that simple temporal representations can capture useful patterns in routine paediatric data and lays the groundwork for future multimodal extensions using additional clinical signals and more detailed renal outcomes.
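As a rough sketch of the kind of temporal model the abstract describes, the code below combines a GRU over longitudinal laboratory sequences with static demographic features to produce a binary 30-day abnormal-creatinine prediction. The architecture, feature dimensions, and names are illustrative assumptions rather than the study's actual model.

```python
# Illustrative sequence-plus-demographics classifier; dimensions are assumptions.
import torch
import torch.nn as nn

class RenalRiskGRU(nn.Module):
    def __init__(self, lab_dim: int = 12, demo_dim: int = 4, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(lab_dim, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + demo_dim, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, labs: torch.Tensor, demographics: torch.Tensor) -> torch.Tensor:
        """labs: [batch, time, lab_dim]; demographics: [batch, demo_dim]."""
        _, h_n = self.gru(labs)                        # final hidden state of the GRU
        features = torch.cat([h_n[-1], demographics], dim=-1)
        return self.head(features).squeeze(-1)         # logit: abnormal creatinine within 30 days

model = RenalRiskGRU()
logits = model(torch.randn(8, 20, 12), torch.randn(8, 4))  # 8 children, 20 lab visits each
```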
BiostateAI
Why we think this paper is great for you:
This research investigates how Large Language Models handle probabilistic distributions, highlighting a fundamental aspect of their behavior and limitations. It offers crucial insights into the capabilities and challenges of LLMs.
Abstract
Scientific idea generation and selection requires exploration following a target probability distribution. In contrast, current AI benchmarks have objectively correct answers, and training large language models (LLMs) via reinforcement learning against these benchmarks discourages probabilistic exploration. Here, we conducted systematic experiments requesting LLMs to produce outputs following simple probabilistic distributions, and found that all modern LLMs tested grossly fail to follow the distributions. For example, requesting a binary output of "1" 49% of the time produces an answer of "0" nearly 100% of the time. This step-function-like behavior of near-exclusively generating the output with the marginally highest probability overrules even strong built-in LLM biases.
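A minimal sketch of how such a distribution-following test could be run: repeatedly prompt a model for a binary output with a stated target probability and compare the empirical frequency against the target. The `sample_model` function is a hypothetical stand-in for whatever model client is used; the prompt wording is also an assumption.

```python
# Hypothetical harness for checking whether an LLM follows a requested distribution.
from collections import Counter

def sample_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to an LLM API that returns '0' or '1'."""
    raise NotImplementedError("plug in your model client here")

def distribution_following_test(target_p_one: float = 0.49, trials: int = 200) -> None:
    prompt = (
        f"Output a single character: '1' with probability {target_p_one:.2f} "
        f"and '0' with probability {1 - target_p_one:.2f}. Output only the character."
    )
    counts = Counter(sample_model(prompt).strip() for _ in range(trials))
    empirical_p_one = counts.get("1", 0) / trials
    print(f"target P('1') = {target_p_one:.2f}, empirical = {empirical_p_one:.2f}")
```

A step-function failure of the kind the abstract reports would show up as an empirical frequency near 0 or 1 regardless of the requested 0.49.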