🎯 Top Personalized Recommendations
Kuaishou Technology
AI Summary - The authors train DSMoE models using horizontal flips for data augmentation, the AdamW optimizer, and an exponential moving average (EMA). [3]
- The paper proposes a new architecture called DSMoE that combines diffusion models and mixture-of-experts (MoE) layers to improve performance on image synthesis tasks. [2]
- The paper also proposes a new routing strategy called Expert Race that allows for flexible routing in diffusion transformers with MoE layers. [1]
Abstract
Recent efforts on Diffusion Mixture-of-Experts (MoE) models have primarily focused on developing more sophisticated routing mechanisms. However, we observe that the underlying architectural configuration space remains markedly under-explored. Inspired by the MoE design paradigms established in large language models (LLMs), we identify a set of crucial architectural factors for building effective Diffusion MoE models--including DeepSeek-style expert modules, alternative intermediate widths, varying expert counts, and enhanced attention positional encodings. Our systematic study reveals that carefully tuning these configurations is essential for unlocking the full potential of Diffusion MoE models, often yielding gains that exceed those achieved by routing innovations alone. Through extensive experiments, we present novel architectures that can be efficiently applied to both latent and pixel-space diffusion frameworks, which provide a practical and efficient training recipe that enables Diffusion MoE models to surpass strong baselines while using equal or fewer activated parameters. All code and models are publicly available at: https://github.com/yhlleo/EfficientMoE.
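For readers who want to prototype the architectural factors this abstract highlights (expert count, intermediate width, top-k routing), here is a minimal PyTorch sketch of a configurable MoE feed-forward block. It is an illustrative baseline under generic assumptions, not the paper's DSMoE/EfficientMoE code; the class name, defaults, and dispatch loop are our own.

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """Generic top-k MoE FFN; expert count and hidden width are the tunable knobs."""
    def __init__(self, d_model=512, d_hidden=1024, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                              # x: (n_tokens, d_model)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                    # route each token to its k-th expert
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out
```

Sweeping n_experts, d_hidden, and top_k in a block like this is the kind of configuration search the abstract argues is under-explored.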
Why we think this paper is great for you:
This paper offers practical insights into the efficient training of Diffusion Mixture-of-Experts models. It directly aligns with your interest in both Diffusion Models and Mixture of Experts architectures, alongside optimization.
Chicago Booth
Abstract
In large-scale AI training, Sparse Mixture-of-Experts (s-MoE) layers enable scaling by activating only a small subset of experts per token. An operational challenge in this design is load balancing: routing tokens to minimize the number of idle experts, which is important for the efficient utilization of (costly) GPUs. We provide a theoretical framework for analyzing the Auxiliary-Loss-Free Load Balancing (ALF-LB) procedure -- proposed by DeepSeek's Wang et al. (2024) -- by casting it as a one-step-per-iteration primal-dual method for an assignment problem. First, in a stylized deterministic setting, our framework yields several insightful structural properties: (i) a monotonic improvement of a Lagrangian objective, (ii) a preference rule that moves tokens from overloaded to underloaded experts, and (iii) an approximate-balancing guarantee. Then, we incorporate the stochastic and dynamic nature of AI training using a generalized online optimization formulation. In the online setting, we derive a strong convexity property of the objective that leads to a logarithmic expected regret bound under certain step-size choices. Additionally, we present real experiments on 1B-parameter DeepSeekMoE models to complement our theoretical findings. Together, these results build a principled framework for analyzing the Auxiliary-Loss-Free Load Balancing of s-MoE in AI models.
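As a rough picture of the procedure under analysis, the sketch below runs one iteration of bias-adjusted top-k routing in the spirit of auxiliary-loss-free load balancing: a per-expert bias shifts the routing scores used for expert selection, and after each batch the bias is nudged up for underloaded experts and down for overloaded ones. The function name, step size gamma, and the sign-based update are illustrative assumptions, not the exact ALF-LB recipe or the paper's primal-dual formulation.

```python
import numpy as np

def alf_lb_step(scores, bias, top_k=2, gamma=1e-3):
    """One illustrative routing-plus-bias-update iteration.

    scores: (n_tokens, n_experts) token-expert affinities; bias: (n_experts,) routing bias.
    """
    n_tokens, n_experts = scores.shape
    # The bias only affects which experts are selected, not the gating weights.
    chosen = np.argsort(-(scores + bias), axis=1)[:, :top_k]
    load = np.bincount(chosen.ravel(), minlength=n_experts)
    target = n_tokens * top_k / n_experts
    bias = bias + gamma * np.sign(target - load)   # raise bias of underloaded experts, lower overloaded
    return chosen, load, bias
```

Iterating this update is what the paper casts as a one-step-per-iteration primal-dual method for an assignment problem.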
Why we think this paper is great for you:
This work provides a theoretical framework for optimizing Sparse Mixture-of-Experts in large-scale AI models. It is highly relevant to your focus on Mixture of Experts, Deep Learning Optimization, and Large Language Models.
Microsoft
Abstract
Diffusion models have achieved remarkable success in image generation, yet their deployment remains constrained by the heavy computational cost and the need for numerous inference steps. Previous efforts on fewer-step distillation attempt to skip redundant steps by training compact student models, yet they often suffer from heavy retraining costs and degraded generalization. In this work, we take a different perspective: we accelerate smartly, not evenly, applying smaller speedups to early semantic stages and larger ones to later redundant phases. We instantiate this phase-aware strategy with two experts that specialize in slow and fast denoising phases. Surprisingly, instead of investing massive effort in retraining student models, we find that simply equipping the base model with lightweight LoRA adapters achieves both efficient acceleration and strong generalization. We refer to these two adapters as Slow-LoRA and Fast-LoRA. Through extensive experiments, our method achieves up to 5× acceleration over the base model while maintaining comparable visual quality across diverse benchmarks. Remarkably, the LoRA experts are trained with only 1 samples on a single V100 within one hour, yet the resulting models generalize strongly on unseen prompts.
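The phase-aware schedule can be pictured with a toy sampling loop: early (semantic) steps use one expert and later (redundant) steps another. The switch fraction, the callable interface, and the stand-in experts below are hypothetical; the actual Slow-LoRA/Fast-LoRA adapters and their training are described in the paper.

```python
def sample_phase_aware(denoise_step, slow_expert, fast_expert, timesteps, x, switch_frac=0.4):
    """Use the slow expert for the early semantic phase, the fast expert afterwards."""
    switch_at = int(switch_frac * len(timesteps))
    for i, t in enumerate(timesteps):
        expert = slow_expert if i < switch_at else fast_expert
        x = denoise_step(x, t, expert)
    return x

# Toy usage with stand-in callables (not real denoisers or LoRA adapters):
out = sample_phase_aware(
    denoise_step=lambda x, t, expert: expert(x, t),
    slow_expert=lambda x, t: 0.9 * x,   # stand-in for the base model + Slow-LoRA
    fast_expert=lambda x, t: 0.5 * x,   # stand-in for the base model + Fast-LoRA
    timesteps=range(10), x=1.0)
```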
Why we think this paper is great for you:
This paper presents a method to significantly accelerate Diffusion Models, reducing their computational cost. It directly addresses your interest in Diffusion Models and Deep Learning Optimization for efficiency.
The Chinese University of Hong Kong
AI Summary - Reward-guided diffusion models can be used to directly modify the unguided backward process in order to achieve a specific goal, such as maximizing an external reward function. [3]
- The proposed sampler for reward-guided diffusion models offers several practical advantages, including reduced implementation complexity and the ability to reuse pretrained scores for any choice of guidance strength. [3]
- Reward-guided diffusion models: A type of diffusion model that uses an external reward function to guide the generation process and achieve a specific goal. [3]
- Classifier-free diffusion guidance does not uniformly enhance the quality of every generated sample, but improves overall sample quality by reducing the expected reciprocal of the classifier probability. [2]
- The performance of guided diffusion models is often assessed by two criteria: diversity and sample quality. [1]
Abstract
Guided or controlled data generation with diffusion models (partial preliminary results of this work appeared in the International Conference on Machine Learning 2025 [li2025provable]) has become a cornerstone of modern generative modeling. Despite substantial advances in diffusion model theory, the theoretical understanding of guided diffusion samplers remains severely limited. We make progress by developing a unified algorithmic and theoretical framework that accommodates both diffusion guidance and reward-guided diffusion. Aimed at fine-tuning diffusion models to improve certain rewards, we propose injecting a reward guidance term -- constructed from the difference between the original and reward-reweighted scores -- into the backward diffusion process, and rigorously quantify the resulting reward improvement over the unguided counterpart. As a key application, our framework shows that classifier-free guidance (CFG) decreases the expected reciprocal of the classifier probability, providing the first theoretical characterization of the specific performance metric that CFG improves for general target distributions. When applied to reward-guided diffusion, our framework yields a new sampler that is easy-to-train and requires no full diffusion trajectories during training. Numerical experiments further corroborate our theoretical findings.
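To make the guidance construction concrete, here is a hedged one-step sketch of a reverse-time Euler-Maruyama update in which the drift is augmented with w * (reward-reweighted score - original score), the difference described in the abstract. The VP-SDE-style coefficients, sign conventions, and function names are illustrative assumptions, not the paper's sampler.

```python
import numpy as np

def guided_reverse_step(x, t, score_orig, score_reward, beta, w=1.0, dt=1e-2,
                        rng=np.random.default_rng()):
    """One illustrative reverse-diffusion step with a reward guidance term injected."""
    s = score_orig(x, t)                               # unguided score estimate
    guidance = w * (score_reward(x, t) - s)            # reward guidance term
    drift = 0.5 * beta * x + beta * (s + guidance)     # reverse-time drift (VP-SDE-style)
    return x + drift * dt + np.sqrt(beta * dt) * rng.standard_normal(np.shape(x))
```

Setting w = 0 recovers the unguided backward step, which is the baseline the reward improvement is measured against.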
Why we think this paper is great for you:
This research explores a unified framework for guided or controlled data generation with diffusion models. It is a strong match for your interest in the foundational and advanced aspects of Diffusion Models.
National Technical University
AI Summary - The text discusses various aspects of deep learning, including model architecture, training, optimization, and inference. [3]
- Model Training: The process that makes a DNN learn to perform a specific task, much like a student learns from practice and correction. [3]
- Batch Training: Instead of feeding individual data points one by one, models are trained on small groups of samples called batches. [3]
- Training often requires many epochs to fully learn the data’s patterns. [3]
- The text concludes that deep learning involves various steps from model architecture to inference, and optimization is crucial for efficient deployment of DNNs. [3]
- The text mentions several deep learning frameworks such as PyTorch, TensorFlow, JAX, and Hugging Face Hub. [3]
- Just as a person needs practice to get better at recognizing cats, the computer needs to be trained and optimized so that it can perform well in real-world situations. [3]
- Epochs: A single pass through the entire dataset is called an epoch. [2]
- The text does not provide a clear explanation of the differences between various model representations such as ONNX, TorchScript, TensorFlow SavedModel / GraphDef, etc. [1]
Abstract
The computational demands of modern Deep Neural Networks (DNNs) are immense and constantly growing. While training costs usually capture public attention, inference also contributes a significant computational, energy, and environmental footprint. Sparsity stands out as a critical mechanism for drastically reducing these resource demands. However, its potential remains largely untapped and is not yet fully incorporated in production AI systems. To bridge this gap, this work provides the necessary knowledge and insights for performance engineers keen to get involved in deep learning inference optimization. In particular, in this work we: a) discuss the various forms of sparsity that can be utilized in DNN inference, b) explain how the original dense computations translate to sparse kernels, c) provide an extensive bibliographic review of the state-of-the-art in the implementation of these kernels for CPUs and GPUs, d) discuss the availability of sparse datasets in support of sparsity-related research and development, e) explore the current software tools and frameworks that provide robust sparsity support, and f) present evaluation results of different implementations of the key SpMM and SDDMM kernels on CPU and GPU platforms. Ultimately, this paper aims to serve as a resource for performance engineers seeking to develop and deploy highly efficient sparse deep learning models in production.
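For orientation, the two kernels benchmarked in item (f) can be written in a few lines of SciPy: SpMM multiplies a sparse matrix by a dense one, and SDDMM evaluates a dense-dense product only at the nonzero positions of a sparse mask. Sizes and density below are arbitrary, and this is a reference formulation rather than an optimized CPU/GPU kernel.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
A = sp.random(1024, 1024, density=0.01, format="csr", random_state=0)  # sparse operand / mask
B = rng.standard_normal((1024, 64))
C = rng.standard_normal((1024, 64))

# SpMM: sparse x dense -> dense
Y = A @ B

# SDDMM: compute (B @ C.T) only where A is nonzero
rows, cols = A.nonzero()
vals = np.einsum("ij,ij->i", B[rows], C[cols])   # one dot product per stored entry
S = sp.csr_matrix((vals, (rows, cols)), shape=A.shape)
```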
Why we think this paper is great for you:
This paper investigates sparsity as a critical mechanism to manage the immense computational demands of Deep Neural Networks during inference. It directly relates to your interest in Deep Learning Optimization and efficient architectures.
Shanghai Artificial Intelligence Laboratory
AI Summary - The proposed model may require significant computational resources due to the complexity of the memory-augmentation mechanism. [3]
- A new memory-augmentation technique is presented for large language models (LLMs) that can learn and store information from multiple sources, enhancing their ability to retain and retrieve information. [3]
- Imagine you're trying to solve a complex math problem: you need to remember key concepts from earlier in the problem, but your brain keeps forgetting them; that's kind of like what happens with large language models (LLMs) when they try to understand and respond to questions or tasks that require remembering specific information. [3]
- The paper proposes a new memory-augmented model for large language models (LLMs) that can learn and store information from multiple sources. [2]
Abstract
Despite rapid progress in large-scale language and vision models, AI agents still suffer from a fundamental limitation: they cannot remember. Without reliable memory, agents catastrophically forget past experiences, struggle with long-horizon reasoning, and fail to operate coherently in multimodal or interactive environments. We introduce MemVerse, a model-agnostic, plug-and-play memory framework that bridges fast parametric recall with hierarchical retrieval-based memory, enabling scalable and adaptive multimodal intelligence. MemVerse maintains short-term memory for recent context while transforming raw multimodal experiences into structured long-term memories organized as hierarchical knowledge graphs. This design supports continual consolidation, adaptive forgetting, and bounded memory growth. To handle real-time demands, MemVerse introduces a periodic distillation mechanism that compresses essential knowledge from long-term memory into the parametric model, allowing fast, differentiable recall while preserving interpretability. Extensive experiments demonstrate that MemVerse significantly improves multimodal reasoning and continual learning efficiency, empowering agents to remember, adapt, and reason coherently across extended interactions.
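Since the abstract describes the mechanism only at a high level, the toy sketch below shows one way a short-term/long-term split with bounded growth and periodic consolidation could look. The class, method names, and FIFO-style eviction rule are hypothetical and are not MemVerse's API.

```python
from collections import deque

class ToyMemory:
    """Hypothetical short-term buffer plus bounded long-term store."""
    def __init__(self, short_capacity=32, long_capacity=1024):
        self.short = deque(maxlen=short_capacity)   # recent context, automatically bounded
        self.long = {}                              # key -> consolidated entry
        self.long_capacity = long_capacity

    def observe(self, key, item):
        self.short.append((key, item))

    def consolidate(self):
        # Periodically move short-term items into the long-term store,
        # evicting the oldest entries when the store is full (a crude
        # stand-in for adaptive forgetting).
        while self.short:
            key, item = self.short.popleft()
            self.long[key] = item
            if len(self.long) > self.long_capacity:
                self.long.pop(next(iter(self.long)))

    def recall(self, key):
        return self.long.get(key)
```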
Why we think this paper is great for you:
This paper introduces a multimodal memory system for lifelong learning agents, addressing limitations in large-scale language and vision models. It aligns well with your interest in Multimodal Learning and advanced Deep Learning Models.
NVIDIA Research
Abstract
Spoken conversational agents are converging toward voice-native LLMs. This tutorial distills the path from cascaded ASR/NLU to end-to-end, retrieval- and vision-grounded systems. We frame adaptation of text LLMs to audio, cross-modal alignment, and joint speech-text training; review datasets, metrics, and robustness across accents; and compare design choices (cascaded vs. E2E, post-ASR correction, streaming). We link industrial assistants to current open-domain and task-oriented agents, highlight reproducible baselines, and outline open problems in privacy, safety, and evaluation. Attendees leave with practical recipes and a clear systems-level roadmap.
Why we think this paper is great for you:
This tutorial distills the path towards voice-native Large Language Models and their adaptation to audio. It is highly relevant to your interests in both Large Language Models and Multimodal Learning.
Deep Learning Optimization
Georgia Institute of Technology
Abstract
We propose an always-feasible quadratic programming (QP) optimizer, FlexQP, which is based on an exact relaxation of the QP constraints. If the original constraints are feasible, then the optimizer finds the optimal solution to the original QP. On the other hand, if the constraints are infeasible, the optimizer identifies a solution that minimizes the constraint violation in a sparse manner. FlexQP scales favorably with respect to the problem dimension, is robust to both feasible and infeasible QPs with minimal assumptions on the problem data, and can be effectively warm-started. We subsequently apply deep unfolding to improve our optimizer through data-driven techniques, leading to an accelerated Deep FlexQP. By learning dimension-agnostic feedback policies for the parameters from a small number of training examples, Deep FlexQP generalizes to problems with larger dimensions and can optimize for many more iterations than it was initially trained for. Our approach outperforms two recently proposed state-of-the-art accelerated QP approaches on a suite of benchmark systems including portfolio optimization, classification, and regression problems. We provide guarantees on the expected performance of our deep QP optimizer through probably approximately correct (PAC) Bayes generalization bounds. These certificates are used to design an accelerated sequential quadratic programming solver that solves nonlinear optimal control and predictive safety filter problems faster than traditional approaches. Overall, our approach is very robust and greatly outperforms existing non-learning and learning-based optimizers in terms of both runtime and convergence to the optimal solution across multiple classes of nonlinear programs (NLPs).
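As a concrete picture of the "exact relaxation" idea (slack variables with a sparse penalty so the problem is always feasible), here is a hedged CVXPY sketch. The formulation, penalty weight rho, and function name are illustrative assumptions; FlexQP's actual solver, warm-starting, and deep-unfolded acceleration are not shown.

```python
import numpy as np
import cvxpy as cp

def relaxed_qp(P, q, A, b, rho=1e3):
    """Always-feasible relaxation (illustrative, not FlexQP itself):
    minimize 0.5 x'Px + q'x + rho*||s||_1  subject to  Ax <= b + s, s >= 0.
    P is assumed symmetric positive semidefinite.
    """
    n, m = P.shape[0], A.shape[0]
    x = cp.Variable(n)
    s = cp.Variable(m, nonneg=True)                 # sparse constraint-violation slack
    objective = 0.5 * cp.quad_form(x, P) + q @ x + rho * cp.norm1(s)
    cp.Problem(cp.Minimize(objective), [A @ x <= b + s]).solve()
    return x.value, s.value

# Toy usage: the constraints x <= 0 and x >= 1 are infeasible, yet a solution is returned.
P, q = np.eye(1), np.zeros(1)
A, b = np.array([[1.0], [-1.0]]), np.array([0.0, -1.0])
print(relaxed_qp(P, q, A, b))
```

With a large enough penalty, a feasible QP drives the slack to zero and the original solution is recovered, mirroring the exactness property claimed in the abstract.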
Deep Learning Models
Beihang University
Abstract
This article serves as the regression analysis lecture notes in the Intelligent Computing course cluster (including the courses of Artificial Intelligence, Data Mining, Machine Learning, and Pattern Recognition). It aims to provide students -- who are assumed to possess only basic university-level mathematics (i.e., with prerequisite courses in calculus, linear algebra, and probability theory) -- with a comprehensive and self-contained understanding of regression analysis without requiring any additional references. The lecture notes systematically introduce the fundamental concepts, modeling components, and theoretical foundations of regression analysis, covering linear regression, logistic regression, multinomial logistic regression, polynomial regression, basis-function models, kernel-based methods, and neural-network-based nonlinear regression. Core methodological topics include loss-function design, parameter-estimation principles, ordinary least squares, gradient-based optimization algorithms and their variants, as well as regularization techniques such as Ridge and LASSO regression. Through detailed mathematical derivations, illustrative examples, and intuitive visual explanations, the materials help students understand not only how regression models are constructed and optimized, but also how they reveal the underlying relationships between features and response variables. By bridging classical statistical modeling and modern machine-learning practice, these lecture notes aim to equip students with a solid conceptual and technical foundation for further study in advanced artificial intelligence models.
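As a companion to the estimators covered in the notes, here is a compact NumPy sketch of ordinary least squares and ridge regression in closed form (LASSO, which has no closed-form solution, is omitted). The variable names and toy data are our own.

```python
import numpy as np

def ols(X, y):
    # Ordinary least squares: beta = (X'X)^{-1} X'y, computed via lstsq for stability
    return np.linalg.lstsq(X, y, rcond=None)[0]

def ridge(X, y, lam=1.0):
    # Ridge regression: beta = (X'X + lam*I)^{-1} X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Toy usage
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)
print(ols(X, y), ridge(X, y, lam=10.0))
```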