Jiaming Yan, Jianchun Liu
Abstract
Mixture-of-Experts (MoE) has emerged as a promising architecture for modern
large language models (LLMs). However, massive parameters impose heavy GPU
memory (i.e., VRAM) demands, hindering the widespread adoption of MoE LLMs.
Offloading the expert parameters to CPU RAM offers an effective way to
alleviate the VRAM requirements for MoE inference. Existing approaches
typically cache a small subset of experts in VRAM and dynamically prefetch
experts from RAM during inference, leading to significant degradation in
inference speed due to the poor cache hit rate and substantial expert loading
latency. In this work, we propose MoEpic, an efficient MoE inference system
with a novel expert split mechanism. Specifically, each expert is vertically
divided into two segments: top and bottom. MoEpic caches the top segment of hot
experts, so that more experts can be stored under the limited VRAM budget,
thereby improving the cache hit rate. During each layer's inference, MoEpic
predicts and prefetches the activated experts for the next layer. Since the top
segments of cached experts are exempt from fetching, the loading time is
reduced, which allows efficient transfer-computation overlap. Nevertheless, the
performance of MoEpic critically depends on the cache configuration (i.e., each
layer's VRAM budget and expert split ratio). To this end, we propose a
divide-and-conquer algorithm based on fixed-point iteration for adaptive cache
configuration. Extensive experiments on popular MoE LLMs demonstrate that
MoEpic can save about half of the GPU cost, while lowering the inference
latency by about 37.51%-65.73% compared to the baselines.
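The sketch below gives a minimal PyTorch illustration of the idea of a vertically split expert: a two-matrix FFN expert is partitioned along its intermediate dimension into a GPU-resident top segment and a CPU-resident bottom segment that is transferred only when the expert is activated. The `SplitExpert` class, the choice of split axis, and the ReLU activation are assumptions made for illustration; MoEpic's actual segment layout, prefetching logic, and transfer scheduling are not reproduced here.

```python
import torch
import torch.nn as nn

class SplitExpert(nn.Module):
    """Illustrative vertical split of a two-matrix FFN expert along its
    intermediate dimension (an assumed split axis). The top slice stays
    resident in VRAM; the bottom slice lives in pinned CPU RAM and is
    copied in only when the expert fires. Assumes a CUDA device."""

    def __init__(self, w_up: torch.Tensor, w_down: torch.Tensor, split_ratio: float):
        super().__init__()
        d_ff = w_up.shape[0]                        # intermediate (hidden) dimension
        k = int(d_ff * split_ratio)                 # rows kept in the top segment
        # Top segment: cached on the GPU.
        self.up_top = nn.Parameter(w_up[:k].cuda(), requires_grad=False)
        self.down_top = nn.Parameter(w_down[:, :k].cuda(), requires_grad=False)
        # Bottom segment: pinned host memory, deliberately not registered as a
        # buffer so it is never moved to the GPU implicitly.
        self.up_bot_cpu = w_up[k:].pin_memory()
        self.down_bot_cpu = w_down[:, k:].pin_memory()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Partial output from the resident top segment.
        y = torch.relu(x @ self.up_top.T) @ self.down_top.T
        # Fetch the bottom segment; a real system would issue this copy a layer
        # ahead on a side CUDA stream so it overlaps with computation.
        up_bot = self.up_bot_cpu.to(x.device, non_blocking=True)
        down_bot = self.down_bot_cpu.to(x.device, non_blocking=True)
        # Because ReLU acts element-wise along the intermediate dimension, the
        # two partial FFNs sum to the original expert's output exactly.
        return y + torch.relu(x @ up_bot.T) @ down_bot.T
```

With this layout, a cache hit on the top segment means only the bottom rows must cross the CPU-GPU link, which is where the loading-time reduction described above comes from.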
AI Insights
- MoEpic splits each expert vertically into a lightweight top segment and a heavier bottom segment, allowing more experts to fit in limited VRAM.
- The top segments of hot experts are cached, boosting cache hit rates, while the bottom segments are fetched on demand during inference.
- A divide-and-conquer algorithm based on fixed-point iteration automatically tunes per-layer VRAM budgets and split ratios for optimal performance (see the sketch after this list).
- Dynamic prefetching of the next layer's activated experts overlaps data transfer with computation, cutting loading latency by up to 65%.
- MoEpic achieves up to 3.5× speedup over baseline MoE inference and can be combined with pruning or skipping techniques for further gains.
- The method assumes evenly distributed GPU resources and may struggle when inter-expert communication dominates, highlighting a key limitation.
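The cache-configuration step is only described at a high level above. As a rough illustration, the toy loop below alternates between re-balancing per-layer VRAM budgets and re-picking split ratios until the budgets stop changing, in the spirit of a fixed-point iteration. The cost model `est_latency`, the proportional re-balancing rule, and the coarse ratio grid are placeholders, not MoEpic's actual divide-and-conquer procedure.

```python
def configure_cache(num_layers, total_vram, est_latency,
                    ratio_grid=(0.25, 0.5, 0.75), iters=20, tol=1e-3):
    """Toy fixed-point-style search for per-layer VRAM budgets and expert
    split ratios. est_latency(layer, budget, ratio) is a placeholder cost
    model; MoEpic's objective and update rules are not reproduced here."""
    budgets = [total_vram / num_layers] * num_layers        # start from an even split
    ratios = [ratio_grid[len(ratio_grid) // 2]] * num_layers
    for _ in range(iters):
        lat = [est_latency(l, budgets[l], ratios[l]) for l in range(num_layers)]
        total = sum(lat)
        # Re-balance: give proportionally more VRAM to latency-dominant layers.
        new_budgets = [total_vram * lat[l] / total for l in range(num_layers)]
        # Refine each layer's split ratio independently (this per-layer search is
        # where a divide-and-conquer strategy over the ratio interval would sit).
        ratios = [min(ratio_grid, key=lambda r: est_latency(l, new_budgets[l], r))
                  for l in range(num_layers)]
        converged = max(abs(a - b) for a, b in zip(new_budgets, budgets)) < tol
        budgets = new_budgets
        if converged:
            break
    return budgets, ratios
```

A realistic cost model would fold in each layer's expected cache hit rate and the measured CPU-GPU transfer bandwidth rather than a hand-written latency estimate.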
Abstract
Linear Mode Connectivity (LMC) is a notable phenomenon in the loss landscapes
of neural networks, wherein independently trained models have been observed to
be connected--up to permutation symmetries--by linear paths in parameter space
along which the loss remains consistently low. This observation challenges
classical views of non-convex optimization and has implications for model
ensembling, generalization, and our understanding of neural loss geometry.
Inspired by recent studies on LMC in standard neural networks, we
systematically investigate this phenomenon within Mixture-of-Experts (MoE)
architectures--a class of models known for their scalability and computational
efficiency, which combine traditional neural networks--referred to as
experts--through a learnable gating mechanism. We begin by conducting a
comprehensive analysis of both dense and sparse gating regimes, demonstrating
that the symmetries inherent to MoE architectures are fully characterized by
permutations acting on both the expert components and the gating function.
Building on these foundational findings, we propose a matching algorithm that
enables alignment between independently trained MoEs, thereby facilitating the
discovery of LMC. Finally, we empirically validate the presence of LMC using
our proposed algorithm across diverse MoE configurations--including dense,
sparse, and shared-expert variants--under a wide range of model settings and
datasets of varying scales and modalities. Our results confirm the existence of
LMC in MoE architectures and offer fundamental insights into the functional
landscape and optimization dynamics of deep learning models.
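As a rough illustration of the matching-then-interpolation recipe described above, the sketch below aligns the experts of two independently trained MoE layers by solving a linear assignment over flattened expert weights, permutes the router's rows to match, and then measures the loss along the straight line between the two parameter vectors. The `.experts`/`.gate` interface, the dot-product similarity, and the bias-free gate are illustrative assumptions; the paper's matching algorithm handles the full set of MoE symmetries.

```python
import copy
import torch
from scipy.optimize import linear_sum_assignment

def match_experts(model_a, model_b):
    """Toy expert matching: reorder model_b's experts (and the corresponding
    rows of its gate) to best align with model_a before interpolating.
    Assumes both models expose `.experts` (an nn.ModuleList of same-shape
    modules) and a bias-free `.gate` nn.Linear whose outputs index experts."""
    n = len(model_a.experts)
    cost = torch.zeros(n, n)
    for i, ea in enumerate(model_a.experts):
        va = torch.cat([p.detach().flatten() for p in ea.parameters()])
        for j, eb in enumerate(model_b.experts):
            vb = torch.cat([p.detach().flatten() for p in eb.parameters()])
            cost[i, j] = -torch.dot(va, vb)      # negate: the solver minimizes cost
    _, perm = linear_sum_assignment(cost.numpy())
    perm = perm.tolist()
    model_b.experts = torch.nn.ModuleList([model_b.experts[j] for j in perm])
    with torch.no_grad():
        # Permute the gate's rows so routing stays consistent with the new order.
        model_b.gate.weight.copy_(model_b.gate.weight[perm])
    return model_b

def loss_along_path(model_a, model_b, loss_fn, alphas):
    """Evaluate loss_fn on models built from (1 - a) * theta_a + a * theta_b."""
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    probe = copy.deepcopy(model_a)
    losses = []
    for a in alphas:
        # Interpolate floating-point tensors; leave integer buffers untouched.
        mixed = {k: (1 - a) * sd_a[k] + a * sd_b[k] if sd_a[k].is_floating_point()
                 else sd_a[k]
                 for k in sd_a}
        probe.load_state_dict(mixed)
        losses.append(loss_fn(probe))            # e.g. average loss on a held-out batch
    return losses
```

A flat loss profile returned by `loss_along_path` after calling `match_experts` is the kind of evidence for LMC that the abstract reports; without the matching step, the same path typically shows a pronounced loss barrier.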