ByteDance Seed, UC Berkeley
Abstract
Modern large language models leverage Mixture-of-Experts (MoE) architectures
for efficient scaling, but face a critical challenge: functionally similar
experts are often selected simultaneously, creating redundant computation and
limiting effective model capacity. Existing auxiliary balance loss methods
improve token distribution but fail to address the underlying expert diversity
problem. We introduce GatePro, a novel parameter-free method that directly
promotes expert selection diversity. GatePro identifies the most similar expert
pairs and introduces localized competition mechanisms, preventing redundant
expert co-activation while maintaining natural expert specialization. Our
comprehensive evaluation demonstrates GatePro's effectiveness across model
scales and benchmarks. Analysis shows that GatePro achieves enhanced expert
diversity: experts develop more distinct and complementary capabilities and
avoid functional redundancy. The approach can be deployed as a hot-swappable
component during any training phase without adding learnable parameters,
offering a practical solution for improving MoE effectiveness.
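The abstract states the mechanism only at a high level (find the most similar expert pair, then let its members compete locally), so below is a minimal, hypothetical PyTorch-style sketch of that idea. The function name, the choice of cosine similarity between gating-weight rows as the similarity measure, and per-token hard suppression as the competition rule are all assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def gate_with_local_competition(router_logits: torch.Tensor,
                                gate_weight: torch.Tensor,
                                top_k: int = 2,
                                enabled: bool = True):
    """Top-k routing with a GatePro-style localized competition step (sketch).

    router_logits: [num_tokens, num_experts] pre-softmax gating scores.
    gate_weight:   [num_experts, hidden_dim] rows of the gating projection,
                   used here as a proxy for expert similarity (an assumption).
    """
    logits = router_logits.clone()
    if enabled:
        # Cosine similarity between the experts' gating vectors.
        w = F.normalize(gate_weight, dim=-1)
        sim = w @ w.t()
        sim.fill_diagonal_(-1.0)  # exclude self-similarity

        # Identify the single most similar expert pair (i, j).
        i, j = divmod(int(sim.argmax()), sim.size(1))

        # Localized competition: for each token, only the higher-scoring member
        # of the pair stays eligible for top-k selection.
        i_wins = logits[:, i] >= logits[:, j]
        logits[:, j] = logits[:, j].masked_fill(i_wins, float("-inf"))
        logits[:, i] = logits[:, i].masked_fill(~i_wins, float("-inf"))

    top_vals, top_idx = logits.topk(top_k, dim=-1)
    return top_idx, F.softmax(top_vals, dim=-1)

# Example with illustrative shapes: 8 tokens, 16 experts, hidden size 32.
idx, weights = gate_with_local_competition(torch.randn(8, 16), torch.randn(16, 32))
```

Because the competition acts only on routing scores, it adds no learnable parameters, which is consistent with the parameter-free claim above.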
AI Insights
- GatePro’s competitive propagation forces experts to “compete for the spotlight,” boosting utilization patterns that linger after the mechanism is off.
- Because it’s parameter-free, you can swap GatePro on or off during training without tweaking hyper-parameters; it’s just a flag change (see the toggle sketch after this list).
- Longer GatePro exposure sharpens expert specialization, turning a crowd of similar models into a choir of complementary voices.
- The method levels the token‑distribution playing field, preventing a few experts from hogging all the data.
- A “training legacy effect” means GatePro’s benefits persist, giving a performance boost without extra runtime cost at inference.
- For deeper dives, see the MoE scaling papers: 2203.16535, 2106.14448, 2004.04722, 1905.09790, and 1802.05365.
- Note: GatePro can be computationally heavy during training, so plan resources accordingly.
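As a purely illustrative companion to the flag-change point above, toggling the mechanism mid-training could look like the loop below. It reuses the hypothetical gate_with_local_competition sketch from the earlier block, and the on/off schedule is an assumption, not a recommendation from the paper.

```python
import torch

# Illustrative hot-swap schedule (assumed): competition on for the first half
# of training, then switched off with no other changes.
num_tokens, num_experts, hidden = 8, 16, 32
for step in range(100):
    enabled = step < 50
    router_logits = torch.randn(num_tokens, num_experts)  # stand-in gate scores
    gate_weight = torch.randn(num_experts, hidden)        # stand-in gate matrix
    idx, w = gate_with_local_competition(
        router_logits, gate_weight, top_k=2, enabled=enabled
    )
    # ... dispatch tokens to experts `idx` with weights `w`, backprop, step ...
```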
University of Science, Ho
Abstract
We develop a unified statistical framework for softmax-gated Gaussian mixture
of experts (SGMoE) that addresses three long-standing obstacles in parameter
estimation and model selection: (i) non-identifiability of gating parameters up
to common translations, (ii) intrinsic gate-expert interactions that induce
coupled differential relations in the likelihood, and (iii) the tight
numerator-denominator coupling in the softmax-induced conditional density. Our
approach introduces Voronoi-type loss functions aligned with the gate-partition
geometry and establishes finite-sample convergence rates for the maximum
likelihood estimator (MLE). In over-specified models, we reveal a link between
the MLE's convergence rate and the solvability of an associated system of
polynomial equations characterizing near-nonidentifiable directions. For model
selection, we adapt dendrograms of mixing measures to SGMoE, yielding a
consistent, sweep-free selector of the number of experts that attains
pointwise-optimal parameter rates under overfitting while avoiding multi-size
training. Simulations on synthetic data corroborate the theory, accurately
recovering the expert count and achieving the predicted rates for parameter
estimation while closely approximating the regression function. Under model
misspecification (e.g., $\epsilon$-contamination), the dendrogram selection
criterion is robust, recovering the true number of mixture components, while
the Akaike information criterion, the Bayesian information criterion, and the
integrated completed likelihood tend to overselect as sample size grows. On a
maize proteomics dataset of drought-responsive traits, our dendrogram-guided
SGMoE selects two experts, exposes a clear mixing-measure hierarchy, stabilizes
the likelihood early, and yields interpretable genotype-phenotype maps,
outperforming standard criteria without multi-size training.
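For readers unfamiliar with the model class, the standard SGMoE conditional density (written here in generic notation; the paper's own parameterization may differ) makes obstacle (i), the translation non-identifiability of the gating parameters, easy to see:

$$
p(y \mid x) \;=\; \sum_{k=1}^{K}
\frac{\exp\!\left(\beta_{0k} + \beta_{1k}^{\top} x\right)}
     {\sum_{j=1}^{K} \exp\!\left(\beta_{0j} + \beta_{1j}^{\top} x\right)}\,
\mathcal{N}\!\left(y \mid a_k^{\top} x + b_k,\; \sigma_k^{2}\right).
$$

Replacing every $(\beta_{0k}, \beta_{1k})$ by $(\beta_{0k} + c_0,\, \beta_{1k} + c_1)$ for a common $(c_0, c_1)$ leaves each softmax weight, and hence $p(y \mid x)$, unchanged, which is exactly the non-identifiability up to common translations noted in (i).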
AI Insights
- Theorem 4 shows that for $\kappa > K_0$ the Voronoi loss $h_N(\kappa)$ decays as $(\log N / N)^{1/\bar{r}(\widehat{G}_N)}$, making $\mathrm{DSC}_N(\kappa)$ suboptimal.
- Thus $\mathrm{DSC}_N(\kappa)$ is minimized uniquely at $\kappa = K_0$, proving the dendrogram selector $\widehat{K}_N$ converges to the true expert count without sweeps.
- The MLE’s rate in over‑specified SGMoE links to solvability of polynomial equations that capture near‑nonidentifiable directions.
- With $\epsilon$-contamination, the dendrogram criterion stays robust, whereas AIC, BIC, and ICL over-select as $N$ grows.
- A maize proteomics case shows a two‑expert SGMoE stabilizes likelihood early, exposes a clear mixing‑measure hierarchy, and yields interpretable genotype‑phenotype maps.