🎯 Top Personalized Recommendations
Peking University, Byteda
Why we think this paper is great for you:
This paper explores scalable and privacy-preserving inference for Mixture of Experts architectures, which is highly relevant to your focus on advanced model architectures. It also delves into the practical deployment of large language models.
Abstract
Private large language model (LLM) inference based on cryptographic
primitives offers a promising path towards privacy-preserving deep learning.
However, existing frameworks only support dense LLMs like LLaMA-1 and struggle
to scale to mixture-of-experts (MoE) architectures. The key challenge comes
from securely evaluating the dynamic routing mechanism in MoE layers, which may
reveal sensitive input information if not fully protected. In this paper, we
propose CryptoMoE, the first framework that enables private, efficient, and
accurate inference for MoE-based models. CryptoMoE balances expert loads to
protect expert routing information and proposes novel protocols for secure
expert dispatch and combine. CryptoMoE also develops a confidence-aware token
selection strategy and a batch matrix multiplication protocol to improve
accuracy and efficiency further. Extensive experiments on DeepSeekMoE-16.4B,
OLMoE-6.9B, and QWenMoE-14.3B show that CryptoMoE achieves $2.8\sim3.5\times$
end-to-end latency reduction and $2.9\sim4.3\times$ communication reduction
over a dense baseline with minimal accuracy loss. We also adapt CipherPrune
(ICLR'25) for MoE inference and demonstrate CryptoMoE can reduce the
communication by up to $4.3 \times$. Code is available at:
https://github.com/PKU-SEC-Lab/CryptoMoE.
AI Summary
- Experimental evaluations demonstrate CryptoMoE achieves 2.8-3.5x end-to-end latency reduction and 2.9-4.3x communication reduction over a dense baseline, while retaining 99.2% of original accuracy on average. [3]
- The framework proposes a novel confidence-aware secure dispatch protocol that re-ranks tokens by routing confidence and selects the top-t tokens for each expert, mitigating accuracy degradation from token dropping. [2]
- The selection of 't' (number of tokens per expert) is critical for balancing accuracy and efficiency, with t=2mk/n empirically providing a robust trade-off across various model configurations and sequence lengths. [2]
- Confidence-aware selection strategy: A method to mitigate accuracy loss from token dropping by re-ranking tokens assigned to an expert based on their routing confidence and retaining only the top-t tokens. [2]
- CryptoMoE introduces Inference-Time Balanced Expert Routing, ensuring each expert processes a fixed number of tokens (t) to prevent routing information leakage, a key challenge for privacy-preserving MoE inference. [1]
- A lightweight and secure one-hot-based combine protocol is designed to efficiently aggregate expert outputs and reconstruct token-wise results while preserving privacy and addressing token reordering challenges. [1]
- CryptoMoE significantly improves computational efficiency by introducing an efficient Batch Ciphertext-Plaintext Matrix Multiplication protocol, which packs partial token embeddings from all experts into a single ciphertext, reducing costly HE rotations from O(nd1) to O(d1). [1]
- Compared to an adaptation of CipherPrune for MoE, CryptoMoE achieves up to 4.3x communication reduction and 2.4x latency reduction, highlighting the efficiency of its custom dispatch and combine protocols. [1]
- Inference-Time Balanced Expert Routing: A core concept in CryptoMoE where each expert processes a fixed number of tokens (t), regardless of actual routing results, to ensure expert contributions are input-independent and preserve privacy. [1]
- Batch Ciphertext-Plaintext Matrix Multiplication (Batch MatMul) protocol: An optimization that packs partial token embeddings from multiple experts into a single ciphertext, reducing the number of expensive Homomorphic Encryption (HE) rotation operations in linear layer computations. [1]
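The balanced routing and confidence-aware selection described in the bullets above can be illustrated in plaintext. The sketch below is not CryptoMoE's secure protocol: it involves no cryptography, and the function names (`tokens_per_expert`, `balanced_dispatch`), the use of `None` as a padding slot, and rounding t up to an integer are assumptions made for illustration. It only shows the input-independent shape of the result: every expert receives exactly t slots, whatever the routing scores are.

```python
import math


def tokens_per_expert(m, k, n):
    """t = 2mk/n from the summary; rounding up is an assumption of this sketch."""
    return math.ceil(2 * m * k / n)


def balanced_dispatch(scores, k, t):
    """Plaintext illustration of fixed-capacity, confidence-aware dispatch.

    scores[i][e] is the routing confidence of token i for expert e.
    Returns one list of exactly t slots per expert; a slot holds a token
    index or None (hypothetical padding marker).
    """
    n_experts = len(scores[0])

    # Step 1: ordinary top-k routing, each token picks its k best experts.
    candidates = [[] for _ in range(n_experts)]
    for tok, row in enumerate(scores):
        top = sorted(range(n_experts), key=lambda e: row[e], reverse=True)[:k]
        for e in top:
            candidates[e].append((row[e], tok))

    # Step 2: each expert re-ranks its candidates by confidence, keeps the
    # top t, and pads so that its load never depends on the input.
    slots = []
    for cand in candidates:
        cand.sort(key=lambda pair: pair[0], reverse=True)
        kept = [tok for _, tok in cand[:t]]
        kept += [None] * (t - len(kept))
        slots.append(kept)
    return slots


# Toy routing scores: 4 tokens, 2 experts, top-1 routing, capacity t = 2.
scores = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.6, 0.4],
    [0.3, 0.7],
]
slots = balanced_dispatch(scores, k=1, t=2)
print(slots)
```

In this toy run three tokens prefer expert 0, so the least confident of them (token 2) is dropped, while expert 1 is padded up to the same capacity.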
University of Coimbra
Why we think this paper is great for you:
This paper offers valuable insights into optimizing deep generative models, particularly diffusion architectures, by comparing evolutionary algorithms with Adam. You will find its discussion on optimization techniques highly pertinent.
Abstract
Deep generative models, especially diffusion architectures, have transformed
image generation; however, they are challenging to control and optimize for
specific goals without expensive retraining. Embedding Space Exploration,
especially with Evolutionary Algorithms (EAs), has been shown to be a promising
method for optimizing image generation, particularly within Diffusion Models.
Therefore, in this work, we study the performance of an evolutionary
optimization method, namely Separable Covariance Matrix Adaptation Evolution
Strategy (sep-CMA-ES), against the widely adopted Adaptive Moment Estimation
(Adam), applied to Stable Diffusion XL Turbo's prompt embedding vector. The
evaluation of images combines the LAION Aesthetic Predictor V2 with CLIPScore
into a weighted fitness function, allowing flexible trade-offs between visual
appeal and adherence to prompts. Experiments on a subset of the Parti Prompts
(P2) dataset showcase that sep-CMA-ES consistently yields superior improvements
in aesthetic and alignment metrics in comparison to Adam. Results indicate that
the evolutionary method provides efficient, gradient-free optimization for
diffusion models, enhancing controllability without the need for fine-tuning.
This study emphasizes the potential of evolutionary methods for embedding space
exploration of deep generative models and outlines future research directions.
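The abstract does not spell out the weighted fitness function, so the following is only a minimal sketch of one plausible form: a convex combination of the two scores. The function name `weighted_fitness`, the parameter `w_aesthetic`, and the assumption that both scores have already been rescaled to comparable ranges are all hypothetical.

```python
def weighted_fitness(aesthetic, clip_score, w_aesthetic=0.5):
    """Blend visual appeal and prompt adherence into one scalar.

    Both inputs are assumed to be pre-normalized to comparable ranges;
    the normalization used in the paper is not stated in the abstract.
    """
    if not 0.0 <= w_aesthetic <= 1.0:
        raise ValueError("w_aesthetic must lie in [0, 1]")
    return w_aesthetic * aesthetic + (1.0 - w_aesthetic) * clip_score


# Rank three hypothetical candidates under a weight that favors aesthetics.
candidates = {"a": (0.9, 0.2), "b": (0.5, 0.6), "c": (0.4, 0.9)}
ranked = sorted(
    candidates,
    key=lambda name: weighted_fitness(*candidates[name], w_aesthetic=0.8),
    reverse=True,
)
print(ranked)
```

Because the fitness is just a scalar per image, either a gradient-free optimizer or a gradient-based one can consume it, which is what makes the comparison in the abstract possible.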
Harvard, UW
Why we think this paper is great for you:
This paper investigates optimal inference schedules for masked diffusion models, directly addressing efficiency challenges in these powerful generative models. It provides crucial information on optimizing their performance.
Abstract
A major bottleneck of standard auto-regressive large language models is that
their inference process is inherently sequential, resulting in very long and
costly inference times. To circumvent this, practitioners proposed a class of
language models called diffusion language models, of which the masked diffusion
model (MDM) is the most successful. The MDM is able to sample tokens
out-of-order and, ostensibly, many tokens at once and in parallel. However,
there is very limited rigorous understanding of how much parallel sampling
these models can perform without noticeable degradation in their sampling
performance. Prior work of Li and Cai obtained some preliminary bounds, but
these are not tight for many natural classes of distributions. In this work, we
give a new, exact characterization of the expected divergence between the true
distribution and the sampled distribution, for any distribution and any
unmasking schedule for the sampler, showing an elegant connection to the theory
of univariate function approximation.
By leveraging this connection, we then attain a number of novel lower and
upper bounds for this problem. While the connection to function approximation
in principle gives the optimal unmasking schedule for any distribution, we show
that it is in general impossible to compete with it without strong a priori
knowledge of the distribution, even in seemingly benign settings. However, we
also demonstrate new upper bounds and new sampling schedules in terms of
well-studied information-theoretic properties of the base distribution, namely,
its total correlation and dual total correlation, which show that in some
natural settings, one can sample in $O(\log n)$ steps without any visible loss
in performance, where $n$ is the total sequence length.
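For readers unfamiliar with the two quantities named in the abstract, the sketch below computes them for a small discrete joint distribution using their standard definitions: total correlation is the sum of the marginal entropies minus the joint entropy, and dual total correlation is the joint entropy minus the sum of each variable's entropy conditioned on all the others. The dictionary representation and the function names are choices made for this illustration, not anything taken from the paper.

```python
import math
from collections import defaultdict


def entropy(dist):
    """Shannon entropy in bits of a {outcome: probability} mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginal(joint, keep):
    """Marginalize a joint distribution onto the coordinates in `keep`."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in keep)] += p
    return out


def total_correlation(joint):
    """TC(X) = sum_i H(X_i) - H(X)."""
    n = len(next(iter(joint)))
    return sum(entropy(marginal(joint, [i])) for i in range(n)) - entropy(joint)


def dual_total_correlation(joint):
    """DTC(X) = H(X) - sum_i H(X_i | X_{-i}), with H(X_i | X_{-i}) = H(X) - H(X_{-i})."""
    n = len(next(iter(joint)))
    h = entropy(joint)
    conditional = sum(
        h - entropy(marginal(joint, [j for j in range(n) if j != i]))
        for i in range(n)
    )
    return h - conditional


# Two perfectly correlated bits versus two independent fair bits.
copied = {(0, 0): 0.5, (1, 1): 0.5}
independent = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
print(total_correlation(copied), dual_total_correlation(copied))
print(total_correlation(independent), dual_total_correlation(independent))
```

Both quantities vanish for independent coordinates and grow as the coordinates become more dependent on one another.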
School of Computer Science
Why we think this paper is great for you:
This paper introduces a novel Prompt-Expert Mixture Framework, offering a new perspective on scalable and generalizable graph foundation models. Its architectural approach will be of particular interest to you.
Abstract
Graph Neural Networks (GNNs) have demonstrated impressive performance on
task-specific benchmarks, yet their ability to generalize across diverse
domains and tasks remains limited. Existing approaches often struggle with
negative transfer, scalability issues, and high adaptation costs. To address
these challenges, we propose GMoPE (Graph Mixture of Prompt-Experts), a novel
framework that seamlessly integrates the Mixture-of-Experts (MoE) architecture
with prompt-based learning for graphs. GMoPE leverages expert-specific prompt
vectors and structure-aware MoE routing to enable each expert to specialize in
distinct subdomains and dynamically contribute to predictions. To promote
diversity and prevent expert collapse, we introduce a soft orthogonality
constraint across prompt vectors, encouraging expert specialization and
facilitating a more balanced expert utilization. Additionally, we adopt a
prompt-only fine-tuning strategy that significantly reduces spatiotemporal
complexity during transfer. We validate GMoPE through extensive experiments
under various pretraining strategies and multiple downstream tasks. Results
show that GMoPE consistently outperforms state-of-the-art baselines and
achieves performance comparable to full parameter fine-tuning while requiring
only a fraction of the adaptation overhead. Our work provides a principled and
scalable framework for advancing generalizable and efficient graph foundation
models.
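The abstract does not give the exact form of the soft orthogonality constraint. One common choice, assumed here purely for illustration, is to penalize the squared cosine similarity between every pair of prompt vectors, so the penalty is zero when the prompts are mutually orthogonal and grows as they align. The function name and the requirement that prompts be non-zero are assumptions of this sketch.

```python
import math


def soft_orthogonality_penalty(prompts):
    """Sum of squared pairwise cosine similarities between prompt vectors.

    `prompts` is a list of equal-length, non-zero vectors (one per expert).
    This is an assumed form of the regularizer, not the paper's definition.
    """
    # Normalize each prompt so the penalty depends on direction only.
    unit = []
    for vec in prompts:
        norm = math.sqrt(sum(x * x for x in vec))
        unit.append([x / norm for x in vec])

    penalty = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            penalty += cos * cos
    return penalty


orthogonal = soft_orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]])
collapsed = soft_orthogonality_penalty([[1.0, 0.0], [2.0, 0.0]])
print(orthogonal, collapsed)
```

Added to a training loss with a small coefficient, a term of this kind discourages experts from collapsing onto the same prompt direction without forcing exact orthogonality.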
Wrocław University of Sc
Why we think this paper is great for you:
This paper presents a new family of large language models, providing a comprehensive overview of their development and characteristics. It offers direct insights into the construction and capabilities of significant deep learning models.
Abstract
Large Language Models (LLMs) play a central role in modern artificial
intelligence, yet their development has been primarily focused on English,
resulting in limited support for other languages. We present PLLuM (Polish
Large Language Model), the largest open-source family of foundation models
tailored specifically for the Polish language. Developed by a consortium of
major Polish research institutions, PLLuM addresses the need for high-quality,
transparent, and culturally relevant language models beyond the English-centric
commercial landscape. We describe the development process, including the
construction of a new 140-billion-token Polish text corpus for pre-training, a
77k custom instructions dataset, and a 100k preference optimization dataset. A
key component is a Responsible AI framework that incorporates strict data
governance and a hybrid module for output correction and safety filtering. We
detail the models' architecture, training procedures, and alignment techniques
for both base and instruction-tuned variants, and demonstrate their utility in
a downstream task within public administration. By releasing these models
publicly, PLLuM aims to foster open research and strengthen sovereign AI
technologies in Poland.
UCLA, Columbia University
Why we think this paper is great for you:
This paper focuses on robust multimodal spatiotemporal learning using conditioned neural fields, addressing challenges in integrating diverse data types. You will appreciate its approach to handling complex real-world data.
Abstract
Multimodal spatiotemporal learning on real-world experimental data is
constrained by two challenges: within-modality measurements are sparse,
irregular, and noisy (QA/QC artifacts) but cross-modally correlated; the set of
available modalities varies across space and time, shrinking the usable record
unless models can adapt to arbitrary subsets at train and test time. We propose
OmniField, a continuity-aware framework that learns a continuous neural field
conditioned on available modalities and iteratively fuses cross-modal context.
A multimodal crosstalk block architecture paired with iterative cross-modal
refinement aligns signals prior to the decoder, enabling unified
reconstruction, interpolation, forecasting, and cross-modal prediction without
gridding or surrogate preprocessing. Extensive evaluations show that OmniField
consistently outperforms eight strong multimodal spatiotemporal baselines.
Under heavy simulated sensor noise, performance remains close to clean-input
levels, highlighting robustness to corrupted measurements.
University of Technology
Why we think this paper is great for you:
This paper explores enhancing multimodal recommendations through Vision-Language Models and information-aware fusion techniques. It provides valuable methods for improving representation quality by combining different content sources.
Abstract
Recent advances in multimodal recommendation (MMR) have shown that
incorporating rich content sources such as images and text can lead to
significant gains in representation quality. However, existing methods often rely
on coarse visual features and uncontrolled fusion, leading to redundant or
misaligned representations. As a result, visual encoders often fail to capture
salient, item-relevant semantics, limiting their contribution in multimodal
fusion. From an information-theoretic perspective, effective fusion should
balance the unique, shared, and redundant information across modalities,
preserving complementary cues while avoiding correlation bias. This paper
presents VLIF, a vision-language and information-theoretic fusion framework
that enhances multimodal recommendation through two key components. (i) A
VLM-based visual enrichment module generates fine-grained, title-guided
descriptions to transform product images into semantically aligned
representations. (ii) An information-aware fusion module, inspired by Partial
Information Decomposition (PID), disentangles redundant and synergistic signals
across modalities for controlled integration. Experiments on three Amazon
datasets demonstrate that VLIF consistently outperforms recent multimodal
baselines and substantially strengthens the contribution of visual features.
Deep Learning Architectures
City St Georges, Univer
Abstract
Artificial Intelligence (AI) is a powerful new language of science as
evidenced by recent Nobel Prizes in chemistry and physics that recognized
contributions to AI applied to those areas. Yet, this new language lacks
semantics, which makes AI's scientific discoveries unsatisfactory at best. With
the purpose of uncovering new facts but also improving our understanding of the
world, AI-based science requires formalization through a framework capable of
translating insight into comprehensible scientific knowledge. In this paper, we
argue that logic offers an adequate framework. In particular, we use logic in a
neurosymbolic framework to offer a much needed semantics for deep learning, the
neural network-based technology of current AI. Deep learning and neurosymbolic
AI lack a general set of conditions to ensure that desirable properties are
satisfied. Instead, there is a plethora of encoding and knowledge extraction
approaches designed for particular cases. To rectify this, we introduced a
framework for semantic encoding, making explicit the mapping between neural
networks and logic, and characterizing the common ingredients of the various
existing approaches. In this paper, we describe succinctly and exemplify how
logical semantics and neural networks are linked through this framework, we
review some of the most prominent approaches and techniques developed for
neural encoding and knowledge extraction, provide a formal definition of our
framework, and discuss some of the difficulties of identifying a semantic
encoding in practice in light of analogous problems in the philosophy of mind.
Korea Advanced Institute
Abstract
We present the bulk-boundary decomposition as a new framework for
understanding the training dynamics of deep neural networks. Starting from the
stochastic gradient descent formulation, we show that the Lagrangian can be
reorganized into a data-independent bulk term and a data-dependent boundary
term. The bulk captures the intrinsic dynamics set by network architecture and
activation functions, while the boundary reflects stochastic interactions from
training samples at the input and output layers. This decomposition exposes the
local and homogeneous structure underlying deep networks. As a natural
extension, we develop a field-theoretic formulation of neural dynamics based on
this decomposition.