Hi J34Nc4Rl0+Deep Learning,

Your personalized paper recommendations for 03 to 07 November, 2025.

Dear user, for this week we added the possiblity to further personalize your results by adding a personalized description of yourself.

Login in our website and head to the profile tab. There provide any details you want like your profession, age, background. That is then taken into account for the language models to generate something tailored for you.

🎯 Top Personalized Recommendations

CryptoMoE: Privacy-Preserving and Scalable Mixture of Experts Inference via Balanced Expert Routing

Peking University, Byteda

Why we think this paper is great for you:
This paper explores scalable and privacy-preserving inference for Mixture of Experts architectures, which is highly relevant to your focus on advanced model architectures. It also delves into the practical deployment of large language models.

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

Abstract
Private large language model (LLM) inference based on cryptographic primitives offers a promising path towards privacy-preserving deep learning. However, existing frameworks only support dense LLMs like LLaMA-1 and struggle to scale to mixture-of-experts (MoE) architectures. The key challenge comes from securely evaluating the dynamic routing mechanism in MoE layers, which may reveal sensitive input information if not fully protected. In this paper, we propose CryptoMoE, the first framework that enables private, efficient, and accurate inference for MoE-based models. CryptoMoE balances expert loads to protect expert routing information and proposes novel protocols for secure expert dispatch and combine. CryptoMoE also develops a confidence-aware token selection strategy and a batch matrix multiplication protocol to improve accuracy and efficiency further. Extensive experiments on DeepSeekMoE-16.4B, OLMoE-6.9B, and QWenMoE-14.3B show that CryptoMoE achieves $2.8\sim3.5\times$ end-to-end latency reduction and $2.9\sim4.3\times$ communication reduction over a dense baseline with minimum accuracy loss. We also adapt CipherPrune (ICLR'25) for MoE inference and demonstrate CryptoMoE can reduce the communication by up to $4.3 \times$. Code is available at: https://github.com/PKU-SEC-Lab/CryptoMoE.

AI Summary

Experimental evaluations demonstrate CryptoMoE achieves 2.8-3.5x end-to-end latency reduction and 2.9-4.3x communication reduction over a dense baseline, while retaining 99.2% of original accuracy on average. [3]
The framework proposes a novel confidence-aware secure dispatch protocol that re-ranks tokens by routing confidence and selects the top-t tokens for each expert, mitigating accuracy degradation from token dropping. [2]
The selection of 't' (number of tokens per expert) is critical for balancing accuracy and efficiency, with t=2mk/n empirically providing a robust trade-off across various model configurations and sequence lengths. [2]
Confidence-aware selection strategy: A method to mitigate accuracy loss from token dropping by re-ranking tokens assigned to an expert based on their routing confidence and retaining only the top-t tokens. [2]
CryptoMoE introduces Inference-Time Balanced Expert Routing, ensuring each expert processes a fixed number of tokens (t) to prevent routing information leakage, a key challenge for privacy-preserving MoE inference. [1]
A lightweight and secure one-hot-based combine protocol is designed to efficiently aggregate expert outputs and reconstruct token-wise results while preserving privacy and addressing token reordering challenges. [1]
CryptoMoE significantly improves computational efficiency by introducing an efficient Batch Ciphertext-Plaintext Matrix Multiplication protocol, which packs partial token embeddings from all experts into a single ciphertext, reducing costly HE rotations from O(nd1) to O(d1). [1]
Compared to an adaptation of CipherPrune for MoE, CryptoMoE achieves up to 4.3x communication reduction and 2.4x latency reduction, highlighting the efficiency of its custom dispatch and combine protocols. [1]
Inference-Time Balanced Expert Routing: A core concept in CryptoMoE where each expert processes a fixed number of tokens (t), regardless of actual routing results, to ensure expert contributions are input-independent and preserve privacy. [1]
Batch Ciphertext-Plaintext Matrix Multiplication (Batch MatMul) protocol: An optimization that packs partial token embeddings from multiple experts into a single ciphertext, reducing the number of expensive Homomorphic Encryption (HE) rotation operations in linear layer computations. [1]

Evolutionary Optimization Trumps Adam Optimization on Embedding Space Exploration

University of Coimbra

Why we think this paper is great for you:
This paper offers valuable insights into optimizing deep generative models, particularly diffusion architectures, by comparing evolutionary algorithms with Adam. You will find its discussion on optimization techniques highly pertinent.

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

Abstract
Deep generative models, especially diffusion architectures, have transformed image generation; however, they are challenging to control and optimize for specific goals without expensive retraining. Embedding Space Exploration, especially with Evolutionary Algorithms (EAs), has been shown to be a promising method for optimizing image generation, particularly within Diffusion Models. Therefore, in this work, we study the performance of an evolutionary optimization method, namely Separable Covariance Matrix Adaptation Evolution Strategy (sep-CMA-ES), against the widely adopted Adaptive Moment Estimation (Adam), applied to Stable Diffusion XL Turbo's prompt embedding vector. The evaluation of images combines the LAION Aesthetic Predictor V2 with CLIPScore into a weighted fitness function, allowing flexible trade-offs between visual appeal and adherence to prompts. Experiments on a subset of the Parti Prompts (P2) dataset showcase that sep-CMA-ES consistently yields superior improvements in aesthetic and alignment metrics in comparison to Adam. Results indicate that the evolutionary method provides efficient, gradient-free optimization for diffusion models, enhancing controllability without the need for fine-tuning. This study emphasizes the potential of evolutionary methods for embedding space exploration of deep generative models and outlines future research directions.

Optimal Inference Schedules for Masked Diffusion Models

Harvard, UW

Why we think this paper is great for you:
This paper investigates optimal inference schedules for masked diffusion models, directly addressing efficiency challenges in these powerful generative models. It provides crucial information on optimizing their performance.

Rate paper: 👍 👎 ♥ Save

Abstract
A major bottleneck of standard auto-regressive large language models is that their inference process is inherently sequential, resulting in very long and costly inference times. To circumvent this, practitioners proposed a class of language models called diffusion language models, of which the masked diffusion model (MDM) is the most successful. The MDM is able to sample tokens out-of-order and, ostensibly, many tokens at once and in parallel. However, there is very limited rigorous understanding of how much parallel sampling these models can perform without noticeable degradation in their sampling performance. Prior work of Li and Cai obtained some preliminary bounds, but these are not tight for many natural classes of distributions. In this work, we give a new, exact characterization of the expected divergence between the true distribution and the sampled distribution, for any distribution and any unmasking schedule for the sampler, showing an elegant connection to the theory of univariate function approximation. By leveraging this connection, we then attain a number of novel lower and upper bounds for this problem. While the connection to function approximation in principle gives the optimal unmasking schedule for any distribution, we show that it is in general impossible to compete with it without strong a priori knowledge of the distribution, even in seemingly benign settings. However, we also demonstrate new upper bounds and new sampling schedules in terms of well-studied information-theoretic properties of the base distribution, namely, its total correlation and dual total correlation, which show that in some natural settings, one can sample in $O(log n)$ steps without any visible loss in performance, where $n$ is the total sequence length.

GMoPE:A Prompt-Expert Mixture Framework for Graph Foundation Models

School of Computer Sciene

Why we think this paper is great for you:
This paper introduces a novel Prompt-Expert Mixture Framework, offering a new perspective on scalable and generalizable graph foundation models. Its architectural approach will be of particular interest to you.

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

Abstract
Graph Neural Networks (GNNs) have demonstrated impressive performance on task-specific benchmarks, yet their ability to generalize across diverse domains and tasks remains limited. Existing approaches often struggle with negative transfer, scalability issues, and high adaptation costs. To address these challenges, we propose GMoPE (Graph Mixture of Prompt-Experts), a novel framework that seamlessly integrates the Mixture-of-Experts (MoE) architecture with prompt-based learning for graphs. GMoPE leverages expert-specific prompt vectors and structure-aware MoE routing to enable each expert to specialize in distinct subdomains and dynamically contribute to predictions. To promote diversity and prevent expert collapse, we introduce a soft orthogonality constraint across prompt vectors, encouraging expert specialization and facilitating a more balanced expert utilization. Additionally, we adopt a prompt-only fine-tuning strategy that significantly reduces spatiotemporal complexity during transfer. We validate GMoPE through extensive experiments under various pretraining strategies and multiple downstream tasks. Results show that GMoPE consistently outperforms state-of-the-art baselines and achieves performance comparable to full parameter fine-tuning-while requiring only a fraction of the adaptation overhead. Our work provides a principled and scalable framework for advancing generalizable and efficient graph foundation models.

PLLuM: A Family of Polish Large Language Models

Wrocaw University of Sc

Why we think this paper is great for you:
This paper presents a new family of large language models, providing a comprehensive overview of their development and characteristics. It offers direct insights into the construction and capabilities of significant deep learning models.

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

Abstract
Large Language Models (LLMs) play a central role in modern artificial intelligence, yet their development has been primarily focused on English, resulting in limited support for other languages. We present PLLuM (Polish Large Language Model), the largest open-source family of foundation models tailored specifically for the Polish language. Developed by a consortium of major Polish research institutions, PLLuM addresses the need for high-quality, transparent, and culturally relevant language models beyond the English-centric commercial landscape. We describe the development process, including the construction of a new 140-billion-token Polish text corpus for pre-training, a 77k custom instructions dataset, and a 100k preference optimization dataset. A key component is a Responsible AI framework that incorporates strict data governance and a hybrid module for output correction and safety filtering. We detail the models' architecture, training procedures, and alignment techniques for both base and instruction-tuned variants, and demonstrate their utility in a downstream task within public administration. By releasing these models publicly, PLLuM aims to foster open research and strengthen sovereign AI technologies in Poland.

OmniField: Conditioned Neural Fields for Robust Multimodal Spatiotemporal Learning

UCLA, Columbia University

Why we think this paper is great for you:
This paper focuses on robust multimodal spatiotemporal learning using conditioned neural fields, addressing challenges in integrating diverse data types. You will appreciate its approach to handling complex real-world data.

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

Abstract
Multimodal spatiotemporal learning on real-world experimental data is constrained by two challenges: within-modality measurements are sparse, irregular, and noisy (QA/QC artifacts) but cross-modally correlated; the set of available modalities varies across space and time, shrinking the usable record unless models can adapt to arbitrary subsets at train and test time. We propose OmniField, a continuity-aware framework that learns a continuous neural field conditioned on available modalities and iteratively fuses cross-modal context. A multimodal crosstalk block architecture paired with iterative cross-modal refinement aligns signals prior to the decoder, enabling unified reconstruction, interpolation, forecasting, and cross-modal prediction without gridding or surrogate preprocessing. Extensive evaluations show that OmniField consistently outperforms eight strong multimodal spatiotemporal baselines. Under heavy simulated sensor noise, performance remains close to clean-input levels, highlighting robustness to corrupted measurements.

Enhancing Multimodal Recommendations with Vision-Language Models and Information-Aware Fusion

University of Technology

Why we think this paper is great for you:
This paper explores enhancing multimodal recommendations through Vision-Language Models and information-aware fusion techniques. It provides valuable methods for improving representation quality by combining different content sources.

Rate paper: 👍 👎 ♥ Save

Abstract
Recent advances in multimodal recommendation (MMR) have shown that incorporating rich content sources such as images and text can lead to significant gains representation quality. However, existing methods often rely on coarse visual features and uncontrolled fusion, leading to redundant or misaligned representations. As a result, visual encoders often fail to capture salient, item-relevant semantics, limiting their contribution in multimodal fusion. From an information-theoretic perspective, effective fusion should balance the unique, shared, and redundant information across modalities, preserving complementary cues while avoiding correlation bias. This paper presents VLIF, a vision-language and information-theoretic fusion framework that enhances multimodal recommendation through two key components. (i) A VLM-based visual enrichment module generates fine-grained, title-guided descriptions to transform product images into semantically aligned representations. (ii) An information-aware fusion module, inspired by Partial Information Decomposition (PID), disentangles redundant and synergistic signals across modalities for controlled integration. Experiments on three Amazon datasets demonstrate that VLIF consistently outperforms recent multimodal baselines and substantially strengthens the contribution of visual features.

Deep Learning Architectures

Neurosymbolic Deep Learning Semantics

City St Georges, Univer

Rate paper: 👍 👎 ♥ Save

Abstract
Artificial Intelligence (AI) is a powerful new language of science as evidenced by recent Nobel Prizes in chemistry and physics that recognized contributions to AI applied to those areas. Yet, this new language lacks semantics, which makes AI's scientific discoveries unsatisfactory at best. With the purpose of uncovering new facts but also improving our understanding of the world, AI-based science requires formalization through a framework capable of translating insight into comprehensible scientific knowledge. In this paper, we argue that logic offers an adequate framework. In particular, we use logic in a neurosymbolic framework to offer a much needed semantics for deep learning, the neural network-based technology of current AI. Deep learning and neurosymbolic AI lack a general set of conditions to ensure that desirable properties are satisfied. Instead, there is a plethora of encoding and knowledge extraction approaches designed for particular cases. To rectify this, we introduced a framework for semantic encoding, making explicit the mapping between neural networks and logic, and characterizing the common ingredients of the various existing approaches. In this paper, we describe succinctly and exemplify how logical semantics and neural networks are linked through this framework, we review some of the most prominent approaches and techniques developed for neural encoding and knowledge extraction, provide a formal definition of our framework, and discuss some of the difficulties of identifying a semantic encoding in practice in light of analogous problems in the philosophy of mind.

Bulk-boundary decomposition of neural networks

Korea Advanced Institute

Rate paper: 👍 👎 ♥ Save

Abstract
We present the bulk-boundary decomposition as a new framework for understanding the training dynamics of deep neural networks. Starting from the stochastic gradient descent formulation, we show that the Lagrangian can be reorganized into a data-independent bulk term and a data-dependent boundary term. The bulk captures the intrinsic dynamics set by network architecture and activation functions, while the boundary reflects stochastic interactions from training samples at the input and output layers. This decomposition exposes the local and homogeneous structure underlying deep networks. As a natural extension, we develop a field-theoretic formulation of neural dynamics based on this decomposition.

Deep Learning

Applying Time Series Deep Learning Models to Forecast the Growth of Perennial Ryegrass in Ireland

VISTAMILK, Dublin City Un

Rate paper: 👍 👎 ♥ Save

Abstract
Grasslands, constituting the world's second-largest terrestrial carbon sink, play a crucial role in biodiversity and the regulation of the carbon cycle. Currently, the Irish dairy sector, a significant economic contributor, grapples with challenges related to profitability and sustainability. Presently, grass growth forecasting relies on impractical mechanistic models. In response, we propose deep learning models tailored for univariate datasets, presenting cost-effective alternatives. Notably, a temporal convolutional network designed for forecasting Perennial Ryegrass growth in Cork exhibits high performance, leveraging historical grass height data with RMSE of 2.74 and MAE of 3.46. Validation across a comprehensive dataset spanning 1,757 weeks over 34 years provides insights into optimal model configurations. This study enhances our understanding of model behavior, thereby improving reliability in grass growth forecasting and contributing to the advancement of sustainable dairy farming practices.

Deep Learning Models

Generalization in Representation Models via Random Matrix Theory: Application to Recurrent Networks

Huawei Noahs Ark Lab, Hu

Rate paper: 👍 👎 ♥ Save

Abstract
We first study the generalization error of models that use a fixed feature representation (frozen intermediate layers) followed by a trainable readout layer. This setting encompasses a range of architectures, from deep random-feature models to echo-state networks (ESNs) with recurrent dynamics. Working in the high-dimensional regime, we apply Random Matrix Theory to derive a closed-form expression for the asymptotic generalization error. We then apply this analysis to recurrent representations and obtain concise formula that characterize their performance. Surprisingly, we show that a linear ESN is equivalent to ridge regression with an exponentially time-weighted (''memory'') input covariance, revealing a clear inductive bias toward recent inputs. Experiments match predictions: ESNs win in low-sample, short-memory regimes, while ridge prevails with more data or long-range dependencies. Our methodology provides a general framework for analyzing overparameterized models and offers insights into the behavior of deep learning networks.

Large Language Models

Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing

Kyungpook National Univer

Rate paper: 👍 👎 ♥ Save

Abstract
Large language models (LLMs) have achieved near-human performance across diverse reasoning tasks, yet their deployment on resource-constrained Internet-of-Things (IoT) devices remains impractical due to massive parameter footprints and memory-intensive autoregressive decoding. While split computing offers a promising solution by partitioning model execution between edge devices and cloud servers, existing approaches fail to address the unique challenges of autoregressive inference, particularly the iterative token generation process and expanding key-value (KV) cache requirements. This work introduces the first autoregressive-aware split computing framework designed explicitly for LLM deployment on edge devices. Our approach makes three key contributions. First, we develop one-point split compression (OPSC), a mixed-precision quantization scheme that prevents out-of-memory failures by strategically partitioning models into front-end and back-end segments with different precision levels. Second, we propose a two-stage intermediate compression pipeline that combines threshold splitting (TS) and token-wise adaptive bit quantization (TAB-Q) to preserve accuracy-critical activations while dramatically reducing communication overhead. Third, we formulate a unified optimization framework that jointly selects optimal split points, quantization settings, and sequence lengths to satisfy strict memory and latency constraints. Extensive evaluations across diverse LLMs and hardware platforms demonstrate superior performance compared to state-of-the-art quantization methods, including SmoothQuant, OmniQuant, and Atom. The framework achieves a 1.49 inference speedup and significant communication overhead reduction while maintaining or improving model accuracy.

Diffusion Models

Diffusion Models are Robust Pretrainers

Tel Aviv University

Rate paper: 👍 👎 ♥ Save

Abstract
Diffusion models have gained significant attention for high-fidelity image generation. Our work investigates the potential of exploiting diffusion models for adversarial robustness in image classification and object detection. Adversarial attacks challenge standard models in these tasks by perturbing inputs to force incorrect predictions. To address this issue, many approaches use training schemes for forcing the robustness of the models, which increase training costs. In this work, we study models built on top of off-the-shelf diffusion models and demonstrate their practical significance: they provide a low-cost path to robust representations, allowing lightweight heads to be trained on frozen features without full adversarial training. Our empirical evaluations on ImageNet, CIFAR-10, and PASCAL VOC show that diffusion-based classifiers and detectors achieve meaningful adversarial robustness with minimal compute. While clean and adversarial accuracies remain below state-of-the-art adversarially trained CNNs or ViTs, diffusion pretraining offers a favorable tradeoff between efficiency and robustness. This work opens a promising avenue for integrating diffusion models into resource-constrained robust deployments.

Help us improve your experience!

This project is on its early stages your feedback can be pivotal on the future of the project. Let us know what you think about this week's papers and suggestions!

Give Feedback