Deep Learning Architectures

Scaling Equilibrium Propagation to Deeper Neural Network Architectures

Indian Institute of Techn

Rate this image: 😍 👍 👎

Abstract
Equilibrium propagation has been proposed as a biologically plausible alternative to the backpropagation algorithm. The local nature of gradient computations, combined with the use of convergent RNNs to reach equilibrium states, make this approach well-suited for implementation on neuromorphic hardware. However, previous studies on equilibrium propagation have been restricted to networks containing only dense layers or relatively small architectures with a few convolutional layers followed by a final dense layer. These networks have a significant gap in accuracy compared to similarly sized feedforward networks trained with backpropagation. In this work, we introduce the Hopfield-Resnet architecture, which incorporates residual (or skip) connections in Hopfield networks with clipped $\mathrm{ReLU}$ as the activation function. The proposed architectural enhancements enable the training of networks with nearly twice the number of layers reported in prior works. For example, Hopfield-Resnet13 achieves 93.92\% accuracy on CIFAR-10, which is $\approx$3.5\% higher than the previous best result and comparable to that provided by Resnet13 trained using backpropagation.

👍 👎 ♥ Save

Mathematical Modeling and Convergence Analysis of Deep Neural Networks with Dense Layer Connectivities in Deep Learning

Nankai University, Tianjn

Abstract
In deep learning, dense layer connectivity has become a key design principle in deep neural networks (DNNs), enabling efficient information flow and strong performance across a range of applications. In this work, we model densely connected DNNs mathematically and analyze their learning problems in the deep-layer limit. For a broad applicability, we present our analysis in a framework setting of DNNs with densely connected layers and general non-local feature transformations (with local feature transformations as special cases) within layers, which is called dense non-local (DNL) framework and includes standard DenseNets and variants as special examples. In this formulation, the densely connected networks are modeled as nonlinear integral equations, in contrast to the ordinary differential equation viewpoint commonly adopted in prior works. We study the associated training problems from an optimal control perspective and prove convergence results from the network learning problem to its continuous-time counterpart. In particular, we show the convergence of optimal values and the subsequence convergence of minimizers, using a piecewise linear extension and $\Gamma$-convergence analysis. Our results provide a mathematical foundation for understanding densely connected DNNs and further suggest that such architectures can offer stability of training deep models.

AI Insights

Forward‑backward‑splitting networks converge in the deep‑layer limit, revealing new stability insights.
Learned primal‑dual schemes are modeled as dynamical systems with a linear operator K, enabling Lyapunov‑style analysis.
A piecewise‑linear extension links discrete layers to continuous time, yielding a Γ‑convergence proof of optimal values.
The framework includes DenseNet variants and non‑local feature transforms, suggesting unexplored hybrid architectures.
Brunner’s “Volterra Integral Equations” and Braides’ “Γ‑convergence for Beginners” are key resources for the theory.
The work builds on Haber, Lu, and Ruthotto’s PDE‑inspired DNN research, situating it in physics‑informed deep learning.
Though mathematically dense, the paper encourages experimenting with forward‑backward‑splitting layers for future stability gains.

Deep Learning

👍 👎 ♥ Save

A Generalized Information Bottleneck Theory of Deep Learning

University College London

Abstract
The Information Bottleneck (IB) principle offers a compelling theoretical framework to understand how neural networks (NNs) learn. However, its practical utility has been constrained by unresolved theoretical ambiguities and significant challenges in accurate estimation. In this paper, we present a \textit{Generalized Information Bottleneck (GIB)} framework that reformulates the original IB principle through the lens of synergy, i.e., the information obtainable only through joint processing of features. We provide theoretical and empirical evidence demonstrating that synergistic functions achieve superior generalization compared to their non-synergistic counterparts. Building on these foundations we re-formulate the IB using a computable definition of synergy based on the average interaction information (II) of each feature with those remaining. We demonstrate that the original IB objective is upper bounded by our GIB in the case of perfect estimation, ensuring compatibility with existing IB theory while addressing its limitations. Our experimental results demonstrate that GIB consistently exhibits compression phases across a wide range of architectures (including those with \textit{ReLU} activations where the standard IB fails), while yielding interpretable dynamics in both CNNs and Transformers and aligning more closely with our understanding of adversarial robustness.

Diffusion Models

👍 👎 ♥ Save

Learn to Guide Your Diffusion Model

Google DeepMind, Gatsby U

Abstract
Classifier-free guidance (CFG) is a widely used technique for improving the perceptual quality of samples from conditional diffusion models. It operates by linearly combining conditional and unconditional score estimates using a guidance weight $\omega$. While a large, static weight can markedly improve visual results, this often comes at the cost of poorer distributional alignment. In order to better approximate the target conditional distribution, we instead learn guidance weights $\omega_{c,(s,t)}$, which are continuous functions of the conditioning $c$, the time $t$ from which we denoise, and the time $s$ towards which we denoise. We achieve this by minimizing the distributional mismatch between noised samples from the true conditional distribution and samples from the guided diffusion process. We extend our framework to reward guided sampling, enabling the model to target distributions tilted by a reward function $R(x_0,c)$, defined on clean data and a conditioning $c$. We demonstrate the effectiveness of our methodology on low-dimensional toy examples and high-dimensional image settings, where we observe improvements in Fr\'echet inception distance (FID) for image generation. In text-to-image applications, we observe that employing a reward function given by the CLIP score leads to guidance weights that improve image-prompt alignment.

👍 👎 ♥ Save

How Diffusion Models Memorize

Yonsei University, Korea

Abstract
Despite their success in image generation, diffusion models can memorize training data, raising serious privacy and copyright concerns. Although prior work has sought to characterize, detect, and mitigate memorization, the fundamental question of why and how it occurs remains unresolved. In this paper, we revisit the diffusion and denoising process and analyze latent space dynamics to address the question: "How do diffusion models memorize?" We show that memorization is driven by the overestimation of training samples during early denoising, which reduces diversity, collapses denoising trajectories, and accelerates convergence toward the memorized image. Specifically: (i) memorization cannot be explained by overfitting alone, as training loss is larger under memorization due to classifier-free guidance amplifying predictions and inducing overestimation; (ii) memorized prompts inject training images into noise predictions, forcing latent trajectories to converge and steering denoising toward their paired samples; and (iii) a decomposition of intermediate latents reveals how initial randomness is quickly suppressed and replaced by memorized content, with deviations from the theoretical denoising schedule correlating almost perfectly with memorization severity. Together, these results identify early overestimation as the central underlying mechanism of memorization in diffusion models.

Multimodal Learning

👍 👎 ♥ Save

Guiding Mixture-of-Experts with Temporal Multimodal Interactions

Johns Hopkins University

Rate this image: 😍 👍 👎

Abstract
Mixture-of-Experts (MoE) architectures have become pivotal for large-scale multimodal models. However, their routing mechanisms typically overlook the informative, time-varying interaction dynamics between modalities. This limitation hinders expert specialization, as the model cannot explicitly leverage intrinsic modality relationships for effective reasoning. To address this, we propose a novel framework that guides MoE routing using quantified temporal interaction. A multimodal interaction-aware router learns to dispatch tokens to experts based on the nature of their interactions. This dynamic routing encourages experts to acquire generalizable interaction-processing skills rather than merely learning task-specific features. Our framework builds on a new formulation of temporal multimodal interaction dynamics, which are used to guide expert routing. We first demonstrate that these temporal multimodal interactions reveal meaningful patterns across applications, and then show how they can be leveraged to improve both the design and performance of MoE-based models. Comprehensive experiments on challenging multimodal benchmarks validate our approach, demonstrating both enhanced performance and improved interpretability.

AI Insights

The framework introduces a multi‑scale batch estimator that quantifies redundancy, uniqueness, and synergy across modalities.
A discriminator is first trained to separate real from synthetic modality pairs before alignment.
The alignment module learns an optimal distribution aligning heterogeneous modalities for downstream analysis.
RUS components are derived from mutual‑information estimates computed by the discriminator and alignment.
Each modality is encoded by a transformer‑based encoder that preserves temporal dynamics.
Attention‑based expert networks then compute cross‑modal synergy, guided by the RUS‑aware router.
The method assumes IID modalities, which may limit performance on highly entangled datasets.

👍 👎 ♥ Save

Semantic Compression via Multimodal Representation Learning

Sapienza University of Rm

Abstract
Multimodal representation learning produces high-dimensional embeddings that align diverse modalities in a shared latent space. While this enables strong generalization, it also introduces scalability challenges, both in terms of storage and downstream processing. A key open problem is how to achieve semantic compression, reducing the memory footprint of multimodal embeddings while preserving their ability to represent shared semantic content across modalities. In this paper, we prove a strong connection between reducing the modality gap, which is the residual separation of embeddings from different modalities, and the feasibility of post-training semantic compression. When the gap is sufficiently reduced, embeddings from different modalities but expressing the same semantics share a common portion of the space. Therefore, their centroid is a faithful representation of such a semantic concept. This enables replacing multiple embeddings with a single centroid, yielding significant memory savings. We propose a novel approach for semantic compression grounded on the latter intuition, operating directly on pretrained encoders. We demonstrate its effectiveness across diverse large-scale multimodal downstream tasks. Our results highlight that modality alignment is a key enabler for semantic compression, showing that the proposed approach achieves significant compression without sacrificing performance.

Deep Learning Optimization

👍 👎 ♥ Save

Machine Learning Algorithms for Improving Black Box Optimization Solvers

Fakult at f ur Mathemat

Abstract
Black-box optimization (BBO) addresses problems where objectives are accessible only through costly queries without gradients or explicit structure. Classical derivative-free methods -- line search, direct search, and model-based solvers such as Bayesian optimization -- form the backbone of BBO, yet often struggle in high-dimensional, noisy, or mixed-integer settings. Recent advances use machine learning (ML) and reinforcement learning (RL) to enhance BBO: ML provides expressive surrogates, adaptive updates, meta-learning portfolios, and generative models, while RL enables dynamic operator configuration, robustness, and meta-optimization across tasks. This paper surveys these developments, covering representative algorithms such as NNs with the modular model-based optimization framework (mlrMBO), zeroth-order adaptive momentum methods (ZO-AdaMM), automated BBO (ABBO), distributed block-wise optimization (DiBB), partition-based Bayesian optimization (SPBOpt), the transformer-based optimizer (B2Opt), diffusion-model-based BBO, surrogate-assisted RL for differential evolution (Surr-RLDE), robust BBO (RBO), coordinate-ascent model-based optimization with relative entropy (CAS-MORE), log-barrier stochastic gradient descent (LB-SGD), policy improvement with black-box (PIBB), and offline Q-learning with Mamba backbones (Q-Mamba). We also review benchmark efforts such as the NeurIPS 2020 BBO Challenge and the MetaBox framework. Overall, we highlight how ML and RL transform classical inexact solvers into more scalable, robust, and adaptive frameworks for real-world optimization.

👍 👎 ♥ Save

Benchmarking Deep Learning Convolutions on Energy-constrained CPUs

Sorbonne Universit, CNRS

Abstract
This work evaluates state-of-the-art convolution algorithms for CPU-based deep learning inference. While most prior studies focus on GPUs or NPUs, CPU implementations remain relatively underoptimized. We benchmark direct, GEMM-based, and Winograd convolutions across modern CPUs from ARM __ , Intel __ , AMD __ , Apple __ , and Nvidia __ , considering both latency and energy efficiency. Our results highlight the key architectural factors that govern CPU efficiency for convolution operations, providing practical guidance for energy-aware embedded deployment. As a main results of this work, the Nvidia __ AGX Orin combined with the GEMM algorithm achieves the best trade-off between inference latency and energy consumption.

Large Language Models

👍 👎 ♥ Save

Information Design With Large Language Models

Harvard University,Google

Rate this image: 😍 👍 👎

Abstract
Information design is typically studied through the lens of Bayesian signaling, where signals shape beliefs based on their correlation with the true state of the world. However, Behavioral Economics and Psychology emphasize that human decision-making is more complex and can depend on how information is framed. This paper formalizes a language-based notion of framing and bridges this to the popular Bayesian-persuasion model. We model framing as a possibly non-Bayesian, linguistic way to influence a receiver's belief, while a signaling (or recommendation) scheme can further refine this belief in the classic Bayesian way. A key challenge in systematically optimizing in this framework is the vast space of possible framings and the difficulty of predicting their effects on receivers. Based on growing evidence that Large Language Models (LLMs) can effectively serve as proxies for human behavior, we formulate a theoretical model based on access to a framing-to-belief oracle. This model then enables us to precisely characterize when solely optimizing framing or jointly optimizing framing and signaling is tractable. We substantiate our theoretical analysis with an empirical algorithm that leverages LLMs to (1) approximate the framing-to-belief oracle, and (2) optimize over language space using a hill-climbing method. We apply this to two marketing-inspired case studies and validate the effectiveness through analytical and human evaluation.

👍 👎 ♥ Save

Large language models for behavioral modeling: A literature survey

Mid Sweden University, r

Abstract
In recent years, large language models (LLMs) have been extensively utilized for behavioral modeling, for example, to automatically generate sequence diagrams. However, no overview of this work has been published yet. Such an overview will help identify future research directions and inform practitioners and educators about the effectiveness of LLMs in assisting behavioral modeling. This study aims to provide an overview of the existing research on the use of LLMs for behavioral modeling, particularly focusing on use case and sequence diagrams. Through a term-based search, we filtered and identified 14 relevant primary studies. Our analysis of the selected primary studies reveals that LLMs have demonstrated promising results in automatically generating use case and sequence diagrams. In addition, we found that most of the current literature lacks expert-based evaluations and has mainly used GPT-based models. Therefore, future work should evaluate a broader range of LLMs for behavioral modeling and involve domain experts to evaluate the output of LLMs.

Mixture of Experts

👍 👎 ♥ Save

Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel

Morgan Stanley, Stanford

Abstract
Mixture-of-Experts (MoE) has become a cornerstone in recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on $\mathrm{Softmax}$ as the router score function to aggregate expert output, a designed choice that has persisted from the earliest MoE models to modern LLMs, and is now widely regarded as standard practice. However, the necessity of using $\mathrm{Softmax}$ to project router weights into a probability simplex remains an unchallenged assumption rather than a principled design choice. In this work, we first revisit the classical Nadaraya-Watson regression and observe that MoE shares the same mathematical formulation as Nadaraya-Watson regression. Furthermore, we show that both feed-forward neural network (FFN) and MoE can be interpreted as a special case of Nadaraya-Watson regression, where the kernel function corresponds to the input neurons of the output layer. Motivated by these insights, we propose the \textbf{zero-additional-cost} Kernel Inspired Router with Normalization (KERN), an FFN-style router function, as an alternative to $\mathrm{Softmax}$. We demonstrate that this router generalizes both $\mathrm{Sigmoid}$- and $\mathrm{Softmax}$-based routers. \textbf{Based on empirical observations and established practices in FFN implementation, we recommend the use of $\mathrm{ReLU}$ activation and $\ell_2$-normalization in $\mathrm{KERN}$ router function.} Comprehensive experiments in MoE and LLM validate the effectiveness of the proposed FFN-style router function \methodNorm.

AI Insights

KERN outperforms Sigmoid and Tanh at every milestone, with the lowest variance among routers.
NormRouter normalizes by L2 norm, adds epsilon, and applies ReLU, costing zero extra.
Monte Carlo scale initialization gives KERN a more robust start than deterministic methods.
Careful activation and normalization choices cut seed‑to‑seed variance dramatically in MoE.
KERN unifies Sigmoid‑ and Softmax‑based routers under a single kernel‑inspired framework.
PyTorch tutorial and the KERN paper are key resources for hands‑on implementation.
Deep Learning book explains why L2‑normalization and ReLU synergize in KERN.

Deep Learning Models

👍 👎 ♥ Save

Batch-CAM: Introduction to better reasoning in convolutional deep learning models

ISTI, CNR, Via Giuseppe M

Abstract
Understanding the inner workings of deep learning models is crucial for advancing artificial intelligence, particularly in high-stakes fields such as healthcare, where accurate explanations are as vital as precision. This paper introduces Batch-CAM, a novel training paradigm that fuses a batch implementation of the Grad-CAM algorithm with a prototypical reconstruction loss. This combination guides the model to focus on salient image features, thereby enhancing its performance across classification tasks. Our results demonstrate that Batch-CAM achieves a simultaneous improvement in accuracy and image reconstruction quality while reducing training and inference times. By ensuring models learn from evidence-relevant information,this approach makes a relevant contribution to building more transparent, explainable, and trustworthy AI systems.

AI Insights

Batch‑CAM introduces a loss that penalizes misleading Grad‑CAM maps, forcing the network to align predictions with true salient regions.
The combined cross‑entropy plus explanation loss yields a 2–3% boost on MNIST and ImageNet‑subset benchmarks while cutting inference time by ~15%.
Batch‑CAM’s prototypical reconstruction term encourages feature maps to reconstruct the input, revealing a dual role as both classifier and auto‑encoder.
Despite its gains, the method’s extra Grad‑CAM forward passes can inflate GPU memory usage, limiting scalability to very deep models.
For deeper dives, see the original Grad‑CAM paper and the open‑source explainerdashboard repository for interactive visualizations.
Key definitions: Explainability = clear, accurate reasoning for predictions; Interpretability = human‑understandable internal mechanics.

Help us improve your experience!