Deep Learning Architectures

Universal Neural Architecture Space: Covering ConvNets, Transformers and Everything in Between

Czech Technical University

Abstract
We introduce Universal Neural Architecture Space (UniNAS), a generic search space for neural architecture search (NAS) which unifies convolutional networks, transformers, and their hybrid architectures under a single, flexible framework. Our approach enables the discovery of novel architectures as well as the analysis of existing architectures in a common framework. We also propose a new search algorithm that allows traversing the proposed search space, and demonstrate that the space contains interesting architectures which, when trained with an identical setup, outperform state-of-the-art hand-crafted architectures. Finally, a unified toolkit including a standardized training and evaluation protocol is introduced to foster reproducibility and enable fair comparison in NAS research. Overall, this work opens a pathway towards systematically exploring the full spectrum of neural architectures with a unified graph-based NAS perspective.

AI Insights

Zero‑shot NAS can rank architectures without any training, dramatically cutting search time.
Differentiable NAS turns the search into a gradient‑based optimization, enabling continuous architecture spaces.
Performance‑prediction models estimate accuracy from proxy metrics, sidestepping costly training loops.
The biggest bottleneck remains the sheer volume of data and compute required to validate candidate designs.
A deeper theoretical understanding of why certain architectural motifs thrive could unlock more efficient search heuristics.
Robust NAS must scale to massive datasets and complex models while maintaining reproducibility across experiments.
The proposed graph‑based framework unifies ConvNets, Transformers, and hybrids, opening a playground for hybrid‑architecture discovery.

Optimally Deep Networks -- Adapting Model Depth to Datasets for Superior Efficiency

Abstract
Deep neural networks (DNNs) have delivered strong performance across various tasks. However, this success often comes at the cost of unnecessarily large model sizes, high computational demands, and substantial memory footprints. Typically, powerful architectures are trained at full depth, but not all datasets or tasks require such high model capacity. Training very deep architectures on relatively low-complexity datasets frequently leads to wasted computation, unnecessary energy consumption, and excessive memory usage, which in turn makes deployment of models on resource-constrained devices impractical. To address this problem, we introduce Optimally Deep Networks (ODNs), which provide a balance between model depth and task complexity. Specifically, we propose a NAS-like training strategy called progressive depth expansion, which begins by training deep networks at shallower depths and incrementally increases their depth as the earlier blocks converge, continuing this process until the target accuracy is reached. ODNs use only the optimal depth for the given datasets, removing redundant layers. This cuts down future training and inference costs, lowers the memory footprint, enhances computational efficiency, and facilitates deployment on edge devices. Empirical results show that the optimal depths of ResNet-18 and ResNet-34 for MNIST and SVHN achieve up to 98.64% and 96.44% reduction in memory footprint, while maintaining competitive accuracies of 99.31% and 96.08%, respectively.
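The control flow of progressive depth expansion can be sketched in a few lines. This is only an illustration of the loop described in the abstract, not the authors' implementation: the function name, the `train_and_evaluate` callback, and the toy accuracy curve are all hypothetical, and the real method also waits for earlier blocks to converge before growing.

```python
from typing import Callable, Tuple


def progressive_depth_expansion(
    max_blocks: int,
    target_accuracy: float,
    train_and_evaluate: Callable[[int], float],
) -> Tuple[int, float]:
    """Grow the network one block at a time until the target accuracy is met.

    `train_and_evaluate(depth)` is a hypothetical callback that trains the
    first `depth` blocks and returns validation accuracy in [0, 1].
    """
    accuracy = 0.0
    for depth in range(1, max_blocks + 1):
        accuracy = train_and_evaluate(depth)
        if accuracy >= target_accuracy:
            # Shallowest depth that meets the target: deeper blocks are dropped.
            return depth, accuracy
    # Target never reached: fall back to the full-depth network.
    return max_blocks, accuracy


def toy_train_and_evaluate(depth: int) -> float:
    # Made-up saturating accuracy curve, standing in for real training.
    return 1.0 - 0.5 ** depth


chosen_depth, reached_accuracy = progressive_depth_expansion(
    max_blocks=8, target_accuracy=0.9, train_and_evaluate=toy_train_and_evaluate
)
print(chosen_depth, reached_accuracy)
```

With the toy curve, the loop stops at the first depth whose accuracy meets the target and never trains the remaining blocks, which is where the memory savings reported in the abstract would come from.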

Deep Learning

Rethinking deep learning: linear regression remains a key benchmark in predicting terrestrial water storage

Abstract
Recent advances in machine learning such as Long Short-Term Memory (LSTM) models and Transformers have been widely adopted in hydrological applications, demonstrating impressive performance among deep learning models and outperforming physical models in various tasks. However, their superiority in predicting land surface states such as terrestrial water storage (TWS), which are governed by many factors including natural variability and human-driven modifications, remains unclear. Here, using the open-access, globally representative HydroGlobe dataset - comprising a baseline version derived solely from a land surface model simulation and an advanced version incorporating multi-source remote sensing data assimilation - we show that linear regression is a robust benchmark, outperforming the more complex LSTM and Temporal Fusion Transformer for TWS prediction. Our findings highlight the importance of including traditional statistical models as benchmarks when developing and evaluating deep learning models. Additionally, we emphasize the critical need to establish globally representative benchmark datasets that capture the combined impact of natural variability and human interventions.

An in-depth look at approximation via deep and narrow neural networks

University of Hamburg

Abstract
In 2017, Hanin and Sellke showed that the class of arbitrarily deep, real-valued, feed-forward and ReLU-activated networks of width w forms a dense subset of the space of continuous functions on R^n, with respect to the topology of uniform convergence on compact sets, if and only if w>n holds. To show the necessity, a concrete counterexample function f:R^n->R was used. In this note we actually approximate this very f by neural networks in the two cases w=n and w=n+1 around the aforementioned threshold. We study how the approximation quality behaves if we vary the depth and what effect (spoiler alert: dying neurons) causes that behavior.

AI Insights

Depth lowers error until dying ReLU forces a constant output, even when width equals input dimension.
With width n+1, deeper nets keep improving, consistent with the w>n threshold for density.
Minimal‑width ReLU nets can approximate any continuous function, confirming Hanin & Sellke’s theorem.
The constant N0≡1/8 is the best uniform approximator for the counterexample, achieving error 1/8 for all depths.
Experiments show the depth‑benefit plateau occurs earlier in higher dimensions due to dying neurons.
Beise et al.’s decision‑region analysis explains constant outputs in narrow deep nets.
Bresler & Nagaraj’s sharp representation theorems give a depth‑dependence framework matching the results.
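The dying-neuron effect mentioned above can be reproduced with a tiny hand-built network. The sketch below is an illustration only: the weights are made up, not taken from the paper. It uses width w = n = 2 and shows that once every neuron in one hidden layer is inactive on the whole input domain, all later layers see a constant, so the network output is constant.

```python
def relu_layer(x, weights, biases):
    """One dense ReLU layer: max(0, W x + b), using plain Python lists."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def narrow_net(x, hidden_layers, out_weights, out_bias):
    """Feed-forward ReLU network with a scalar affine output layer."""
    h = x
    for weights, biases in hidden_layers:
        h = relu_layer(h, weights, biases)
    return sum(w * v for w, v in zip(out_weights, h)) + out_bias


# Hypothetical weights for inputs in [0, 1]^2 (n = 2, width 2).
hidden_layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),     # passes the input through
    ([[1.0, 1.0], [1.0, -1.0]], [-3.0, -3.0]),  # both pre-activations < 0 on [0, 1]^2
    ([[2.0, 0.0], [0.0, 2.0]], [0.5, 0.25]),    # only ever sees the zero vector
]
out_weights, out_bias = [1.0, 1.0], 0.0

grid = [i / 4 for i in range(5)]
outputs = {
    narrow_net([a, b], hidden_layers, out_weights, out_bias)
    for a in grid
    for b in grid
}
print(outputs)  # a single value: the network has collapsed to a constant
```

Adding depth after the dead layer cannot recover the lost information, which is one way to read the observation that extra depth stops helping at width n.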

Diffusion Models

What is the most optimal diffusion?

arXiv

Abstract
What is the fastest possible "diffusion"? A trivial answer would be "a process that converts a Dirac delta-function into a uniform distribution infinitely fast". Below, we consider a more reasonable formulation: a process that maximizes differential entropy of a probability density function (pdf) $f(\vec{x}, t)$ at every time $t$, under certain restrictions. Specifically, we focus on a case when the rate of the Kullback-Leibler divergence $D_{\text{KL}}$ is fixed. If $\Delta(\vec{x}, t, d{t}) = \frac{\partial f}{ \partial t} d{t}$ is the pdf change at a time step $d{t}$, we maximize the differential entropy $H[f + \Delta]$ under the restriction $D_{\text{KL}}(f + \Delta || f) = A^2 d{t}^2$, $A = \text{const} > 0$. It leads to the following equation: $\frac{\partial f}{ \partial t} = - \kappa f (\ln{f} - \int f \ln{f} d{\vec{x}})$, with $\kappa = \frac{A}{\sqrt{ \int f \ln^2{f} d{\vec{x}} - \left( \int f \ln{f} d{\vec{x}} \right)^2 } }$. Notably, this is a non-local equation, so the process is different from the It\^{o} diffusion and a corresponding Fokker-Planck equation. We show that the normal and exponential distributions are solutions to this equation, on $(-\infty; \infty)$ and $[0; \infty)$, respectively, both with $\text{variance} \sim e^{2 A t}$, i.e. diffusion is highly anomalous. We numerically demonstrate for sigmoid-like functions on a segment that the entropy change rate $\frac{d H}{d t}$ produced by such an optimal "diffusion" is, as expected, higher than produced by the "classical" diffusion.

AI Insights

The derived PDE is non‑local, involving global integrals of f ln f, distinguishing it from Itô/Fokker‑Planck dynamics.
Maximizing differential entropy under a fixed KL‑divergence rate yields the evolution equation ∂f/∂t = –κ f(ln f – ⟨ln f⟩).
Normal and exponential laws emerge as exact solutions, with variance growing as e^{2At}, an extreme form of anomalous diffusion.
Variational calculus on the KL functional provides the bridge between entropy production and information loss.
Numerical tests on sigmoid initial data confirm that the optimal diffusion produces a higher dH/dt than classical Brownian motion.
The framework invites applications in physics, engineering, and machine learning where rapid entropy maximization is desired.
Key references include “Variational Methods in Nonlinear Differential Equations” and foundational papers on KL divergence and entropy.
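The right-hand side of the evolution equation from the abstract can be evaluated numerically on a grid. The sketch below is an assumption-laden illustration, not the authors' code: it uses a simple Riemann sum, a unit-variance normal density as test input, and the hypothetical names `entropy_max_rhs` and `entropy_rate`. It checks two properties that follow directly from the formula: the update conserves total probability, and the entropy rate dH/dt is positive.

```python
import math


def entropy_max_rhs(f, dx, A=1.0):
    """Evaluate -kappa * f * (ln f - int f ln f dx) on a uniform grid.

    `f` must be strictly positive and normalized on the grid; kappa follows
    the expression given in the abstract.
    """
    logs = [math.log(v) for v in f]
    mean_log = sum(v * l for v, l in zip(f, logs)) * dx         # int f ln f dx
    mean_log_sq = sum(v * l * l for v, l in zip(f, logs)) * dx  # int f ln^2 f dx
    kappa = A / math.sqrt(mean_log_sq - mean_log ** 2)
    rhs = [-kappa * v * (l - mean_log) for v, l in zip(f, logs)]
    return rhs, kappa


def entropy_rate(f, rhs, dx):
    """dH/dt = -int (df/dt) (ln f + 1) dx for H = -int f ln f dx."""
    return -sum(r * (math.log(v) + 1.0) for v, r in zip(f, rhs)) * dx


dx = 0.01
xs = [-10.0 + i * dx for i in range(2001)]
normal_pdf = [math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi) for x in xs]

rhs, kappa = entropy_max_rhs(normal_pdf, dx, A=1.0)
mass_change = sum(rhs) * dx             # ~0: total probability is conserved
dH_dt = entropy_rate(normal_pdf, rhs, dx)

# For a unit-variance normal density Var(ln f) = 1/2, so kappa = sqrt(2) * A here.
print(kappa, mass_change, dH_dt)
```

Because the bracket subtracts the global integral of f ln f, every grid point depends on the whole density, which is the non-locality noted in the abstract.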

Thermodynamic Performance Limits for Score-Based Diffusion Models

Case Western Reserve Unv

Abstract
We establish a fundamental connection between score-based diffusion models and non-equilibrium thermodynamics by deriving performance limits based on entropy rates. Our main theoretical contribution is a lower bound on the negative log-likelihood of the data that relates model performance to entropy rates of diffusion processes. We numerically validate this bound on a synthetic dataset and investigate its tightness. By building a bridge to entropy rates - system, intrinsic, and exchange entropy - we provide new insights into the thermodynamic operation of these models, drawing parallels to Maxwell's demon and implications for thermodynamic computing hardware. Our framework connects generative modeling performance to fundamental physical principles through stochastic thermodynamics.

AI Insights

NLL is split into equilibrium entropy, dataset entropy, score‑norm, and squared‑difference, with closed‑form for some terms.
Three error sources—Monte‑Carlo noise, quadrature error, and goodness‑of‑fit bias—are quantified to improve precision.
Idiff and cosine similarity diagnostics flag score mismatches, revealing over‑ or under‑fitting early.
Synthetic experiments show the entropy‑rate bound tight when the model captures true diffusion dynamics.
Assuming known true scores limits practicality, motivating research on score‑estimation uncertainty.
The Maxwell’s demon analogy suggests thermodynamic hardware could run diffusion models with lower energy footprints.
For deeper insight, read Ho et al.’s “Diffusion‑Based Generative Models” and Gupta et al.’s “Improved Techniques for Training Score‑Based Models.”

Multimodal Learning

ContextNav: Towards Agentic Multimodal In-Context Learning

The University of Queensl

Abstract
Recent advances demonstrate that multimodal large language models (MLLMs) exhibit strong multimodal in-context learning (ICL) capabilities, enabling them to adapt to novel vision-language tasks from a few contextual examples. However, existing ICL approaches face challenges in reconciling scalability with robustness across diverse tasks and noisy contextual examples: manually selecting examples produces clean contexts but is labor-intensive and task-specific, while similarity-based retrieval improves scalability but could introduce irrelevant or structurally inconsistent samples that degrade ICL performance. To address these limitations, we propose ContextNav, the first agentic framework that integrates the scalability of automated retrieval with the quality and adaptiveness of human-like curation, enabling noise-robust and dynamically optimized contextualization for multimodal ICL. ContextNav unifies context management and noise-robust contextualization within a closed-loop workflow driven by graph-based orchestration. Specifically, it builds a resource-aware multimodal embedding pipeline, maintains a retrievable vector database, and applies agentic retrieval and structural alignment to construct noise-resilient contexts. An Operational Grammar Graph (OGG) further supports adaptive workflow planning and optimization, enabling the agent to refine its operational strategies based on downstream ICL feedback. Experimental results demonstrate that ContextNav achieves state-of-the-art performance across various datasets, underscoring the promise of agentic workflows for advancing scalable and robust contextualization in multimodal ICL.

AI Insights

Adding richer multimodal context boosts accuracy, confirming sensitivity to context volume.
The system still falters on stylized text in complex scenes, exposing OCR robustness gaps.
Font, color, and background variations dramatically alter recognition rates, demanding diverse training.
Higher resolution consistently improves text extraction, linking pixel fidelity to confidence.
Future work should integrate CRNN or Transformer‑based OCR to bridge current gaps.
Key resources: 'Deep Learning' by Goodfellow et al. and 'Computer Vision: Algorithms and Applications' by Szeliski.
Contextualization Examples are curated multimodal snippets guiding the agent’s retrieval strategy.

Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models

MIT CSAIL, TU Munich

Abstract
Traditional multimodal learners find unified representations for tasks like visual question answering, but rely heavily on paired datasets. However, an overlooked yet potentially powerful question is: can one leverage auxiliary unpaired multimodal data to directly enhance representation learning in a target modality? We introduce UML: Unpaired Multimodal Learner, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities while sharing parameters across them. This design exploits the assumption that different modalities are projections of a shared underlying reality, allowing the model to benefit from cross-modal structure without requiring explicit pairs. Theoretically, under linear data-generating assumptions, we show that unpaired auxiliary data can yield representations strictly more informative about the data-generating process than unimodal training. Empirically, we show that using unpaired data from auxiliary modalities -- such as text, audio, or images -- consistently improves downstream performance across diverse unimodal targets such as image and audio. Our project page: https://unpaired-multimodal.github.io/

AI Insights

Unpaired text sharpens decision boundaries in few‑shot image classification, boosting accuracy.
The model detects sarcasm by measuring agreement between modalities, not by content alone.
Confidence scores rise when auxiliary modalities are incorporated, improving calibration.
Multimodal Neurons learn shared embeddings across vision, language, and audio in a single network.
Functional Margin quantifies how far samples lie from the decision boundary, guiding training.
Silhouette Score is used to assess cluster separability after multimodal fusion.
Recommended reading: “Unsupervised Multimodal Alignment for Few‑Shot Classification (2022)” and “Multimodal Co‑Training for Unpaired Data (2020)”.

Deep Learning Optimization

Computing frustration and near-monotonicity in deep neural networks

Linköping University, SE

Abstract
For the signed graph associated to a deep neural network, one can compute the frustration level, i.e., test how close or distant the graph is to structural balance. For all the pretrained deep convolutional neural networks we consider, we find that the frustration is always less than expected from null models. From a statistical physics point of view, and in particular in reference to an Ising spin glass model, the reduced frustration indicates that the amount of disorder encoded in the network is less than in the null models. From a functional point of view, low frustration (i.e., proximity to structural balance) means that the function representing the network behaves near-monotonically, i.e., more similarly to a monotone function than in the null models. Evidence of near-monotonic behavior along the partial order determined by frustration is observed for all networks we consider. This confirms that the class of deep convolutional neural networks tends to have a more ordered behavior than expected from null models, and suggests a novel form of implicit regularization.
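For intuition, the frustration of a small signed graph can be computed by brute force. The sketch below assumes the common unweighted definition (the minimum number of edges that violate structural balance over all assignments of +1/-1 labels to nodes); the paper may use a weighted variant and certainly needs a scalable method for real networks. The function name and example graphs are hypothetical.

```python
from itertools import product


def frustration(n_nodes, signed_edges):
    """Minimum number of frustrated edges over all +1/-1 node labelings.

    `signed_edges` is a list of (i, j, sign) with sign in {+1, -1}. An edge is
    frustrated when spins[i] * spins[j] * sign < 0. Exhaustive search, so this
    is only usable for a handful of nodes.
    """
    best = len(signed_edges)
    for tail in product((1, -1), repeat=n_nodes - 1):
        spins = (1,) + tail  # fix node 0: flipping every label changes nothing
        violated = sum(
            1 for i, j, sign in signed_edges if spins[i] * spins[j] * sign < 0
        )
        best = min(best, violated)
    return best


# A triangle with three negative edges cannot be balanced.
all_negative_triangle = [(0, 1, -1), (1, 2, -1), (0, 2, -1)]
# Two negative edges and one positive edge form a balanced triangle.
balanced_triangle = [(0, 1, -1), (1, 2, -1), (0, 2, 1)]

print(frustration(3, all_negative_triangle))
print(frustration(3, balanced_triangle))
```

Comparing this number against the same quantity on sign-shuffled null models is the kind of test the abstract describes.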

Large Language Models

Large Language Model Sourcing: A Survey

Abstract
The rapid advancement of large language models (LLMs) has revolutionized artificial intelligence, shifting from supporting objective tasks (e.g., recognition) to empowering subjective decision-making (e.g., planning, decision). This marks the dawn of general and powerful AI, with applications spanning a wide range of fields, including programming, education, healthcare, finance, and law. However, their deployment introduces multifaceted risks. Due to the black-box nature of LLMs and the human-like quality of their generated content, issues such as hallucinations, bias, unfairness, and copyright infringement become particularly significant. In this context, sourcing information from multiple perspectives is essential. This survey presents a systematic investigation into provenance tracking for content generated by LLMs, organized around four interrelated dimensions that together capture both model- and data-centric perspectives. From the model perspective, Model Sourcing treats the model as a whole, aiming to distinguish content generated by specific LLMs from content authored by humans. Model Structure Sourcing delves into the internal generative mechanisms, analyzing architectural components that shape the model's outputs. From the data perspective, Training Data Sourcing focuses on internal attribution, tracing the origins of generated content back to the model's training data. In contrast, External Data Sourcing emphasizes external validation, identifying external information used to support or influence the model's responses. We also propose a dual-paradigm taxonomy that classifies existing sourcing methods into prior-based (proactive traceability embedding) and posterior-based (retrospective inference) approaches. Traceability across these dimensions enhances the transparency, accountability, and trustworthiness of LLM deployment in real-world applications.

Embodiment in multimodal large language models

Abstract
Multimodal Large Language Models (MLLMs) have demonstrated extraordinary progress in bridging textual and visual inputs. However, MLLMs still face challenges in situated physical and social interactions in sensorially rich, multimodal, real-world settings where the embodied experience of the living organism is essential. We posit that the next frontiers for MLLM development require incorporating both internal and external embodiment -- modeling not only external interactions with the world, but also internal states and drives. Here, we describe mechanisms of internal and external embodiment in humans and relate these to current advances in MLLMs, which are in the early stages of aligning to human representations. Our dual-embodied framework proposes to model interactions between these forms of embodiment in MLLMs to bridge the gap between multimodal data and world experience.

Mixture of Experts

Mixture of Neuron Experts

Tsinghua Shenzhen Interna

Abstract
In this work, we first explore whether the parameters activated by the MoE layer remain highly sparse at inference. We perform a sparsification study on several representative MoE models. For each expert, we rank parameters by the magnitude of their activations from the gate projection and progressively prune the activated subset. Pruning up to 60% of parameters within that subset causes only negligible task-performance degradation; substantial drops occur only after more than 90% are removed. We further decompose experts into neuron-granular MoE and visualize their activation values, finding that most neuron activations are near zero. This observation motivates us to select only high-activation neuron experts during pretraining. Based on this insight, we propose Mixture of Neuron Experts (MoNE). MoNE achieves neuron-granular expert selection by only applying a simple top-k selection within each expert, incurs negligible latency, and requires no additional routing parameters or inter-expert communication. Extensive experiments demonstrate that MoNE matches traditional MoE performance while activating only 50% of the MoE-layer parameters, and it consistently outperforms traditional MoE when compared at equal numbers of activated parameters. These results suggest that MoNE is a practical approach to improving parameter utilization and inference efficiency in MoE-like models.

AI Insights

MoNE’s top‑k neuron selection matches traditional MoE performance while activating only 50 % of the MoE‑layer parameters.
An auxiliary load‑balance loss L_aux is the key driver of MoNE’s superior parameter utilization.
Pruning up to 60 % of the activated parameters per expert, ranked by gate‑projection activation magnitude, causes negligible accuracy loss; substantial drops occur only beyond 90 % pruning.
MoNE requires no extra routing parameters or inter‑expert communication, keeping inference latency minimal.
Neuron‑level sparsity analysis shows most activations are near zero, motivating selective expert activation.
When matched on activated parameters, MoNE consistently outperforms traditional MoE in both training loss and inference efficiency.
The study demonstrates that balancing expert load via L_aux yields measurable gains in both parameter efficiency and model performance.
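The neuron-level top-k selection described above can be sketched for a single expert. This is a minimal illustration under stated assumptions, not the MoNE implementation: it assumes a gated feed-forward expert (gate, up, and down projections with a SiLU gate), ranks neurons by the magnitude of their gate activation, and uses made-up weights. All names are hypothetical.

```python
import math


def silu(z):
    return z / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def expert_forward(x, w_gate, w_up, w_down, k=None):
    """Gated FFN expert that keeps only the top-k neurons by |gate activation|.

    With k=None (or k >= number of neurons) this is the dense expert.
    Returns the output vector and the sorted indices of the kept neurons.
    """
    gate = [silu(z) for z in matvec(w_gate, x)]
    up = matvec(w_up, x)
    n = len(gate)
    if k is None or k >= n:
        keep = set(range(n))
    else:
        ranked = sorted(range(n), key=lambda i: abs(gate[i]), reverse=True)
        keep = set(ranked[:k])
    # Neurons outside the top-k contribute nothing to the down projection.
    hidden = [gate[i] * up[i] if i in keep else 0.0 for i in range(n)]
    return matvec(w_down, hidden), sorted(keep)


# Made-up weights: 2 input dims, 4 hidden neurons, 2 output dims.
x = [1.0, -2.0]
w_gate = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, 1.0]]
w_up = [[1.0, 1.0], [1.0, -1.0], [2.0, 0.0], [0.0, 3.0]]
w_down = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]

dense_out, dense_kept = expert_forward(x, w_gate, w_up, w_down)
sparse_out, sparse_kept = expert_forward(x, w_gate, w_up, w_down, k=2)
print(sparse_kept, sparse_out)
```

The selection reuses activations the expert already computes, which is consistent with the abstract's claim that no extra routing parameters or inter-expert communication are needed.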

Bayesian Decision Making around Experts

University of Oxford

Abstract
Complex learning agents are increasingly deployed alongside existing experts, such as human operators or previously trained agents. However, it remains unclear how learners should optimally incorporate certain forms of expert data, which may differ in structure from the learner's own action-outcome experiences. We study this problem in the context of Bayesian multi-armed bandits, considering: (i) offline settings, where the learner receives a dataset of outcomes from the expert's optimal policy before interaction, and (ii) simultaneous settings, where the learner must choose at each step whether to update its beliefs based on its own experience, or based on the outcome simultaneously achieved by an expert. We formalize how expert data influences the learner's posterior, and prove that pretraining on expert outcomes tightens information-theoretic regret bounds by the mutual information between the expert data and the optimal action. For the simultaneous setting, we propose an information-directed rule where the learner processes the data source that maximizes its one-step information gain about the optimal action. Finally, we propose strategies for how the learner can infer when to trust the expert and when not to, safeguarding the learner in cases where the expert is ineffective or compromised. By quantifying the value of expert data, our framework provides practical, information-theoretic algorithms for agents to intelligently decide when to learn from others.

AI Insights

The TS variant updates posteriors with expert data weighted by a mutual‑information confidence term.
A particle‑filter lets the learner choose online whether to trust its own reward or the expert’s outcome, using a flexible prior.
Experiments show zero regret even with noisy experts, thanks to Bayesian safeguards that down‑weight misleading samples.
In adversarial tests where experts report optimal actions from hostile settings, the agent still converges to the true optimum by treating expert data as a noisy oracle.
The regret bound scales with I(A;D_expert), an interpretable measure of how much expert data tightens guarantees.
The information‑directed rule picks the data source that maximizes one‑step information gain about the optimal arm, balancing exploration and exploitation.
For deeper context, read Bubeck’s “Thompson Sampling” and Lattimore’s “Bandits and Experts.”
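The offline setting can be illustrated with a small particle approximation. The sketch below is an assumption-heavy toy, not the paper's algorithm: it uses a two-armed Bernoulli bandit with hypothetical uniform priors, assumes the expert always pulls the truly best arm, and assumes the learner sees only the expert's success/failure outcomes, not which arm was pulled. Each prior sample is reweighted by the likelihood of those outcomes under its own best arm.

```python
import random


def expert_informed_posterior(n_particles=5000, successes=18, failures=2, seed=0):
    """Reweight prior samples of arm means by the likelihood of expert outcomes.

    Returns the prior and posterior probability that arm 1 is the optimal arm.
    Priors (an illustrative choice): arm 0 ~ Uniform(0, 0.6), arm 1 ~ Uniform(0, 1).
    """
    rng = random.Random(seed)
    particles = [
        (rng.uniform(0.0, 0.6), rng.uniform(0.0, 1.0)) for _ in range(n_particles)
    ]
    # The expert plays the best arm, so outcomes are Bernoulli(max(theta)).
    weights = []
    for theta in particles:
        best_mean = max(theta)
        weights.append(best_mean ** successes * (1.0 - best_mean) ** failures)
    total = sum(weights)
    prior_p = sum(1 for t in particles if t[1] > t[0]) / n_particles
    posterior_p = sum(w for t, w in zip(particles, weights) if t[1] > t[0]) / total
    return prior_p, posterior_p


prior_p_arm1, posterior_p_arm1 = expert_informed_posterior()
print(prior_p_arm1, posterior_p_arm1)
```

Here a high expert success rate is hard to explain unless arm 1 is the optimal one, since arm 0 cannot exceed 0.6 under its prior. The expert outcomes therefore carry information about the optimal action before the learner has pulled a single arm, which is the quantity the regret bound in the abstract is stated in terms of.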

Deep Learning Models

From Detection to Mitigation: Addressing Bias in Deep Learning Models for Chest X-Ray Diagnosis

Abstract
Deep learning models have shown promise in improving diagnostic accuracy from chest X-rays, but they also risk perpetuating healthcare disparities when performance varies across demographic groups. In this work, we present a comprehensive bias detection and mitigation framework targeting sex-, age-, and race-based disparities when performing diagnostic tasks with chest X-rays. We extend a recent CNN-XGBoost pipeline to support multi-label classification and evaluate its performance across four medical conditions. We show that replacing the final layer of the CNN with an eXtreme Gradient Boosting classifier improves subgroup fairness while maintaining or improving the overall predictive performance. To validate its generalizability, we apply the method to different backbones, namely DenseNet-121 and ResNet-50, and achieve similarly strong performance and fairness outcomes, confirming its model-agnostic design. We further compare this lightweight adapter training method with traditional full-model training bias mitigation techniques, including adversarial training, reweighting, data augmentation, and active learning, and find that our approach offers competitive or superior bias reduction at a fraction of the computational cost. Finally, we show that combining eXtreme Gradient Boosting retraining with active learning yields the largest reduction in bias across all demographic subgroups, both in and out of distribution on the CheXpert and MIMIC datasets, establishing a practical and effective path toward equitable deep learning deployment in clinical radiology.
