ByteDance Seed, UC Berkeley
Abstract
Modern large language models leverage Mixture-of-Experts (MoE) architectures
for efficient scaling, but face a critical challenge: functionally similar
experts are often selected simultaneously, creating redundant computation and
limiting effective model capacity. Existing auxiliary balance loss methods
improve token distribution but fail to address the underlying expert diversity
problem. We introduce GatePro, a novel parameter-free method that directly
promotes expert selection diversity. GatePro identifies the most similar expert
pairs and introduces localized competition mechanisms, preventing redundant
expert co-activation while maintaining natural expert specialization. Our
comprehensive evaluation demonstrates GatePro's effectiveness across model
scales and benchmarks. Analysis shows that GatePro achieves enhanced expert
diversity: experts develop more distinct and complementary capabilities and
avoid functional redundancy. The approach can be deployed as a hot-swappable
component during any training phase without adding learnable parameters,
offering a practical solution for improving MoE effectiveness.
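The abstract states the mechanism only at a high level (find the most similar expert pair, then let its members compete locally), so below is a minimal, hypothetical PyTorch-style sketch of that idea. The function name, the choice of cosine similarity between gating-weight rows as the similarity measure, and per-token hard suppression as the competition rule are all assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def gate_with_local_competition(router_logits: torch.Tensor,
                                gate_weight: torch.Tensor,
                                top_k: int = 2,
                                enabled: bool = True):
    """Top-k routing with a GatePro-style localized competition step (sketch).

    router_logits: [num_tokens, num_experts] pre-softmax gating scores.
    gate_weight:   [num_experts, hidden_dim] rows of the gating projection,
                   used here as a proxy for expert similarity (an assumption).
    """
    logits = router_logits.clone()
    if enabled:
        # Cosine similarity between the experts' gating vectors.
        w = F.normalize(gate_weight, dim=-1)
        sim = w @ w.t()
        sim.fill_diagonal_(-1.0)  # exclude self-similarity

        # Identify the single most similar expert pair (i, j).
        i, j = divmod(int(sim.argmax()), sim.size(1))

        # Localized competition: for each token, only the higher-scoring member
        # of the pair stays eligible for top-k selection.
        i_wins = logits[:, i] >= logits[:, j]
        logits[:, j] = logits[:, j].masked_fill(i_wins, float("-inf"))
        logits[:, i] = logits[:, i].masked_fill(~i_wins, float("-inf"))

    top_vals, top_idx = logits.topk(top_k, dim=-1)
    return top_idx, F.softmax(top_vals, dim=-1)

# Example with illustrative shapes: 8 tokens, 16 experts, hidden size 32.
idx, weights = gate_with_local_competition(torch.randn(8, 16), torch.randn(16, 32))
```

Because the competition acts only on routing scores, it adds no learnable parameters, which is consistent with the parameter-free claim above.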
AI Insights
- GatePro’s competitive propagation forces experts to “compete for the spotlight,” boosting utilization patterns that linger after the mechanism is off.
- Because it’s parameter-free, you can swap GatePro on or off during training without tweaking hyper-parameters; it’s just a flag change (see the toggle sketch after this list).
- Longer GatePro exposure sharpens expert specialization, turning a crowd of similar models into a choir of complementary voices.
- The method levels the token‑distribution playing field, preventing a few experts from hogging all the data.
- A “training legacy effect” means GatePro’s benefits persist, giving a performance boost without extra runtime cost at inference.
- For deeper dives, see the MoE scaling papers: 2203.16535, 2106.14448, 2004.04722, 1905.09790, and 1802.05365.
- Note: GatePro can be computationally heavy during training, so plan resources accordingly.
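As a purely illustrative companion to the flag-change point above, toggling the mechanism mid-training could look like the loop below. It reuses the hypothetical gate_with_local_competition sketch from the earlier block, and the on/off schedule is an assumption, not a recommendation from the paper.

```python
import torch

# Illustrative hot-swap schedule (assumed): competition on for the first half
# of training, then switched off with no other changes.
num_tokens, num_experts, hidden = 8, 16, 32
for step in range(100):
    enabled = step < 50
    router_logits = torch.randn(num_tokens, num_experts)  # stand-in gate scores
    gate_weight = torch.randn(num_experts, hidden)        # stand-in gate matrix
    idx, w = gate_with_local_competition(
        router_logits, gate_weight, top_k=2, enabled=enabled
    )
    # ... dispatch tokens to experts `idx` with weights `w`, backprop, step ...
```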
University of Science, Ho
Abstract
We develop a unified statistical framework for softmax-gated Gaussian mixture
of experts (SGMoE) that addresses three long-standing obstacles in parameter
estimation and model selection: (i) non-identifiability of gating parameters up
to common translations, (ii) intrinsic gate-expert interactions that induce
coupled differential relations in the likelihood, and (iii) the tight
numerator-denominator coupling in the softmax-induced conditional density. Our
approach introduces Voronoi-type loss functions aligned with the gate-partition
geometry and establishes finite-sample convergence rates for the maximum
likelihood estimator (MLE). In over-specified models, we reveal a link between
the MLE's convergence rate and the solvability of an associated system of
polynomial equations characterizing near-nonidentifiable directions. For model
selection, we adapt dendrograms of mixing measures to SGMoE, yielding a
consistent, sweep-free selector of the number of experts that attains
pointwise-optimal parameter rates under overfitting while avoiding multi-size
training. Simulations on synthetic data corroborate the theory, accurately
recovering the expert count and achieving the predicted rates for parameter
estimation while closely approximating the regression function. Under model
misspecification (e.g., $\epsilon$-contamination), the dendrogram selection
criterion is robust, recovering the true number of mixture components, while
the Akaike information criterion, the Bayesian information criterion, and the
integrated completed likelihood tend to overselect as sample size grows. On a
maize proteomics dataset of drought-responsive traits, our dendrogram-guided
SGMoE selects two experts, exposes a clear mixing-measure hierarchy, stabilizes
the likelihood early, and yields interpretable genotype-phenotype maps,
outperforming standard criteria without multi-size training.
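For readers unfamiliar with the model class, the standard SGMoE conditional density (written here in generic notation; the paper's own parameterization may differ) makes obstacle (i), the translation non-identifiability of the gating parameters, easy to see:

$$
p(y \mid x) \;=\; \sum_{k=1}^{K}
\frac{\exp\!\left(\beta_{0k} + \beta_{1k}^{\top} x\right)}
     {\sum_{j=1}^{K} \exp\!\left(\beta_{0j} + \beta_{1j}^{\top} x\right)}\,
\mathcal{N}\!\left(y \mid a_k^{\top} x + b_k,\; \sigma_k^{2}\right).
$$

Replacing every $(\beta_{0k}, \beta_{1k})$ by $(\beta_{0k} + c_0,\, \beta_{1k} + c_1)$ for a common $(c_0, c_1)$ leaves each softmax weight, and hence $p(y \mid x)$, unchanged, which is exactly the non-identifiability up to common translations noted in (i).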
AI Insights
- Theorem 4 shows that for $\kappa > K_0$ the Voronoi loss $h_N(\kappa)$ decays as $(\log N / N)^{1/\bar{r}(\widehat{G}_N)}$, making $\mathrm{DSC}_N(\kappa)$ suboptimal.
- Thus $\mathrm{DSC}_N(\kappa)$ is minimized uniquely at $\kappa = K_0$, proving the dendrogram selector $\widehat{K}_N$ converges to the true expert count without sweeps.
- The MLE’s rate in over‑specified SGMoE links to solvability of polynomial equations that capture near‑nonidentifiable directions.
- With $\epsilon$-contamination, the dendrogram criterion stays robust, whereas AIC, BIC, and ICL over-select as $N$ grows.
- A maize proteomics case shows a two‑expert SGMoE stabilizes likelihood early, exposes a clear mixing‑measure hierarchy, and yields interpretable genotype‑phenotype maps.