Jiaming Yan, Jianchun Liu
Abstract
Mixture-of-Experts (MoE) has emerged as a promising architecture for modern
large language models (LLMs). However, massive parameters impose heavy GPU
memory (i.e., VRAM) demands, hindering the widespread adoption of MoE LLMs.
Offloading the expert parameters to CPU RAM offers an effective way to
alleviate the VRAM requirements for MoE inference. Existing approaches
typically cache a small subset of experts in VRAM and dynamically prefetch
experts from RAM during inference, leading to significant degradation in
inference speed due to the poor cache hit rate and substantial expert loading
latency. In this work, we propose MoEpic, an efficient MoE inference system
with a novel expert split mechanism. Specifically, each expert is vertically
divided into two segments: top and bottom. MoEpic caches the top segment of hot
experts, so that more experts can be stored under the limited VRAM budget,
thereby improving the cache hit rate. During each layer's inference, MoEpic
predicts and prefetches the activated experts for the next layer. Since the top
segments of cached experts are exempt from fetching, the loading time is
reduced, which allows efficient transfer-computation overlap. Nevertheless, the
performance of MoEpic critically depends on the cache configuration (i.e., each
layer's VRAM budget and expert split ratio). To this end, we propose a
divide-and-conquer algorithm based on fixed-point iteration for adaptive cache
configuration. Extensive experiments on popular MoE LLMs demonstrate that
MoEpic can save about half of the GPU cost, while lowering the inference
latency by about 37.51%-65.73% compared to the baselines.
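The sketch below gives a minimal PyTorch illustration of the idea of a vertically split expert: a two-matrix FFN expert is partitioned along its intermediate dimension into a GPU-resident top segment and a CPU-resident bottom segment that is transferred only when the expert is activated. The `SplitExpert` class, the choice of split axis, and the ReLU activation are assumptions made for illustration; MoEpic's actual segment layout, prefetching logic, and transfer scheduling are not reproduced here.

```python
import torch
import torch.nn as nn

class SplitExpert(nn.Module):
    """Illustrative vertical split of a two-matrix FFN expert along its
    intermediate dimension (an assumed split axis). The top slice stays
    resident in VRAM; the bottom slice lives in pinned CPU RAM and is
    copied in only when the expert fires. Assumes a CUDA device."""

    def __init__(self, w_up: torch.Tensor, w_down: torch.Tensor, split_ratio: float):
        super().__init__()
        d_ff = w_up.shape[0]                        # intermediate (hidden) dimension
        k = int(d_ff * split_ratio)                 # rows kept in the top segment
        # Top segment: cached on the GPU.
        self.up_top = nn.Parameter(w_up[:k].cuda(), requires_grad=False)
        self.down_top = nn.Parameter(w_down[:, :k].cuda(), requires_grad=False)
        # Bottom segment: pinned host memory, deliberately not registered as a
        # buffer so it is never moved to the GPU implicitly.
        self.up_bot_cpu = w_up[k:].pin_memory()
        self.down_bot_cpu = w_down[:, k:].pin_memory()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Partial output from the resident top segment.
        y = torch.relu(x @ self.up_top.T) @ self.down_top.T
        # Fetch the bottom segment; a real system would issue this copy a layer
        # ahead on a side CUDA stream so it overlaps with computation.
        up_bot = self.up_bot_cpu.to(x.device, non_blocking=True)
        down_bot = self.down_bot_cpu.to(x.device, non_blocking=True)
        # Because ReLU acts element-wise along the intermediate dimension, the
        # two partial FFNs sum to the original expert's output exactly.
        return y + torch.relu(x @ up_bot.T) @ down_bot.T
```

With this layout, a cache hit on the top segment means only the bottom rows must cross the CPU-GPU link, which is where the loading-time reduction described above comes from.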
AI Insights
- MoEpic splits each expert vertically into a lightweight top segment and a heavier bottom segment, allowing more experts to fit in limited VRAM.
- The top segments of hot experts are cached, boosting cache hit rates, while the bottom segments are fetched on demand during inference.
- A divide-and-conquer algorithm based on fixed-point iteration automatically tunes per-layer VRAM budgets and split ratios for optimal performance (see the sketch after this list).
- Dynamic prefetching of the next layer's activated experts overlaps data transfer with computation, cutting loading latency by up to 65%.
- MoEpic achieves up to 3.5× speedup over baseline MoE inference and can be combined with pruning or skipping techniques for further gains.
- The method assumes evenly distributed GPU resources and may struggle when inter-expert communication dominates, highlighting a key limitation.
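The cache-configuration step is only described at a high level above. As a rough illustration, the toy loop below alternates between re-balancing per-layer VRAM budgets and re-picking split ratios until the budgets stop changing, in the spirit of a fixed-point iteration. The cost model `est_latency`, the proportional re-balancing rule, and the coarse ratio grid are placeholders, not MoEpic's actual divide-and-conquer procedure.

```python
def configure_cache(num_layers, total_vram, est_latency,
                    ratio_grid=(0.25, 0.5, 0.75), iters=20, tol=1e-3):
    """Toy fixed-point-style search for per-layer VRAM budgets and expert
    split ratios. est_latency(layer, budget, ratio) is a placeholder cost
    model; MoEpic's objective and update rules are not reproduced here."""
    budgets = [total_vram / num_layers] * num_layers        # start from an even split
    ratios = [ratio_grid[len(ratio_grid) // 2]] * num_layers
    for _ in range(iters):
        lat = [est_latency(l, budgets[l], ratios[l]) for l in range(num_layers)]
        total = sum(lat)
        # Re-balance: give proportionally more VRAM to latency-dominant layers.
        new_budgets = [total_vram * lat[l] / total for l in range(num_layers)]
        # Refine each layer's split ratio independently (this per-layer search is
        # where a divide-and-conquer strategy over the ratio interval would sit).
        ratios = [min(ratio_grid, key=lambda r: est_latency(l, new_budgets[l], r))
                  for l in range(num_layers)]
        converged = max(abs(a - b) for a, b in zip(new_budgets, budgets)) < tol
        budgets = new_budgets
        if converged:
            break
    return budgets, ratios
```

A realistic cost model would fold in each layer's expected cache hit rate and the measured CPU-GPU transfer bandwidth rather than a hand-written latency estimate.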
Abstract
Linear Mode Connectivity (LMC) is a notable phenomenon in the loss landscapes
of neural networks, wherein independently trained models have been observed to
be connected--up to permutation symmetries--by linear paths in parameter space
along which the loss remains consistently low. This observation challenges
classical views of non-convex optimization and has implications for model
ensembling, generalization, and our understanding of neural loss geometry.
Inspired by recent studies on LMC in standard neural networks, we
systematically investigate this phenomenon within Mixture-of-Experts (MoE)
architectures--a class of models known for their scalability and computational
efficiency, which combine traditional neural networks--referred to as
experts--through a learnable gating mechanism. We begin by conducting a
comprehensive analysis of both dense and sparse gating regimes, demonstrating
that the symmetries inherent to MoE architectures are fully characterized by
permutations acting on both the expert components and the gating function.
Building on these foundational findings, we propose a matching algorithm that
enables alignment between independently trained MoEs, thereby facilitating the
discovery of LMC. Finally, we empirically validate the presence of LMC using
our proposed algorithm across diverse MoE configurations--including dense,
sparse, and shared-expert variants--under a wide range of model settings and
datasets of varying scales and modalities. Our results confirm the existence of
LMC in MoE architectures and offer fundamental insights into the functional
landscape and optimization dynamics of deep learning models.
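As a rough illustration of the matching-then-interpolation recipe described above, the sketch below aligns the experts of two independently trained MoE layers by solving a linear assignment over flattened expert weights, permutes the router's rows to match, and then measures the loss along the straight line between the two parameter vectors. The `.experts`/`.gate` interface, the dot-product similarity, and the bias-free gate are illustrative assumptions; the paper's matching algorithm handles the full set of MoE symmetries.

```python
import copy
import torch
from scipy.optimize import linear_sum_assignment

def match_experts(model_a, model_b):
    """Toy expert matching: reorder model_b's experts (and the corresponding
    rows of its gate) to best align with model_a before interpolating.
    Assumes both models expose `.experts` (an nn.ModuleList of same-shape
    modules) and a bias-free `.gate` nn.Linear whose outputs index experts."""
    n = len(model_a.experts)
    cost = torch.zeros(n, n)
    for i, ea in enumerate(model_a.experts):
        va = torch.cat([p.detach().flatten() for p in ea.parameters()])
        for j, eb in enumerate(model_b.experts):
            vb = torch.cat([p.detach().flatten() for p in eb.parameters()])
            cost[i, j] = -torch.dot(va, vb)      # negate: the solver minimizes cost
    _, perm = linear_sum_assignment(cost.numpy())
    perm = perm.tolist()
    model_b.experts = torch.nn.ModuleList([model_b.experts[j] for j in perm])
    with torch.no_grad():
        # Permute the gate's rows so routing stays consistent with the new order.
        model_b.gate.weight.copy_(model_b.gate.weight[perm])
    return model_b

def loss_along_path(model_a, model_b, loss_fn, alphas):
    """Evaluate loss_fn on models built from (1 - a) * theta_a + a * theta_b."""
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    probe = copy.deepcopy(model_a)
    losses = []
    for a in alphas:
        # Interpolate floating-point tensors; leave integer buffers untouched.
        mixed = {k: (1 - a) * sd_a[k] + a * sd_b[k] if sd_a[k].is_floating_point()
                 else sd_a[k]
                 for k in sd_a}
        probe.load_state_dict(mixed)
        losses.append(loss_fn(probe))            # e.g. average loss on a held-out batch
    return losses
```

A flat loss profile returned by `loss_along_path` after calling `match_experts` is the kind of evidence for LMC that the abstract reports; without the matching step, the same path typically shows a pronounced loss barrier.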