Hi!

Your personalized paper recommendations for 05 to 09 January, 2026.

UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision

USTC

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

Abstract
While Unified Multimodal Models (UMMs) have achieved remarkable success in cross-modal comprehension, a significant gap persists in their ability to leverage such internal knowledge for high-quality generation. We formalize this discrepancy as Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. To address this, we propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. By partitioning a single UMM into three collaborative roles: Proposer, Solver, and Judge, UniCorn generates high-quality interactions via self-play and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark based on a Text to Image to Text reconstruction loop. Extensive experiments demonstrate that UniCorn achieves comprehensive and substantial improvements over the base model across six general image generation benchmarks. Notably, it achieves SOTA performance on TIIF(73.8), DPG(86.8), CompBench(88.5), and UniCycle while further delivering substantial gains of +5.0 on WISE and +6.5 on OneIG. These results highlight that our method significantly enhances T2I generation while maintaining robust comprehension, demonstrating the scalability of fully self-supervised refinement for unified multimodal intelligence.

Why we are recommending this paper?
Due to your Interest in Multimodal Learning

This paper explores UniCorn, a novel approach to improving Unified Multimodal Models by leveraging self-generated supervision – directly addressing the challenge of knowledge utilization in these models. Given the user’s interest in LLMs and multimodal learning, this work offers a promising direction for advancement.

Beyond the Black Box: Theory and Mechanism of Large Language Models

Renmin University of China

Rate paper: 👍 👎 ♥ Save

AI Insights

Transformers are a type of neural network that can process sequential data. [3]
Recent studies have shown that they can approximate certain functions and solve specific tasks, but there are also limitations to their capabilities. [3]
Upper bounds: Upper bounds typically rely on specific constructions to demonstrate that a model can represent a given function or solve a given task. [2]
The community has renewed interest in the expressive capacity of Transformers, especially in terms of universal approximation. [1]

Abstract
The rapid emergence of Large Language Models (LLMs) has precipitated a profound paradigm shift in Artificial Intelligence, delivering monumental engineering successes that increasingly impact modern society. However, a critical paradox persists within the current field: despite the empirical efficacy, our theoretical understanding of LLMs remains disproportionately nascent, forcing these systems to be treated largely as ``black boxes''. To address this theoretical fragmentation, this survey proposes a unified lifecycle-based taxonomy that organizes the research landscape into six distinct stages: Data Preparation, Model Preparation, Training, Alignment, Inference, and Evaluation. Within this framework, we provide a systematic review of the foundational theories and internal mechanisms driving LLM performance. Specifically, we analyze core theoretical issues such as the mathematical justification for data mixtures, the representational limits of various architectures, and the optimization dynamics of alignment algorithms. Moving beyond current best practices, we identify critical frontier challenges, including the theoretical limits of synthetic data self-improvement, the mathematical bounds of safety guarantees, and the mechanistic origins of emergent intelligence. By connecting empirical observations with rigorous scientific inquiry, this work provides a structured roadmap for transitioning LLM development from engineering heuristics toward a principled scientific discipline.

Why we are recommending this paper?
Due to your Interest in Large Language Models

Coming from Renmin University of China, this paper directly tackles the fundamental paradox surrounding Large Language Models, which aligns perfectly with the user’s interest in LLMs. It seeks to provide a theoretical understanding of these powerful models, a critical need given their rapid development.

Variational Inference, Entropy, and Orthogonality: A Unified Theory of Mixture-of-Experts

University of Chinese Academy of Sciences

Rate paper: 👍 👎 ♥ Save

Abstract
Mixture-of-Experts models enable large language models to scale efficiently, as they only activate a subset of experts for each input. Their core mechanisms, Top-k routing and auxiliary load balancing, remain heuristic, however, lacking a cohesive theoretical underpinning to support them. To this end, we build the first unified theoretical framework that rigorously derives these practices as optimal sparse posterior approximation and prior regularization from a Bayesian perspective, while simultaneously framing them as mechanisms to minimize routing ambiguity and maximize channel capacity from an information-theoretic perspective. We also pinpoint the inherent combinatorial hardness of routing, defining it as the NP-hard sparse subset selection problem. We rigorously prove the existence of a "Coherence Barrier"; when expert representations exhibit high mutual coherence, greedy routing strategies theoretically fail to recover the optimal expert subset. Importantly, we formally verify that imposing geometric orthogonality in the expert feature space is sufficient to narrow the divide between the NP-hard global optimum and polynomial-time greedy approximation. Our comparative analyses confirm orthogonality regularization as the optimal engineering relaxation for large-scale models. Our work offers essential theoretical support and technical assurance for a deeper understanding and novel designs of MoE.

Why we are recommending this paper?
Due to your Interest in Mixture of Experts

This paper from the University of Chinese Academy of Sciences offers a theoretical foundation for Mixture-of-Experts models, a key area of interest for the user. The focus on unifying the mechanisms within this architecture is highly relevant to their exploration of deep learning models.

Mass Concept Erasure in Diffusion Models with Concept Hierarchy

Zhejiang University

Rate paper: 👍 👎 ♥ Save

Abstract
The success of diffusion models has raised concerns about the generation of unsafe or harmful content, prompting concept erasure approaches that fine-tune modules to suppress specific concepts while preserving general generative capabilities. However, as the number of erased concepts grows, these methods often become inefficient and ineffective, since each concept requires a separate set of fine-tuned parameters and may degrade the overall generation quality. In this work, we propose a supertype-subtype concept hierarchy that organizes erased concepts into a parent-child structure. Each erased concept is treated as a child node, and semantically related concepts (e.g., macaw, and bald eagle) are grouped under a shared parent node, referred to as a supertype concept (e.g., bird). Rather than erasing concepts individually, we introduce an effective and efficient group-wise suppression method, where semantically similar concepts are grouped and erased jointly by sharing a single set of learnable parameters. During the erasure phase, standard diffusion regularization is applied to preserve denoising process in unmasked regions. To mitigate the degradation of supertype generation caused by excessive erasure of semantically related subtypes, we propose a novel method called Supertype-Preserving Low-Rank Adaptation (SuPLoRA), which encodes the supertype concept information in the frozen down-projection matrix and updates only the up-projection matrix during erasure. Theoretical analysis demonstrates the effectiveness of SuPLoRA in mitigating generation performance degradation. We construct a more challenging benchmark that requires simultaneous erasure of concepts across diverse domains, including celebrities, objects, and pornographic content.

Why we are recommending this paper?
Due to your Interest in Diffusion Models

Addressing the growing concerns around unsafe content generation in diffusion models, this research directly relates to the user’s interest in diffusion models and deep learning. The concept erasure approach provides a valuable technique for mitigating potential risks.

Challenges and Research Directions for Large Language Model Inference Hardware

Google

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

Autoregressive Decode is a major challenge for memory and interconnect latency, exacerbated by MoE, reasoning, multimodal data, RAG, and long input/output sequences. [3]
The computer architecture community has made great contributions on challenges when a realistic simulator was available, as it has previously for branch prediction and cache design. [3]
HBF: High Bandwidth Flash PNM: Processing-Near-Memory PIM: Processing-In-Memory 3D Stacking: A technique for increasing bandwidth by stacking memory layers on top of each other The current AI hardware philosophy is a mismatch to LLM Decode inference, and improving memory and network along four directions (HBF, PNM, 3D Stacking, and low latency interconnect) could unlock collaborative work towards important and urgent innovations. [3]
The increasing importance and difficulty of inference for Large Language Models (LLMs) is an attractive research target. [2]

Abstract
Large Language Model (LLM) inference is hard. The autoregressive Decode phase of the underlying Transformer model makes LLM inference fundamentally different from training. Exacerbated by recent AI trends, the primary challenges are memory and interconnect rather than compute. To address these challenges, we highlight four architecture research opportunities: High Bandwidth Flash for 10X memory capacity with HBM-like bandwidth; Processing-Near-Memory and 3D memory-logic stacking for high memory bandwidth; and low-latency interconnect to speedup communication. While our focus is datacenter AI, we also review their applicability for mobile devices.

Why we are recommending this paper?
Due to your Interest in Large Language Models

This paper from Google directly addresses the practical challenges of deploying Large Language Models, a crucial area for the user’s interest in deep learning optimization and hardware considerations. The focus on memory and interconnect issues is highly pertinent to efficient LLM inference.

On average population levels for models with directed diffusion in heterogeneous environments

University of Calgary

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

Abstract
In 2006 (J. Differential Equ.), Lou proved that, once the intrinsic growth rate $r$ in the logistic model is proportional to the spatially heterogeneous carrying capacity $K$ ($r=K^1$), the total population under the regular diffusion exceeds the total of the carrying capacity. He also conjectured that the dependency of the total population on the diffusion coefficient is unimodal, increasing to its maximum and then decreasing to the asymptote which is the total of the carrying capacity. DeAngelis et al (J. Math. Biol. 2016) argued that the prevalence of the population over the carrying capacity is only observed when the growth rate and the carrying capacity are positively correlated, at least for slow dispersal. Guo et al (J. Math. Biol. 2020) justified that, once $r$ is constant ($r=K^0$), the total population is less than the cumulative carrying capacity. Our paper fills up the gap for when $r=K^λ$ for any real $λ$, disproving an assumption that there is a critical $λ^{\ast} \in (0,1)$ at which the tendency of the prevalence of the carrying capacity over the total population size changes, demonstrating instead that the relationship is more complicated. In addition, we explore the dependency of the total population size on the diffusion coefficient when the third parameter of the dispersal strategy $P$ is involved: the diffusion term is $d Δ(u/P)$, not just $d Δu$, for any $λ$. We outline some differences from the random diffusion case, in particular, concerning the profile of the total population as a function of the diffusion coefficient.

Why we are recommending this paper?
Due to your Interest in Diffusion Models

The Illusion of Specialization: Unveiling the Domain-Invariant "Standing Committee" in Mixture-of-Experts Models

Georgia Institute of Technology

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

Abstract
Mixture of Experts models are widely assumed to achieve domain specialization through sparse routing. In this work, we question this assumption by introducing COMMITTEEAUDIT, a post hoc framework that analyzes routing behavior at the level of expert groups rather than individual experts. Across three representative models and the MMLU benchmark, we uncover a domain-invariant Standing Committee. This is a compact coalition of routed experts that consistently captures the majority of routing mass across domains, layers, and routing budgets, even when architectures already include shared experts. Qualitative analysis further shows that Standing Committees anchor reasoning structure and syntax, while peripheral experts handle domain-specific knowledge. These findings reveal a strong structural bias toward centralized computation, suggesting that specialization in Mixture of Experts models is far less pervasive than commonly believed. This inherent bias also indicates that current training objectives, such as load-balancing losses that enforce uniform expert utilization, may be working against the model's natural optimization path, thereby limiting training efficiency and performance.

Why we are recommending this paper?
Due to your Interest in Mixture of Experts

Optimization of Deep Learning Models for Radio Galaxy Classification

Zurich University of Applied Sciences ZHAW

Rate paper: 👍 👎 ♥ Save

Abstract
Modern radio telescope surveys, capable of detecting billions of galaxies in wide-field surveys, have made manual morphological classification impracticable. This applies in particular when the Square Kilometre Array Observatory (SKAO) becomes operable in 2027, which is expected to close an important gap in our understanding of the Epoch of Reionization (EoR) and other areas of astrophysics. To this end, foreground objects, contaminants of the 21-cm signal, need to be identified and subtracted. Source finding and identification is thus an important albeit challenging task. We investigate the ability of AI and deep learning (DL) methods that have been previously trained on other data domains to localize and classify radio galaxies with minimal changes to their architectures. Various well-known pretrained neural network architectures for image classification and object detection are trained and fine-tuned and their performance is evaluated on a public radio galaxy dataset derived from the Radio Galaxy Zoo. A comparison between convolutional neural network (CNN)- and transformer-based algorithms is performed. The best performing architecture is systematically optimized and an uncertainty estimation is performed by means of an ensemble analysis. Radio source classification performance nearly comparable to the current leading customized models can be obtained using existing standard pretrained DL architectures, without modification and increase in complexity of the model architectures but rather adaptation of the data, by combining various transformations on replicated image channels. Using an ensemble of models can also further improve performance to over 90% accuracy, on par with top-performing models in the literature. The results can be transferred to other survey data, e.g. from the Murchison Wide-field Array (MWA), and in the future be used to study the EoR with the SKAO.

Why we are recommending this paper?
Due to your Interest in Deep Learning Models

Stochastic Deep Learning: A Probabilistic Framework for Modeling Uncertainty in Structured Temporal Data

University of Essex

Rate paper: 👍 👎 ♥ Save

Abstract
I propose a novel framework that integrates stochastic differential equations (SDEs) with deep generative models to improve uncertainty quantification in machine learning applications involving structured and temporal data. This approach, termed Stochastic Latent Differential Inference (SLDI), embeds an Itô SDE in the latent space of a variational autoencoder, allowing for flexible, continuous-time modeling of uncertainty while preserving a principled mathematical foundation. The drift and diffusion terms of the SDE are parameterized by neural networks, enabling data-driven inference and generalizing classical time series models to handle irregular sampling and complex dynamic structure. A central theoretical contribution is the co-parameterization of the adjoint state with a dedicated neural network, forming a coupled forward-backward system that captures not only latent evolution but also gradient dynamics. I introduce a pathwise-regularized adjoint loss and analyze variance-reduced gradient flows through the lens of stochastic calculus, offering new tools for improving training stability in deep latent SDEs. My paper unifies and extends variational inference, continuous-time generative modeling, and control-theoretic optimization, providing a rigorous foundation for future developments in stochastic probabilistic machine learning.

Why we are recommending this paper?
Due to your Interest in Deep Learning Models

Pixel-Wise Multimodal Contrastive Learning for Remote Sensing Images

Wageningen University & Research

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

Abstract
Satellites continuously generate massive volumes of data, particularly for Earth observation, including satellite image time series (SITS). However, most deep learning models are designed to process either entire images or complete time series sequences to extract meaningful features for downstream tasks. In this study, we propose a novel multimodal approach that leverages pixel-wise two-dimensional (2D) representations to encode visual property variations from SITS more effectively. Specifically, we generate recurrence plots from pixel-based vegetation index time series (NDVI, EVI, and SAVI) as an alternative to using raw pixel values, creating more informative representations. Additionally, we introduce PIxel-wise Multimodal Contrastive (PIMC), a new multimodal self-supervision approach that produces effective encoders based on two-dimensional pixel time series representations and remote sensing imagery (RSI). To validate our approach, we assess its performance on three downstream tasks: pixel-level forecasting and classification using the PASTIS dataset, and land cover classification on the EuroSAT dataset. Moreover, we compare our results to state-of-the-art (SOTA) methods on all downstream tasks. Our experimental results show that the use of 2D representations significantly enhances feature extraction from SITS, while contrastive learning improves the quality of representations for both pixel time series and RSI. These findings suggest that our multimodal method outperforms existing models in various Earth observation tasks, establishing it as a robust self-supervision framework for processing both SITS and RSI. Code avaliable on

Why we are recommending this paper?
Due to your Interest in Multimodal Learning

From Memorization to Creativity: LLM as a Designer of Novel Neural-Architectures

University of Wrzburg

Rate paper: 👍 👎 ♥ Save

Abstract
Large language models (LLMs) excel in program synthesis, yet their ability to autonomously navigate neural architecture design--balancing syntactic reliability, performance, and structural novelty--remains underexplored. We address this by placing a code-oriented LLM within a closed-loop synthesis framework, analyzing its evolution over 22 supervised fine-tuning cycles. The model synthesizes PyTorch convolutional networks which are validated, evaluated via low-fidelity performance signals (single-epoch accuracy), and filtered using a MinHash-Jaccard criterion to prevent structural redundancy. High-performing, novel architectures are converted into prompt-code pairs for iterative fine-tuning via parameter-efficient LoRA adaptation, initialized from the LEMUR dataset. Across cycles, the LLM internalizes empirical architectural priors, becoming a robust generator. The valid generation rate stabilizes at 50.6 percent (peaking at 74.5 percent), while mean first-epoch accuracy rises from 28.06 percent to 50.99 percent, and the fraction of candidates exceeding 40 percent accuracy grows from 2.04 percent to 96.81 percent. Analyses confirm the model moves beyond replicating existing motifs, synthesizing 455 high-performing architectures absent from the original corpus. By grounding code synthesis in execution feedback, this work provides a scalable blueprint for transforming stochastic generators into autonomous, performance-driven neural designers, establishing that LLMs can internalize empirical, non-textual rewards to transcend their training data.

Why we are recommending this paper?
Due to your Interest in Deep Learning Architectures

Interests not found

We did not find any papers that match the below interests. Try other terms also consider if the content exists in arxiv.org.

Deep Learning Optimization
Deep Learning

You can edit or add more interests any time.

💬 Help Shape Our Pricing

We're exploring pricing options to make this project sustainable. Take 3 minutes to share what you'd be willing to pay (if anything). Your input guides our future investment.

Share Your Feedback

Help us improve your experience!

This project is on its early stages your feedback can be pivotal on the future of the project. Let us know what you think about this week's papers and suggestions!

Give Feedback