Hi!

Your personalized paper recommendations for 15 to 19 December, 2025.

Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models

University of Modena and

AI Insights

These models have the potential to improve human-computer interaction, decision-making, and problem-solving in various domains. [3]
The model learns to predict the missing information or reconstruct the input data. [3]
The field of multimodal learning is rapidly evolving with the development of new architectures and techniques. [2]

Abstract
Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLMs training. Our approach leverages frozen vision foundation models as context and target encoders, while training the predictor, implemented as the early layers of an LLM, to learn structural and semantic regularities from images without relying exclusively on language supervision. Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric benchmarks across different LLM families, without degrading multimodal reasoning abilities. Our source code is publicly available at: https://github.com/aimagelab/JARVIS.

Why we are recommending this paper?
Due to your Interest in: Multimodal Learning

This paper directly addresses the user's interest in Large Language Models and their multimodal capabilities, specifically focusing on visual reasoning, a key area within deep learning. The exploration of how MLLMs learn visual understanding aligns with the user's interest in multimodal learning and large language models.

PoseMoE: Mixture-of-Experts Network for Monocular 3D Human Pose Estimation

Peking University

AI Insights

The results show that incorporating the PoseMoE Decoder into the dual-branch network achieves a measurable improvement on its own, verifying the efficacy of the proposed strategies and mechanism for delayed, selective knowledge fusion. [3]
Mixture-of-experts (MoE) architecture: A type of neural network architecture where multiple experts are trained to handle different parts of the input data, and then combined using a gating mechanism. [3]
Multi-task learning: A technique in machine learning where multiple tasks are learned simultaneously, often with shared features or parameters. [3]
Dual-branch network: A type of neural network architecture that consists of two parallel branches, each processing different aspects of the input data. [3]
PoseMoE achieves state-of-the-art performance on several benchmark datasets and demonstrates robustness to noisy 2D pose input. [3]
The ablation studies confirm the effectiveness of each component within PoseMoE, including the multi-task learning baseline, PME (w/o 2D Expert), PME (w/o Depth Expert), and PME + PMD. [3]
The proposed method, PoseMoE, is a novel 3D human pose estimation framework that combines multi-task learning and mixture-of-experts architecture. [2]
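
As a minimal illustration of the gating mechanism described in the mixture-of-experts definition above, the following Python sketch combines two toy experts with softmax gate weights. The expert functions, gate scores, and function names are hypothetical and are not taken from PoseMoE.

```python
import math

def softmax(scores):
    """Convert raw gate scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, gate_scores):
    """Weighted sum of expert outputs; the weights come from a softmax gate."""
    weights = softmax(gate_scores)
    return sum(w * f(x) for w, f in zip(weights, experts))

# Two hypothetical experts (illustration only).
experts = [lambda x: 2.0 * x, lambda x: x + 1.0]

# Equal gate scores give equal weights, so the output is the mean of the experts.
y = moe_forward(3.0, experts, gate_scores=[0.0, 0.0])
print(y)  # 5.0
```

In a trained network the gate scores would be produced by a learned function of the input rather than fixed by hand.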

Abstract
The lifting-based methods have dominated monocular 3D human pose estimation by leveraging detected 2D poses as intermediate representations. The 2D component of the final 3D human pose benefits from the detected 2D poses, whereas its depth counterpart must be estimated from scratch. The lifting-based methods encode the detected 2D pose and unknown depth in an entangled feature space, explicitly introducing depth uncertainty to the detected 2D pose, thereby limiting overall estimation accuracy. This work reveals that the depth representation is pivotal for the estimation process. Specifically, when depth is in an initial, completely unknown state, jointly encoding depth features with 2D pose features is detrimental to the estimation process. In contrast, when depth is initially refined to a more dependable state via network-based estimation, encoding it together with 2D pose information is beneficial. To address this limitation, we present a Mixture-of-Experts network for monocular 3D pose estimation named PoseMoE. Our approach introduces: (1) A mixture-of-experts network where specialized expert modules refine the well-detected 2D pose features and learn the depth features. This mixture-of-experts design disentangles the feature encoding process for 2D pose and depth, therefore reducing the explicit influence of uncertain depth features on 2D pose features. (2) A cross-expert knowledge aggregation module is proposed to aggregate cross-expert spatio-temporal contextual information. This step enhances features through bidirectional mapping between 2D pose and depth. Extensive experiments show that our proposed PoseMoE outperforms the conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW.

Why we are recommending this paper?
Due to your Interest in: Mixture of Experts

Given the user's interest in deep learning architectures and models, this paper’s focus on Mixture-of-Experts for 3D human pose estimation is highly relevant. The use of a MoE network is a significant trend in deep learning, and the application to a visually-grounded task aligns with their broader interests.

Cornserve: Efficiently Serving Any-to-Any Multimodal Models

University of Michigan

AI Insights

Cornserve's planner decides to disaggregate the LLM (1 replica) and the audio generator (7 and 15 replicas on 8-GPU and 16-GPU cells, respectively) to balance the throughput of each component as much as possible. [3]
Qwen 3 Omni [45] and Qwen 2.5 Omni [44] are multimodal input and output models that take a combination of text, images, video, and audio as input and generate either text or audio. [3]
Any-to-Any models: generic models that can handle multiple input and output modalities. [2]
Cornserve improves the throughput of serving Qwen 2.5 Omni [44] over the baseline monolithic deployment on 8-GPU and 16-GPU cells by 3.09× and 3.81×, respectively. [1]
Qwen 2.5 Omni [44]: a multimodal input & output model that takes a combination of text, images, video, and audio as input, and generates either text or audio. [0]

Abstract
We present Cornserve, an efficient online serving system for an emerging class of multimodal models called Any-to-Any models. Any-to-Any models accept combinations of text and multimodal data (e.g., image, video, audio) as input and also generate combinations of text and multimodal data as output, introducing request type, computation path, and computation scaling heterogeneity in model serving. Cornserve allows model developers to describe the computation graph of generic Any-to-Any models, which consists of heterogeneous components such as multimodal encoders, autoregressive models like Large Language Models (LLMs), and multimodal generators like Diffusion Transformers (DiTs). Given this, Cornserve's planner automatically finds an optimized deployment plan for the model, including whether and how to disaggregate the model into smaller components based on model and workload characteristics. Cornserve's distributed runtime then executes the model per the plan, efficiently handling Any-to-Any model heterogeneity during online serving. Evaluations show that Cornserve can efficiently serve diverse Any-to-Any models and workloads, delivering up to 3.81× throughput improvement and up to 5.79× tail latency reduction over existing solutions.

Why we are recommending this paper?
Due to your Interest in: Multimodal Learning

This paper’s exploration of efficient serving systems for Any-to-Any multimodal models directly addresses the user's interest in large language models and their ability to handle diverse data types. The focus on optimization is a critical aspect of deep learning model deployment.

What Affects the Effective Depth of Large Language Models?

Peking University

AI Insights

KL divergence: A measure of the difference between two probability distributions. [3]
It is used to quantify how much one distribution diverges from another. [3]
Logit lens KL divergence: The KL divergence between the logit lens (a representation of the output distribution) of an early layer and the final output distribution. [3]
The results indicate that skipping a layer has varying effects on different models and datasets. [3]
Some models exhibit significant changes in their output distributions, while others show minimal impact. [3]
The KL divergence and logit lens KL divergence measures suggest that some models undergo substantial changes in their output distributions when a layer is skipped, whereas others remain relatively stable. [3]
The overlap measure indicates that the top-5 predictions of early layers often share common elements with the final distribution, suggesting that the model's decision-making process is consistent across different layers. [3]
The analysis is limited by its reliance on pre-trained models, which may not accurately represent the performance of the models when trained from scratch. [3]
The results are based on a specific set of models and datasets, and it is unclear whether the findings would generalize to other models and scenarios. [3]
The effects of skipping a layer on output distributions are shown in Figure 7. [2]
The results for KL divergence, logit lens KL divergence, and overlap between early layer distributions and the final distributions are presented in Figures 8 and 9. [0]
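
To make the KL divergence and top-5 overlap measures mentioned above concrete, here is a small Python sketch over toy probability vectors. The direction of the divergence, the smoothing constant `eps`, and the example distributions are assumptions for illustration and may differ from the paper's setup.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def top_k_overlap(p, q, k=5):
    """Fraction of the top-k indices of p that also appear in the top-k of q."""
    top_p = set(sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k])
    top_q = set(sorted(range(len(q)), key=lambda i: q[i], reverse=True)[:k])
    return len(top_p & top_q) / k

# Hypothetical next-token distributions from an early layer and the final layer.
early_layer = [0.25, 0.75]
final_layer = [0.5, 0.5]

divergence = kl_divergence(final_layer, early_layer)
overlap = top_k_overlap([0.5, 0.3, 0.2], [0.2, 0.3, 0.5], k=2)
```

A divergence near zero means the early-layer prediction already matches the final distribution, while a high overlap means the same candidate tokens are ranked near the top in both.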

Abstract
The scaling of large language models (LLMs) emphasizes increasing depth, yet performance gains diminish with added layers. Prior work introduces the concept of "effective depth", arguing that deeper models fail to fully utilize their layers for meaningful computation. Building on this, we systematically study how effective depth varies with model scale, training type, and task difficulty. First, we analyze the model behavior of Qwen-2.5 family (1.5B-32B) and find that while the number of effective layers grows with model size, the effective depth ratio remains stable. Besides, comparisons between base and corresponding long-CoT models show no increase in effective depth, suggesting that improved reasoning stems from longer context rather than deeper per-token computation. Furthermore, evaluations across tasks of varying difficulty indicate that models do not dynamically use more layers for harder problems. Our results suggest that current LLMs underuse available depth across scales, training paradigms and tasks of varying difficulties, pointing out research opportunities on increasing the layer utilization rate of LLMs, model pruning, and early exiting. Our code is released at https://github.com/AheadOFpotato/what_affects_effective_depth.

Why we are recommending this paper?
Due to your Interest in: Large Language Models

This paper tackles a fundamental question about the scaling of Large Language Models, which is central to the user's interest in deep learning models. Understanding the limitations of depth in LLMs is crucial for optimizing their performance and aligns with their interest in deep learning optimization.

INTELLECT-3: Technical Report

Prime Intellect

AI Insights

The paper describes the development of INTELLECT-3, a large-scale language model from Prime Intellect, which is designed to perform a wide range of tasks in various domains. [3]
The model is trained using a combination of supervised fine-tuning (SFT) and reinforcement learning (RL) phases. [3]
RL (Reinforcement Learning): A training phase in which the model learns to perform tasks across domains by interacting with tools and environments. [3]
R2E-Gym: A software engineering (SWE) environment used to train the model on tasks such as fixing issues in GitHub projects. [3]
The RL phase involves training the model to perform tasks in various domains using a combination of tools and environments, including Prime Sandboxes, a custom-built environment for hosting over 20,000 images containing pre-installed GitHub repositories. [2]
The SFT phase involves training the model on a large dataset of reasoning traces from various sources, including math, code, science, and tool splits from NVIDIA's Nemotron-Post-Training-Dataset-v1 and chat and instruction following splits from AM's AM-DeepSeek-R1-0528-Distilled dataset. [1]

Abstract
We present INTELLECT-3, a 106B-parameter Mixture-of-Experts model (12B active) trained with large-scale reinforcement learning on our end-to-end RL infrastructure stack. INTELLECT-3 achieves state of the art performance for its size across math, code, science and reasoning benchmarks, outperforming many larger frontier models. We open-source the model together with the full infrastructure stack used to create it, including RL frameworks, complete recipe, and a wide collection of environments, built with the verifiers library, for training and evaluation from our Environments Hub community platform. Built for this effort, we introduce prime-rl, an open framework for large-scale asynchronous reinforcement learning, which scales seamlessly from a single node to thousands of GPUs, and is tailored for agentic RL with first-class support for multi-turn interactions and tool use. Using this stack, we run both SFT and RL training on top of the GLM-4.5-Air-Base model, scaling RL training up to 512 H200s with high training efficiency.

Why we are recommending this paper?
Due to your Interest in: Mixture of Experts

Coming from Prime Intellect, this paper’s presentation of a 106B-parameter Mixture-of-Experts model is highly relevant to the user’s interest in large language models. The focus on reinforcement learning training and state-of-the-art performance aligns with the user’s interest in deep learning models and optimization.

Evaluation of deep learning architectures for wildlife object detection: A comparative study of ResNet and Inception

Chuka University

Abstract
Wildlife object detection plays a vital role in biodiversity conservation, ecological monitoring, and habitat protection. However, this task is often challenged by environmental variability, visual similarities among species, and intra-class diversity. This study investigates the effectiveness of two individual deep learning architectures, ResNet-101 and Inception v3, for wildlife object detection under such complex conditions. The models were trained and evaluated on a wildlife image dataset using a standardized preprocessing approach, which included resizing images to a maximum dimension of 800 pixels, converting them to RGB format, and transforming them into PyTorch tensors. A 70:30 training-validation split was used for model development. The ResNet-101 model achieved a classification accuracy of 94% and a mean Average Precision (mAP) of 0.91, showing strong performance in extracting deep hierarchical features. The Inception v3 model performed slightly better, attaining a classification accuracy of 95% and a mAP of 0.92, attributed to its efficient multi-scale feature extraction through parallel convolutions. Despite the strong results, both models exhibited challenges when detecting species with similar visual characteristics or those captured under poor lighting and occlusion. Nonetheless, the findings confirm that both ResNet-101 and Inception v3 are effective models for wildlife object detection tasks and provide a reliable foundation for conservation-focused computer vision applications.

AI Insights

The results demonstrate that ResNet-101 attains a classification accuracy of 94% with a mAP of 0.91, while Inception v3 slightly outperforms it with 95% accuracy and a mAP of 0.92. [3]
ResNet-101: A deep neural network architecture that uses residual connections to learn hierarchical features. [3]
Inception v3: A deep neural network architecture that uses parallel convolution paths to extract multi-scale features. [3]
Precision, Recall, and F1-score: Metrics for evaluating classification models, computed from the counts of true positives, false positives, and false negatives. [3]
The results provide valuable insights into the strengths and limitations of each model, and they lay the groundwork for further exploration of hybrid or ensemble techniques that may combine the advantages of both architectures for even higher performance in complex real-world scenarios. [3]
ResNet-101 and Inception v3 are both highly effective for wildlife object detection, each excelling under different visual and environmental complexities. [2]
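
The precision, recall, and F1-score definitions above can be sketched in a few lines of Python. The counts used here are hypothetical and are not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of predictions that were correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of true instances that were found
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts for one species class.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

With these counts, both precision and recall are 8 out of 10, so the F1-score is also 0.8.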

Why we are recommending this paper?
Due to your Interest in: Deep Learning Architectures

Application of Deep Learning in Biological Data Compression

The University of Hongk

Abstract
Cryogenic electron microscopy (Cryo-EM) has become an essential tool for capturing high-resolution biological structures. Despite its advantage in visualizations, the large storage size of Cryo-EM data files poses significant challenges for researchers and educators. This paper investigates the application of deep learning, specifically implicit neural representation (INR), to compress Cryo-EM biological data. The proposed approach first extracts the binary map of each file according to the density threshold. The density map is highly repetitive, which can be effectively compressed by GZIP. The neural network is then trained to encode spatial density information, allowing the storage of network parameters and learnable latent vectors. To improve reconstruction accuracy, I further incorporate positional encoding to enhance spatial representation and a weighted Mean Squared Error (MSE) loss function to balance density distribution variations. Using this approach, my aim is to provide a practical and efficient biological data compression solution that can be used for educational and research purposes, while maintaining a reasonable compression ratio and reconstruction quality from file to file.

AI Insights

The project establishes Implicit Neural Representation (INR) as a promising framework for Cryo-EM data compression, balancing efficiency and fidelity. [3]
The method achieves a compression ratio of approximately 10:1, reducing file sizes from 414 MB to around 40 MB, outperforming traditional GZIP compression. [3]
Experimental results demonstrate notable progress in surpassing GZIP's compression ratio and achieving high reconstruction quality for structurally significant areas. [3]
GZIP: a file format used for data compression that typically yields lower ratios on complex Cryo-EM data. [3]
INR (Implicit Neural Representation): a framework for representing scenes or data using neural networks, allowing for efficient and accurate reconstruction. [3]
Future work may focus on automating hyperparameter tuning and refining the INR architecture to reduce low-density errors. [3]
Limitations persist in low-density regions, where mean errors exceed 1000% due to noise and sparsity. [3]
The project establishes INR as a promising tool for Cryo-EM data management, particularly in resource-limited settings. [2]
Cryo-EM (Cryogenic Electron Microscopy): a technique used to determine the three-dimensional structure of macromolecules, such as proteins. [1]
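
The abstract mentions a weighted Mean Squared Error loss without giving its exact form. The Python sketch below shows one common formulation, normalized by the sum of the weights; the weighting scheme and the example values are assumptions, not the author's implementation.

```python
def weighted_mse(pred, target, weights):
    """Weighted mean squared error, normalized by the sum of the weights."""
    num = sum(w * (p - t) ** 2 for p, t, w in zip(pred, target, weights))
    return num / sum(weights)

# Hypothetical densities: the second voxel gets three times the weight of the first.
loss = weighted_mse(pred=[1.0, 2.0], target=[1.0, 4.0], weights=[1.0, 3.0])
print(loss)  # 3.0
```

Raising the weight on under-represented density ranges makes their reconstruction errors count more during training, which is one way to balance an uneven density distribution.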

Why we are recommending this paper?
Due to your Interest in: Deep Learning Architectures

Deep Learning Perspective of Scene Understanding in Autonomous Robots

National Textile Universt

Abstract
This paper provides a review of deep learning applications in scene understanding in autonomous robots, including innovations in object detection, semantic and instance segmentation, depth estimation, 3D reconstruction, and visual SLAM. It emphasizes how these techniques address limitations of traditional geometric models, improve depth perception in real time despite occlusions and textureless surfaces, and enhance semantic reasoning to understand the environment better. When these perception modules are integrated into dynamic and unstructured environments, they become more effective in decision-making, navigation and interaction. Lastly, the review outlines the existing problems and research directions to advance learning-based scene understanding of autonomous robots.

Why we are recommending this paper?
Due to your Interest in: Deep Learning

Towards Deep Learning Surrogate for the Forward Problem in Electrocardiology: A Scalable Alternative to Physics-Based Models

Kings College London

Abstract
The forward problem in electrocardiology, computing body surface potentials from cardiac electrical activity, is traditionally solved using physics-based models such as the bidomain or monodomain equations. While accurate, these approaches are computationally expensive, limiting their use in real-time and large-scale clinical applications. We propose a proof-of-concept deep learning (DL) framework as an efficient surrogate for forward solvers. The model adopts a time-dependent, attention-based sequence-to-sequence architecture to predict electrocardiogram (ECG) signals from cardiac voltage propagation maps. A hybrid loss combining Huber loss with a spectral entropy term was introduced to preserve both temporal and frequency-domain fidelity. Using 2D tissue simulations incorporating healthy, fibrotic, and gap junction-remodelled conditions, the model achieved high accuracy (mean R² = 0.99 ± 0.01). Ablation studies confirmed the contributions of convolutional encoders, time-aware attention, and spectral entropy loss. These findings highlight DL as a scalable, cost-effective alternative to physics-based solvers, with potential for clinical and digital twin applications.

AI Insights

A novel application of spectral entropy loss function is introduced, which enhances the model’s predictive performance under both homogeneous and inhomogeneous tissue conditions. [3]
DL: Deep Learning; CNN: Convolutional Neural Network; R²: Coefficient of Determination; MAE: Mean Absolute Error. [3]
This study shows that DL can accurately solve the forward problem in electrocardiology, matching in-silico ground truth from 2D voltage propagation. [3]
Our encoder–decoder model, enhanced with attention, time embeddings, and a spectral entropy loss, performed well in both homogeneous and inhomogeneous tissue. [3]
DL can effectively map voltage propagation patterns to extracellular signals. [2]
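
The two ingredients of the hybrid loss named in the abstract, Huber loss and spectral entropy, can be sketched in plain Python as follows. The abstract does not say how the two terms are weighted or combined, so this sketch only computes each quantity separately; the `delta` default, the naive DFT, and the example signals are assumptions.

```python
import cmath
import math

def huber(pred, target, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        r = abs(p - t)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(pred)

def spectral_entropy(signal):
    """Shannon entropy (nats) of the normalized power spectrum from a naive DFT."""
    n = len(signal)
    power = []
    for k in range(n // 2 + 1):  # non-negative frequencies only
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal))
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0:
        return 0.0
    probs = [p / total for p in power]
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical signals: one large error dominates the Huber term linearly.
h = huber(pred=[0.0, 3.0], target=[0.0, 0.0])
# A constant signal concentrates all power at zero frequency, so its entropy is near 0.
e = spectral_entropy([1.0, 1.0, 1.0, 1.0])
```

Intuitively, the Huber term penalizes pointwise errors in the time domain, while a spectral entropy term is sensitive to how signal power is distributed across frequencies.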

Why we are recommending this paper?
Due to your Interest in: Deep Learning Optimization

Linguists should learn to love speech-based deep learning models

University of Amsterdam

Abstract
Futrell and Mahowald present a useful framework bridging technology-oriented deep learning systems and explanation-oriented linguistic theories. Unfortunately, the target article's focus on generative text-based LLMs fundamentally limits fruitful interactions with linguistics, as many interesting questions on human language fall outside what is captured by written text. We argue that audio-based deep learning models can and should play a crucial role.

AI Insights

Speech-based deep learning models can capture linguistic structure without relying on pre-existing symbolic categories, as demonstrated by studies using representational probes. [3]
The bottleneck of text in linguistic modeling can be replaced by exploring audio-based deep learning models that learn from the speech signal itself. [3]
Linguistic interpretability studies in the speech domain have shown that speech-based models can capture higher-level patterns that make up spoken language, including phonemes, words, and morphophonological representations. [3]
The use of bidirectional connection weights in neural networks can model language as a unidirectional optimization problem, achieving bidirectional optimality. [3]
Self-supervised learning can lead to the emergence of human-like perception biases in speech models, such as the detection of algebraic auditory structures. [3]
Representational probes: A method used to investigate how neural networks represent and process linguistic information. [3]
Bidirectional connection weights: A type of neural network architecture where connections between units are bidirectional, allowing for both forward and backward flow of information. [3]
Self-supervised learning: A type of machine learning where models learn from raw data without explicit supervision or labeling. [3]
The use of speech-based deep learning models can provide a more nuanced understanding of linguistic structure and processing, moving beyond the limitations of text-based approaches. [3]
Further research is needed to fully explore the potential of audio-based deep learning models in linguistic modeling and interpretation. [3]

Why we are recommending this paper?
Due to your Interest in: Deep Learning Models

FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

University of Turku

Abstract
We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to HuggingFace Datasets, which include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise, non-random performance, and model ordering consistency, retaining only tasks that satisfy all criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at https://github.com/LumiOpen/lm-evaluation-harness. Supplementary resources are released in a separate repository at https://github.com/TurkuNLP/FIN-bench-v2.

Why we are recommending this paper?
Due to your Interest in: Large Language Models

Pattern-Guided Diffusion Models

University of Pennsylvann

Abstract
Diffusion models have shown promise in forecasting future data from multivariate time series. However, few existing methods account for recurring structures, or patterns, that appear within the data. We present Pattern-Guided Diffusion Models (PGDM), which leverage inherent patterns within temporal data for forecasting future time steps. PGDM first extracts patterns using archetypal analysis and estimates the most likely next pattern in the sequence. By guiding predictions with this pattern estimate, PGDM makes more realistic predictions that fit within the set of known patterns. We additionally introduce a novel uncertainty quantification technique based on archetypal analysis, and we dynamically scale the guidance level based on the pattern estimate uncertainty. We apply our method to two well-motivated forecasting applications, predicting visual field measurements and motion capture frames. On both, we show that pattern guidance improves PGDM's performance (MAE / CRPS) by up to 40.67% / 56.26% and 14.12% / 14.10%, respectively. PGDM also outperforms baselines by up to 65.58% / 84.83% and 93.64% / 92.55%.

AI Insights

Diffusion Models: A type of generative model that learns to transform a noise signal into a data sample through a series of steps, where each step is represented by a learnable transformation. [3]
The use of archetypal analysis as a pre-processing step enables the model to learn more meaningful patterns from data, leading to improved performance. [3]
Experiments on two case studies demonstrate that PGDM outperforms state-of-the-art methods in terms of accuracy and robustness. [2]
The paper presents a novel approach to pattern-guided sequence prediction using diffusion models, which leverages the concept of archetypal analysis to extract meaningful patterns from data. [1]

Why we are recommending this paper?
Due to your Interest in: Diffusion Models

Corrective Diffusion Language Models

University of WisconsinM

Abstract
Diffusion language models are structurally well-suited for iterative error correction, as their non-causal denoising dynamics allow arbitrary positions in a sequence to be revised. However, standard masked diffusion language model (MDLM) training fails to reliably induce this behavior, as models often cannot identify unreliable tokens in a complete input, rendering confidence-guided refinement ineffective. We study corrective behavior in diffusion language models, defined as the ability to assign lower confidence to incorrect tokens and iteratively refine them while preserving correct content. We show that this capability is not induced by conventional masked diffusion objectives and propose a correction-oriented post-training principle that explicitly supervises visible incorrect tokens, enabling error-aware confidence and targeted refinement. To evaluate corrective behavior, we introduce the Code Revision Benchmark (CRB), a controllable and executable benchmark for assessing error localization and in-place correction. Experiments on code revision tasks and controlled settings demonstrate that models trained with our approach substantially outperform standard MDLMs in correction scenarios, while also improving pure completion performance. Our code is publicly available at https://github.com/zhangshuibai/CDLM.
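
The idea of confidence-guided refinement described above can be illustrated with a toy Python sketch: positions whose confidence falls below a threshold are revised, and all other tokens are preserved. The threshold, the `propose` callback, and the example tokens are hypothetical stand-ins and do not reflect the CDLM implementation.

```python
def select_tokens_to_revise(confidences, threshold=0.5):
    """Return the positions whose confidence falls below the threshold."""
    return [i for i, c in enumerate(confidences) if c < threshold]

def refine_once(tokens, confidences, propose, threshold=0.5):
    """Replace only low-confidence tokens, leaving the rest untouched."""
    revised = list(tokens)
    for i in select_tokens_to_revise(confidences, threshold):
        revised[i] = propose(i, tokens)  # stand-in for the model's new prediction
    return revised

# Hypothetical sequence in which the middle token is flagged as unreliable.
tokens = ["x", "=", "="]
confidences = [0.9, 0.8, 0.2]
revised = refine_once(tokens, confidences, propose=lambda i, toks: "1")
print(revised)  # ['x', '=', '1']
```

This only works if the model assigns low confidence to the tokens that are actually wrong, which is the capability the paper argues standard masked diffusion training does not reliably provide.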

Why we are recommending this paper?
Due to your Interest in: Diffusion Models

Help us improve your experience!

This project is in its early stages, and your feedback can be pivotal to its future. Let us know what you think about this week's papers and suggestions!
