Hi!

Your personalized paper recommendations for 26 to 30 January 2026.
Tsinghua University
Rate paper: 👍 👎 ♥ Save
AI Insights
  • Diffusion Modeling: A generative paradigm that synthesizes data by reversing a continuous corruption process. (ML: 0.95)👍👎
  • The training objective for standard AR models is to minimize the negative log-likelihood (NLL). (ML: 0.92)👍👎
  • Autoregressive (AR) modeling: A generative paradigm that predicts tokens sequentially, one at a time. (ML: 0.91)👍👎
  • Masked Autoregressive (MAR) modeling: A variant of AR modeling that uses a mask to predict missing tokens. (ML: 0.91)👍👎
  • Diffusion Modeling generates data by reversing a continuous corruption process, with two frameworks: Denoising Diffusion Probabilistic Models (DDPM) and the deterministic ODE framework (Flow Matching). (ML: 0.90)👍👎
  • Autoregressive (AR), Diffusion Modeling, and Masked Autoregressive (MAR) modeling are three primary generative paradigms utilized in UMMs. (ML: 0.90)👍👎
  • Denoising Diffusion Probabilistic Models (DDPM): A framework for generating data by reversing a continuous corruption process, using the reparameterization trick. (ML: 0.87)👍👎
  • Unified Multimodal Model (UMM): A model that can handle multiple input modalities and generate output in various formats. (ML: 0.86)👍👎
  • Unified Multimodal Models (UMMs) have been a significant area of research in recent years. (ML: 0.85)👍👎
  • MaskGit employs a masked prediction strategy to enable parallel token prediction and utilize bidirectional context. (ML: 0.80)👍👎
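The standard AR training objective quoted above (minimizing NLL) is easy to make concrete. Below is a minimal NumPy sketch, assuming a toy vocabulary and made-up logits; it is a generic illustration of the objective, not code from the paper.

```python
import numpy as np

def ar_nll(logits, targets):
    """Mean negative log-likelihood, the standard AR training objective.

    logits: (T, V) unnormalized scores over a vocabulary of V tokens.
    targets: (T,) ground-truth next token at each of T positions.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-probability of each target token, averaged over positions.
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy example: 3 positions, vocabulary of 4 tokens, peaked on the targets.
logits = np.array([[2.0, 0.1, 0.1, 0.1],
                   [0.1, 2.0, 0.1, 0.1],
                   [0.1, 0.1, 2.0, 0.1]])
targets = np.array([0, 1, 2])
loss = ar_nll(logits, targets)
```

Minimizing this quantity over a dataset is exactly the NLL objective the insight refers to.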
Abstract
Unified Multimodal Models (UMMs) integrate both visual understanding and generation within a single framework. Their ultimate aspiration is to create a cycle where understanding and generation mutually reinforce each other. While recent post-training methods have successfully leveraged understanding to enhance generation, the reverse direction of utilizing generation to improve understanding remains largely unexplored. In this work, we propose UniMRG (Unified Multi-Representation Generation), a simple yet effective architecture-agnostic post-training method. UniMRG enhances the understanding capabilities of UMMs by incorporating auxiliary generation tasks. Specifically, we train UMMs to generate multiple intrinsic representations of input images, namely pixel (reconstruction), depth (geometry), and segmentation (structure), alongside standard visual understanding objectives. By synthesizing these diverse representations, UMMs capture complementary information regarding appearance, spatial relations, and structural layout. Consequently, UMMs develop a deeper and more comprehensive understanding of visual inputs. Extensive experiments across diverse UMM architectures demonstrate that our method notably enhances fine-grained perception, reduces hallucinations, and improves spatial understanding, while simultaneously boosting generation capabilities.
Why are we recommending this paper?
Due to your interest in multimodal models

This paper explores the core concept of unified multimodal models, aligning directly with your interest in multimodal models and in frameworks where understanding and generation mutually reinforce each other. The focus on this bidirectional cycle is particularly relevant to your research interests.
Tianjin University
Rate paper: 👍 👎 ♥ Save
Paper visualization
Rate image: 👍 👎
AI Insights
  • However, the choice of the number of diffusion steps is currently empirical, and an optimal step count cannot be precisely determined and may vary across tasks. (ML: 0.91)👍👎
  • Invertible Neural Networks: A type of neural network that can be inverted, allowing for the recovery of the input from the output. (ML: 0.91)👍👎
  • The proposed model, called RED (Reversible Efficient Diffusion), achieves superior performance compared to existing methods and demonstrates strong generalization across diverse scenarios and tasks. (ML: 0.86)👍👎
  • The paper proposes a reversible fusion paradigm that significantly reduces memory usage, enabling end-to-end training of diffusion models without relying on Markov chains. (ML: 0.86)👍👎
  • Denoising Diffusion Probabilistic Models: A class of models that use a Markov chain to model the diffusion process and can be used for image synthesis and other tasks. (ML: 0.85)👍👎
  • The reversible fusion paradigm adopted by RED introduces a time-for-space trade-off, which results in slower training times. (ML: 0.81)👍👎
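The Markov-chain diffusion process these insights refer to can be illustrated with the standard DDPM closed-form forward step. This is a textbook-style sketch, assuming a linear beta schedule (the paper may use a different one); it is not the RED model itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule over T steps (a common, assumed choice).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = np.ones(4096)            # stand-in for a flattened image
x_late = q_sample(x0, T - 1)  # near-pure Gaussian noise by the final step
```

Each step compounds noise, which is the error-accumulation mechanism the abstract identifies as a source of detail loss in fusion results.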
Abstract
Multi-modal image fusion aims to consolidate complementary information from diverse source images into a unified representation. The fused image is expected to preserve fine details and maintain high visual fidelity. While diffusion models have demonstrated impressive generative capabilities in image generation, they often suffer from detail loss when applied to image fusion tasks. This issue arises from the accumulation of noise errors inherent in the Markov process, leading to inconsistency and degradation in the fused results. However, incorporating explicit supervision into end-to-end training of diffusion-based image fusion introduces challenges related to computational efficiency. To address these limitations, we propose the Reversible Efficient Diffusion (RED) model, an explicitly supervised training framework that inherits the powerful generative capability of diffusion models while avoiding explicit distribution estimation.
Why are we recommending this paper?
Due to your interest in Image Processing

Given your interest in image fusion, this paper’s exploration of diffusion models for fusing multi-modal images offers a promising approach to combining different visual sources. The emphasis on detail preservation and visual fidelity is a key aspect of image processing.
Netherlands Cancer Institute
Rate paper: 👍 👎 ♥ Save
Paper visualization
Rate image: 👍 👎
AI Insights
  • SONIC is designed for image classification tasks and achieves state-of-the-art results on several benchmarks, including ImageNet and medical imaging datasets. (ML: 0.90)👍👎
  • The use of complex-valued filters in the Fourier domain allows for efficient computation and improved performance compared to traditional convolutional neural networks. (ML: 0.87)👍👎
  • SONIC: A neural network architecture that uses complex-valued filters in the Fourier domain for image classification tasks. (ML: 0.87)👍👎
  • The authors propose a novel way to learn complex-valued filters using a combination of real and imaginary parts, allowing for efficient computation and improved performance. (ML: 0.84)👍👎
  • The paper introduces a new neural network architecture called SONIC, which uses complex-valued filters in the Fourier domain. (ML: 0.83)👍👎
  • The proposed SONIC architecture achieves state-of-the-art results on several benchmarks, demonstrating its effectiveness for image classification tasks. (ML: 0.80)👍👎
  • SonicBlock: A building block of the SONIC architecture, consisting of a group normalization layer, a GELU activation function, and a residual spectral convolutional mapping. (ML: 0.80)👍👎
  • SynthShape: A dataset of synthetic images with geometric primitives at random positions and scales, used to evaluate model robustness. (ML: 0.76)👍👎
  • The paper assumes that the input images are represented as complex-valued tensors, which may not be feasible for all applications. (ML: 0.73)👍👎
  • The authors provide a detailed implementation guide and discuss practical considerations for designing SONIC-based models. (ML: 0.70)👍👎
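The spectral filtering SONIC builds on rests on the convolution theorem: convolution in the spatial domain is pointwise multiplication in the Fourier domain. Here is a minimal sketch of that underlying operation, not the SONIC parameterisation itself.

```python
import numpy as np

def spectral_conv(image, kernel):
    """Circular 2-D convolution computed in the Fourier domain.

    By the convolution theorem, FFT(image) * FFT(kernel) is the FFT of
    their circular convolution, giving a global receptive field at
    O(N log N) cost regardless of kernel size.
    """
    F_img = np.fft.fft2(image)
    F_ker = np.fft.fft2(kernel, s=image.shape)  # zero-pad kernel to image size
    return np.real(np.fft.ifft2(F_img * F_ker))

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))
# Identity kernel: a single 1 at the origin leaves the image unchanged.
kernel = np.zeros((3, 3))
kernel[0, 0] = 1.0
out = spectral_conv(image, kernel)
```

Because the filter lives in the full frequency domain, its receptive field is global and it adapts naturally across resolutions, which is the property the abstract emphasizes.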
Abstract
Convolutional Neural Networks (CNNs) rely on fixed-size kernels scanning local patches, which limits their ability to capture global context or long-range dependencies without very deep architectures. Vision Transformers (ViTs), in turn, provide global connectivity but lack spatial inductive bias, depend on explicit positional encodings, and remain tied to the initial patch size. Bridging these limitations requires a representation that is both structured and global. We introduce SONIC (Spectral Oriented Neural Invariant Convolutions), a continuous spectral parameterisation that models convolutional operators using a small set of shared, orientation-selective components. These components define smooth responses across the full frequency domain, yielding global receptive fields and filters that adapt naturally across resolutions. Across synthetic benchmarks, large-scale image classification, and 3D medical datasets, SONIC shows improved robustness to geometric transformations, noise, and resolution shifts, and matches or exceeds convolutional, attention-based, and prior spectral architectures with an order of magnitude fewer parameters. These results demonstrate that continuous, orientation-aware spectral parameterisations provide a principled and scalable alternative to conventional spatial and spectral operators.
Why are we recommending this paper?
Due to your interest in convolution

This paper addresses the limitations of standard CNNs in capturing global context, a crucial element for your interest in convolution and image processing. The use of spectral oriented convolutions provides a novel approach to long-range dependencies.
Universitat Pompeu Fabra
Rate paper: 👍 👎 ♥ Save
Paper visualization
Rate image: 👍 👎
AI Insights
  • Information imbalance: a measure of how similar or dissimilar the intermediate representational spaces are across different models. (ML: 0.99)👍👎
  • Imagine you're trying to understand how different people solve math problems. (ML: 0.99)👍👎
  • This suggests that while there may be shared processing steps, they are not universal and can vary significantly between models. (ML: 0.98)👍👎
  • Further research is needed to understand how these findings relate to robustness, fairness, and safety in real-world applications. (ML: 0.98)👍👎
  • You might think that everyone would use similar steps, like adding and subtracting numbers. (ML: 0.98)👍👎
  • That's kind of what this study is looking at, but with vision models instead of math problems. (ML: 0.98)👍👎
  • The results show that there is a convergence in information imbalance between layers within each model, but not necessarily between models. (ML: 0.97)👍👎
  • But what if some people used completely different methods? (ML: 0.97)👍👎
  • The study investigates the similarity of intermediate processing steps across different vision models, with a focus on understanding how high-performing vision systems process visual information. (ML: 0.97)👍👎
  • The study is primarily empirical and descriptive, and does not establish that any observed convergence in processing is optimal or necessary. (ML: 0.97)👍👎
  • This study contributes to a more detailed understanding of how high-performing vision systems process visual information. (ML: 0.95)👍👎
  • The study examines whether the convergence observed in representations of vision models at specific depths is also reflected in the intermediate processing steps that lead to those representations. (ML: 0.94)👍👎
  • Previous studies have shown that vision models can be highly effective at recognizing objects and scenes, but the underlying processing steps are not well understood. (ML: 0.94)👍👎
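The "information imbalance" measure named in the first insight can be sketched directly. The sketch below follows the commonly used neighbor-rank definition (the rank, in space B, of each point's nearest neighbor in space A); the exact variant the paper uses may differ.

```python
import numpy as np

def information_imbalance(X_a, X_b):
    """Delta(A -> B): for each point, find its nearest neighbor under
    representation A, then look up that neighbor's distance rank under
    representation B. Averages near 0 when B preserves A's neighborhoods,
    near 1 when B carries no information about them."""
    n = len(X_a)
    d_a = np.linalg.norm(X_a[:, None] - X_a[None, :], axis=-1)
    d_b = np.linalg.norm(X_b[:, None] - X_b[None, :], axis=-1)
    np.fill_diagonal(d_a, np.inf)   # exclude self-distances
    np.fill_diagonal(d_b, np.inf)
    nn_a = d_a.argmin(axis=1)                          # nearest neighbor in A
    ranks_b = d_b.argsort(axis=1).argsort(axis=1) + 1  # 1 = closest in B
    return 2.0 / n * ranks_b[np.arange(n), nn_a].mean()

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))
same = information_imbalance(X, X)                                # ~0
noise = information_imbalance(X, rng.standard_normal((200, 16)))  # ~1
```

Comparing this quantity across layers of two models is one way to ask whether their intermediate processing steps are similar, which is the question the study pursues.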
Abstract
Recent literature suggests that the bigger the model, the more likely it is to converge to similar, "universal" representations, despite different training objectives, datasets, or modalities. While this literature shows that there is an area where model representations are similar, we study here how vision models might get to those representations -- in particular, do they also converge to the same intermediate steps and operations? We therefore study the processes that lead to convergent representations in different models. First, we quantify distance between different model representations at different stages. We follow the evolution of distances between models throughout processing, identifying the processing steps which are most different between models. We find that while layers at similar positions in different models have the most similar representations, strong differences remain. Classifier models, unlike the others, discard information about low-level image statistics in their final layers. CNN- and transformer-based models also behave differently, with transformer models applying smoother changes to representations from one layer to the next. These distinctions clarify the level and nature of convergence between model representations, and enable a more qualitative account of the underlying processes in image models.
Why are we recommending this paper?
Due to your interest in Image Recognition

The investigation into universal representations in vision models aligns directly with your interest in understanding the underlying processing steps within models. This paper’s focus on model representation similarity is a valuable area of exploration.
University of Southern California
Rate paper: 👍 👎 ♥ Save
AI Insights
  • Imagine you have a big box of LEGOs, and each LEGO brick represents a piece of information. (ML: 0.97)👍👎
  • The authors also mention that their work is related to other machine learning techniques, such as deep learning and neural networks. (ML: 0.94)👍👎
  • Hyperdimensional Computing (HDC): A machine learning technique that uses high-dimensional vectors to represent data points and perform computations. (ML: 0.89)👍👎
  • Hyperdimensional computing is like building a giant LEGO structure using all these bricks to represent complex data points. (ML: 0.86)👍👎
  • The authors demonstrate the effectiveness of Laplace HDC on several benchmark datasets, including MNIST and Fashion-MNIST. (ML: 0.83)👍👎
  • The paper assumes that the reader is familiar with HDC and its applications, which may limit its accessibility to non-experts in the field. (ML: 0.82)👍👎
  • However, the paper does not provide a comprehensive review of the literature on HDC or its applications. (ML: 0.79)👍👎
  • Additionally, the authors do not provide a detailed explanation of the Laplace HDC algorithm, which may make it difficult for readers to implement the method on their own. (ML: 0.79)👍👎
  • The paper presents a new way of building this LEGO structure called Laplace HDC, which is more efficient and accurate than the traditional method. (ML: 0.75)👍👎
  • The paper presents a novel approach to hyperdimensional computing (HDC) using Laplace HDC, which is a geometric interpretation of binary HDC. (ML: 0.72)👍👎
  • The authors also provide a geometric interpretation of binary HDC, which is useful for understanding the underlying principles of HDC. (ML: 0.70)👍👎
  • The results show that Laplace HDC outperforms traditional HDC methods in terms of accuracy and efficiency. (ML: 0.69)👍👎
  • The paper cites several previous works on HDC and its applications, including [1], [2], and [3]. (ML: 0.63)👍👎
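The core HDC operations named in the abstract (binding, bundling, similarity search) are simple enough to sketch with bipolar hypervectors. This is a generic illustration, not the paper's encoder or Laplace HDC; the role/filler names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # hypervector dimensionality

def random_hv():
    """Random bipolar hypervector in {-1, +1}^D."""
    return rng.choice([-1, 1], size=D)

def bind(a, b):
    """Binding: elementwise multiply (self-inverse for bipolar vectors)."""
    return a * b

def bundle(*hvs):
    """Bundling: elementwise majority vote (ties become 0 here)."""
    return np.sign(np.sum(hvs, axis=0))

def similarity(a, b):
    """Normalized dot product in [-1, 1]."""
    return a @ b / D

# Bind role/filler pairs, then bundle them into one composite memory.
color, shape = random_hv(), random_hv()
red, circle = random_hv(), random_hv()
memory = bundle(bind(color, red), bind(shape, circle))
# Unbinding with the role vector recovers something close to the filler.
recovered = bind(memory, color)
```

`recovered` is noticeably similar to `red` and nearly orthogonal to `circle`, which is what makes these lightweight operations noise-tolerant and attractive to accelerate in hardware.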
Abstract
Hyperdimensional Computing (HDC) represents data using extremely high-dimensional, low-precision vectors, termed hypervectors (HVs), and performs learning and inference through lightweight, noise-tolerant operations. However, the high dimensionality, sparsity, and repeated data movement involved in HDC make these computations difficult to accelerate efficiently on conventional processors. As a result, executing core HDC operations (binding, permutation, bundling, and similarity search) on CPUs or GPUs often leads to suboptimal utilization, memory bottlenecks, and limits on real-time performance. In this paper, our contributions are two-fold. First, we develop an image-encoding algorithm that, similar in spirit to convolutional neural networks, maps local image patches to hypervectors enriched with spatial information. These patch-level hypervectors are then merged into a global representation using the fundamental HDC operations, enabling spatially sensitive and robust image encoding. This encoder achieves 95.67% accuracy on MNIST and 85.14% on Fashion-MNIST, outperforming prior HDC-based image encoders. Second, we design an end-to-end accelerator that implements these compute operations on an FPGA through a pipelined architecture that exploits parallelism both across the hypervector dimensionality and across the set of image patches. Our Alveo U280 implementation delivers 0.09ms inference latency, achieving up to 1300x and 60x speedup over state-of-the-art CPU and GPU baselines, respectively.
Why are we recommending this paper?
Due to your interest in Image Recognition

This paper explores Hyperdimensional Computing, a technique relevant to your interests in convolution and image recognition. The focus on real-time image classification using HDC is a potentially impactful application of this technology.
Meta Reality Labs and University of Illinois Urbana-Champaign
Rate paper: 👍 👎 ♥ Save
AI Insights
  • External knowledge sources: Sources of information outside the image or text data, such as search engines or Wikipedia. (ML: 0.97)👍👎
  • The paper discusses the limitations of current VQA approaches and highlights the importance of incorporating external knowledge sources in multimodal VQA. (ML: 0.96)👍👎
  • Large language models (LLMs): Pre-trained models that can be fine-tuned for various natural language processing tasks. (ML: 0.95)👍👎
  • The authors also introduce a new dataset, called Visual Entity Recognition (VER), which consists of millions of Wikipedia entities. (ML: 0.93)👍👎
  • Search-R1: a method that uses reinforcement learning to train LLMs to reason and leverage search engines. (ML: 0.91)👍👎
  • The paper proposes a novel approach to multimodal visual question answering (VQA) by leveraging large language models (LLMs) and external knowledge sources. (ML: 0.91)👍👎
  • Visual entity recognition (VER): A task that involves recognizing entities in images, such as objects, people, or locations. (ML: 0.91)👍👎
  • Multimodal visual question answering (VQA): A task that involves answering questions about images by combining information from multiple modalities, such as text and vision. (ML: 0.89)👍👎
  • The paper presents a novel approach to multimodal VQA by leveraging LLMs and external knowledge sources. (ML: 0.86)👍👎
  • Search-R1 achieves state-of-the-art results on several VQA benchmarks, including the Visual Genome dataset. (ML: 0.86)👍👎
Abstract
Visual Question Answering (VQA) often requires coupling fine-grained perception with factual knowledge beyond the input image. Prior multimodal Retrieval-Augmented Generation (MM-RAG) systems improve factual grounding but lack an internal policy for when and how to retrieve. We propose PixSearch, the first end-to-end Segmenting Large Multimodal Model (LMM) that unifies region-level perception and retrieval-augmented reasoning. During encoding, PixSearch emits tokens to trigger retrieval, selects query modalities (text, image, or region), and generates pixel-level masks that directly serve as visual queries, eliminating the reliance on modular pipelines (detectors, segmenters, captioners, etc.). A two-stage supervised fine-tuning regimen with search-interleaved supervision teaches retrieval timing and query selection while preserving segmentation ability. On egocentric and entity-centric VQA benchmarks, PixSearch substantially improves factual consistency and generalization, yielding a 19.7% relative gain in accuracy on CRAG-MM compared to whole image retrieval, while retaining competitive reasoning performance on various VQA and text-only QA tasks.
Why are we recommending this paper?
Due to your interest in multimodal models
Lawrence Livermore National Laboratory
Rate paper: 👍 👎 ♥ Save
Paper visualization
Rate image: 👍 👎
AI Insights
  • Task-specific models outperform the unified JointDiff model on individual tasks, but the unified model performs best when all tasks are considered together, indicating room for further improvement. (ML: 0.97)👍👎
  • The percentage of true values captured within two predicted standard deviations (%2σ) measures the calibration of the model's uncertainty estimates. (ML: 0.96)👍👎
  • Previous studies have shown that task-specific models can perform better than unified frameworks on specific tasks. (ML: 0.96)👍👎
  • The diffusion objective may not be as effective in certain scenarios or with different types of data. (ML: 0.96)👍👎
  • R² (R-squared) is a measure of how well a regression model fits the data; it ranges from 0 to 1, with higher values indicating a better fit. (ML: 0.95)👍👎
  • The diffusion objective is a crucial component of the joint model, significantly improving performance on both forward and inverse tasks. (ML: 0.92)👍👎
  • The unified JointDiff model outperforms task-specific models when considering all tasks together, demonstrating its potential for real-world applications in rocket-piston simulations. (ML: 0.83)👍👎
  • The JointDiff model is a unified framework for both forward and inverse modeling tasks, achieving state-of-the-art performance in predicting rocket-piston simulations. (ML: 0.77)👍👎
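The two evaluation quantities defined in these insights, R² and two-sigma coverage, can be stated in a few lines. The toy predictions below are made up for illustration; they are not the paper's data.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot; 1 indicates a perfect fit."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def two_sigma_coverage(y_true, y_pred, y_std):
    """Fraction of true values inside the predicted mean +/- 2 sigma band."""
    return np.mean(np.abs(y_true - y_pred) <= 2.0 * y_std)

rng = np.random.default_rng(0)
y_true = rng.standard_normal(10_000)
y_pred = y_true + 0.1 * rng.standard_normal(10_000)  # small Gaussian error
y_std = np.full(10_000, 0.1)                         # honest uncertainty
r2 = r_squared(y_true, y_pred)
coverage = two_sigma_coverage(y_true, y_pred, y_std)
```

For well-calibrated Gaussian errors, the two-sigma coverage sits near 95%, which is why the metric is a useful check on predicted uncertainties alongside R².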
Abstract
A combination of physics-based simulation and experiments has been critical to achieving ignition in inertial confinement fusion (ICF). Simulation and experiment both produce a mixture of scalar and image outputs; however, only a subset of simulated data is available experimentally. We introduce a generative framework, called JointDiff, which enables predictions of conditional simulation input and output distributions from partial, multi-modal observations. The model leverages joint diffusion to unify forward surrogate modeling, inverse inference, and output imputation into one architecture. We train our model on a large ensemble of three-dimensional Multi-Rocket Piston simulations and demonstrate high accuracy, statistical robustness, and transferability to experiments performed at the National Ignition Facility (NIF). This work establishes JointDiff as a flexible generative surrogate for multi-modal scientific tasks, with implications for understanding diagnostic constraints, aligning simulation to experiment, and accelerating ICF design.
Why are we recommending this paper?
Due to your interest in fusion models
Oregon Health and Science University
Rate paper: 👍 👎 ♥ Save
AI Insights
  • Within-group dispersions: measures of the spread of points within each group. (ML: 0.95)👍👎
  • Geometric displacement: the difference between the centroids of two groups. (ML: 0.89)👍👎
  • The CFD satisfies several desirable properties, including non-negativity, zero case, monotonicity in geometric displacement, and limit cases. (ML: 0.84)👍👎
  • Cross-Fusion Score (CFS): a measure used in the definition of CFD, defined as (w_A σ²_A + w_B σ²_B) / σ²_AB. (ML: 0.77)👍👎
  • CFD is based on the concept of fusion, where two groups are combined into a single cloud with a centroid and dispersion. (ML: 0.71)👍👎
  • A comprehensive suite of synthetic experiments was conducted to validate the theoretical properties of CFD under controlled conditions. (ML: 0.70)👍👎
  • Cross-Fusion Distance (CFD): a novel distance metric for evaluating separability of two point clouds in high-dimensional spaces. (ML: 0.58)👍👎
  • CFD is shown to be robust to global scaling and topological deformation, while remaining sensitive to geometric displacements and dispersion variations. (ML: 0.57)👍👎
  • Fused dispersion: a measure of the spread of points in the combined cloud. (ML: 0.57)👍👎
  • The Cross-Fusion Distance (CFD) is a novel distance metric designed for evaluating the separability of two point clouds in high-dimensional spaces. (ML: 0.54)👍👎
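The Cross-Fusion Score ratio quoted above can be sketched directly, taking within-group dispersion as mean squared distance to the group centroid and weights as group-size fractions (both assumptions not spelled out in the insights). By the law of total variance, the score is 1 when the centroids coincide and drops as the groups separate.

```python
import numpy as np

def cross_fusion_score(A, B):
    """CFS = (w_A s2_A + w_B s2_B) / s2_AB, following the ratio in the
    insights above. s2 is mean squared distance to the group centroid;
    w_A, w_B are group-size fractions (an assumed weighting)."""
    def s2(X):
        return np.mean(np.sum((X - X.mean(axis=0)) ** 2, axis=1))
    n_a, n_b = len(A), len(B)
    w_a, w_b = n_a / (n_a + n_b), n_b / (n_a + n_b)
    fused = np.concatenate([A, B])
    return (w_a * s2(A) + w_b * s2(B)) / s2(fused)

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 8))
B_near = rng.standard_normal((500, 8))        # same distribution: fused
B_far = rng.standard_normal((500, 8)) + 5.0   # displaced centroid: separable
fused_score = cross_fusion_score(A, B_near)   # close to 1
sep_score = cross_fusion_score(A, B_far)      # well below 1
```

How the paper turns this score into the final CFD distance is not stated in the insights; the sketch only illustrates why the ratio responds to geometric displacement while being invariant to global scaling.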
Abstract
Quantifying degrees of fusion and separability between data groups in representation space is a fundamental problem in representation learning, particularly under domain shift. A meaningful metric should capture fusion-altering factors like geometric displacement between representation groups, whose variations change the extent of fusion, while remaining invariant to fusion-preserving factors such as global scaling and sampling-induced layout changes, whose variations do not. Existing distributional distance metrics conflate these factors, leading to measures that are not informative of the true extent of fusion between data groups. We introduce Cross-Fusion Distance (CFD), a principled measure that isolates fusion-altering geometry while remaining robust to fusion-preserving variations, with linear computational complexity. We characterize the invariance and sensitivity properties of CFD theoretically and validate them in controlled synthetic experiments. For practical utility on real-world datasets with domain shift, CFD aligns more closely with downstream generalization degradation than commonly used alternatives. Overall, CFD provides a theoretically grounded and interpretable distance measure for representation learning.
Why are we recommending this paper?
Due to your interest in fusion models