Dear user, this week we added the possibility to further personalize your results by providing a personal description of yourself.
Log in to our website and head to the Profile tab. There you can provide any details you like, such as your profession, age, or background. This information is then taken into account by the language models to generate recommendations tailored to you.
🎯 Top Personalized Recommendations
University of Technology
Why we think this paper is great for you:
This paper directly addresses multimodal models and information fusion, which are central themes in your research focus, especially concerning visual and textual data.
Abstract
Recent advances in multimodal recommendation (MMR) have shown that
incorporating rich content sources such as images and text can lead to
significant gains in representation quality. However, existing methods often rely
on coarse visual features and uncontrolled fusion, leading to redundant or
misaligned representations. As a result, visual encoders often fail to capture
salient, item-relevant semantics, limiting their contribution in multimodal
fusion. From an information-theoretic perspective, effective fusion should
balance the unique, shared, and redundant information across modalities,
preserving complementary cues while avoiding correlation bias. This paper
presents VLIF, a vision-language and information-theoretic fusion framework
that enhances multimodal recommendation through two key components. (i) A
VLM-based visual enrichment module generates fine-grained, title-guided
descriptions to transform product images into semantically aligned
representations. (ii) An information-aware fusion module, inspired by Partial
Information Decomposition (PID), disentangles redundant and synergistic signals
across modalities for controlled integration. Experiments on three Amazon
datasets demonstrate that VLIF consistently outperforms recent multimodal
baselines and substantially strengthens the contribution of visual features.
AI Summary
- Employing an information-theoretic fusion module, inspired by Partial Information Decomposition (PID), effectively disentangles redundant and synergistic signals across modalities, leading to controlled and enhanced multimodal representation learning. [3]
- The proposed framework significantly strengthens the contribution of the visual modality in multimodal recommendation, addressing the common limitation where visual features have limited impact compared to textual signals. [3]
- Incorporating InfoNCE loss for both synergy and redundancy estimation during optimization helps regulate modality interactions, promoting alignment of synergistic information and maximizing mutual information for redundant components. [3]
- Ablation studies confirm that both VLM-based visual enrichment and the information-aware fusion module are indispensable for achieving superior recommendation performance, with task-specific VLM guidance being particularly effective. [3]
- The framework demonstrates general applicability across different VLM backbones and consistently outperforms strong multimodal baselines, indicating its robustness and effectiveness in diverse e-commerce scenarios. [3]
- VLIF (Vision-Language and Information-theoretic Fusion): A framework that enhances multimodal recommendation by using VLMs for visual enrichment and an information-aware fusion module inspired by Partial Information Decomposition. [3]
- Leveraging VLMs with task-specific, title-guided prompting is crucial for transforming raw product images into semantically aligned and fine-grained visual representations, mitigating issues of marketing-optimized images and generic VLM outputs. [2]
- Orthogonal projection can be used to explicitly remove redundant information from unimodal representations, yielding unique and redundancy-free features that contribute distinctively to the fused embedding (see the sketch after this list). [2]
- VLM-based Visual Enrichment Module: A component that utilizes Vision-Language Models, guided by item titles and Chain-of-Thought prompting, to generate fine-grained, semantically aligned textual descriptions from product images. [2]
- Information-Aware Fusion Module: A module inspired by Partial Information Decomposition (PID) that disentangles redundant, synergistic, and unique information across modalities for controlled and effective multimodal representation integration. [2]
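To make the orthogonal-projection idea concrete, here is a minimal PyTorch sketch, independent of VLIF's actual implementation: all tensor names, shapes, and the per-sample "shared direction" are our own illustrative assumptions. It removes the component of a unimodal embedding that lies along an estimated redundant (shared) direction, leaving a residual orthogonal to it.

```python
import torch
import torch.nn.functional as F

def remove_redundant(unimodal: torch.Tensor, shared: torch.Tensor) -> torch.Tensor:
    """Project the shared (redundant) component out of a unimodal embedding.

    unimodal: (batch, dim) modality-specific embeddings (e.g. visual or textual).
    shared:   (batch, dim) an estimate of the cross-modal redundant component.
    Returns the part of `unimodal` orthogonal to `shared`, per sample.
    """
    direction = F.normalize(shared, dim=-1)                   # unit redundant direction
    coeff = (unimodal * direction).sum(dim=-1, keepdim=True)  # projection length
    return unimodal - coeff * direction                       # redundancy-free residual

# Toy usage with random embeddings (purely illustrative).
visual = torch.randn(4, 256)
redundant = torch.randn(4, 256)
unique_visual = remove_redundant(visual, redundant)
# The residual is orthogonal to the shared direction (up to numerical error).
print((unique_visual * F.normalize(redundant, dim=-1)).sum(dim=-1))
```

In a full PID-inspired fusion module, such residuals would be combined with explicitly modelled redundant and synergistic terms; this snippet covers only the projection step.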
Beijing University of
Why we think this paper is great for you:
You will find this paper relevant for its exploration of advanced fusion techniques for content and style in image generation, aligning with your interest in combining different data aspects.
Abstract
Recent advancements in text-to-image diffusion models have significantly
improved the personalization and stylization of generated images. However,
previous studies have only assessed content similarity under a single style
intensity. In our experiments, we observe that increasing style intensity leads
to a significant loss of content features, resulting in a suboptimal
content-style frontier. To address this, we propose a novel approach to expand
the content-style frontier by leveraging Content-Style Subspace Blending and a
Content-Style Balance loss. Our method improves content similarity across
varying style intensities, significantly broadening the content-style frontier.
Extensive experiments demonstrate that our approach outperforms existing
techniques in both qualitative and quantitative evaluations, achieving superior
content-style trade-off with significantly lower Inverted Generational Distance
(IGD) and Generational Distance (GD) scores compared to current methods.
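Generational Distance (GD) and Inverted Generational Distance (IGD), used above to score the content-style trade-off, are standard measures of how closely one point set approximates a reference front. Below is a minimal NumPy sketch of one common formulation (mean nearest-neighbour distance); the 2-D points are made up and merely stand in for (style intensity, content similarity) pairs.

```python
import numpy as np

def generational_distance(approx: np.ndarray, reference: np.ndarray) -> float:
    """Mean distance from each approximation point to its nearest reference point."""
    dists = np.linalg.norm(approx[:, None, :] - reference[None, :, :], axis=-1)
    return float(dists.min(axis=1).mean())

def inverted_generational_distance(approx: np.ndarray, reference: np.ndarray) -> float:
    """Mean distance from each reference point to its nearest approximation point."""
    return generational_distance(reference, approx)

# Toy 2-D fronts (illustrative only): lower scores mean a closer approximation.
reference_front = np.array([[0.1, 0.9], [0.5, 0.7], [0.9, 0.4]])
method_front = np.array([[0.15, 0.85], [0.55, 0.65], [0.95, 0.35]])
print(generational_distance(method_front, reference_front))
print(inverted_generational_distance(method_front, reference_front))
```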
Technical University of M
Why we think this paper is great for you:
This work combines convolutional networks with vision transformers for image classification, directly aligning with your interest in both convolution and image recognition architectures.
Abstract
Hybrids of Convolutional Neural Network (CNN) and Vision Transformer (ViT)
have outperformed pure CNN or ViT architectures. However, since these
architectures require large parameters and incur large computational costs,
they are unsuitable for tinyML deployment. This paper introduces a new hybrid
CNN-ViT search space for Neural Architecture Search (NAS) to find efficient
hybrid architectures for image classification. The search space covers hybrid
CNN and ViT blocks to learn local and global information, as well as the novel
Pooling block of searchable pooling layers for efficient feature map reduction.
Experimental results on the CIFAR10 dataset show that our proposed search space
can produce hybrid CNN-ViT architectures with superior accuracy and inference
speed to ResNet-based tinyML models under tight model size constraints.
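As a rough picture of what a hybrid CNN-ViT block in such a search space might contain, here is a hypothetical PyTorch module (channel width, head count, and layer order are our assumptions, not a searched architecture): a convolutional stage for local features followed by a transformer encoder layer over flattened tokens for global context.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Illustrative CNN + ViT block: local convolution, then global self-attention."""

    def __init__(self, channels: int = 64, heads: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.attn = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, dim_feedforward=2 * channels, batch_first=True
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)                       # (B, C, H, W) local features
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        tokens = self.attn(tokens)             # global mixing via self-attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)

# Toy forward pass on a CIFAR10-scale feature map.
print(HybridBlock()(torch.randn(2, 64, 8, 8)).shape)  # torch.Size([2, 64, 8, 8])
```

A NAS search space would expose choices such as kernel size, expansion ratio, head count, and whether to include the attention stage at all; this block fixes just one such configuration.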
International Institute
Why we think this paper is great for you:
This paper focuses on leveraging large multimodal models for visual question answering, which is highly relevant to your work with multimodal data and image understanding.
Abstract
Creation of large-scale databases for Visual Question Answering tasks
pertaining to the text data in a scene (text-VQA) involves skilful human
annotation, which is tedious and challenging. With the advent of foundation
models that handle vision and language modalities, and with the maturity of OCR
systems, it is the need of the hour to establish an end-to-end pipeline that
can synthesize Question-Answer (QA) pairs based on scene-text from a given
image. We propose a pipeline for the automated synthesis of a text-VQA dataset that
can produce faithful QA pairs and that scales with the availability of
scene-text data. Our proposed method harnesses the capabilities of multiple
models and algorithms involving OCR detection and recognition (text spotting),
region of interest (ROI) detection, caption generation, and question
generation. These components are streamlined into a cohesive pipeline to
automate the synthesis and validation of QA pairs. To the best of our
knowledge, this is the first pipeline proposed to automatically synthesize and
validate a large-scale text-VQA dataset comprising around 72K QA pairs based on
around 44K images.
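The pipeline stages named in the abstract compose naturally in code. The skeleton below is purely hypothetical Python scaffolding: every stage callable (spotter, roi_detector, captioner, question_generator, validator) is a placeholder we invented to show how the stages chain together, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    image_id: str

def synthesize_text_vqa(image, image_id, spotter, roi_detector,
                        captioner, question_generator, validator):
    """Chain the stages described in the abstract into QA pairs for one image.

    Assumed placeholder interfaces: spotter(image) -> list of (text, box) detections,
    roi_detector(image, detections) -> regions of interest, captioner(image, rois) -> str,
    question_generator(caption, text) -> list of questions, validator(image, pair) -> bool.
    """
    detections = spotter(image)                 # OCR: detect and recognize scene text
    rois = roi_detector(image, detections)      # regions of interest around the text
    caption = captioner(image, rois)            # scene context for question generation
    pairs = []
    for text, _box in detections:
        for question in question_generator(caption, text):
            candidate = QAPair(question=question, answer=text, image_id=image_id)
            if validator(image, candidate):     # keep only faithful QA pairs
                pairs.append(candidate)
    return pairs
```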
Technical University of M
Why we think this paper is great for you:
This paper applies advanced computer vision techniques to medical image classification, offering insights into image recognition and processing methods you might find valuable.
Abstract
Covariance descriptors capture second-order statistics of image features.
They have shown strong performance in general computer vision tasks, but remain
underexplored in medical imaging. We investigate their effectiveness for both
conventional and learning-based medical image classification, with a particular
focus on SPDNet, a classification network specifically designed for symmetric
positive definite (SPD) matrices. We propose constructing covariance
descriptors from features extracted by pre-trained general vision encoders
(GVEs) and comparing them with handcrafted descriptors. Two GVEs - DINOv2 and
MedSAM - are evaluated across eleven binary and multi-class datasets from the
MedMNIST benchmark. Our results show that covariance descriptors derived from
GVE features consistently outperform those derived from handcrafted features.
Moreover, SPDNet yields superior performance to state-of-the-art methods when
combined with DINOv2 features. Our findings highlight the potential of
combining covariance descriptors with powerful pretrained vision encoders for
medical image analysis.
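A covariance descriptor itself is straightforward to compute: treat each spatial location's feature vector as one observation, take the sample covariance across locations, and add a small multiple of the identity so the matrix is strictly positive definite, as SPD-based models such as SPDNet require. A minimal NumPy sketch with an arbitrarily chosen feature-map shape:

```python
import numpy as np

def covariance_descriptor(features: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Covariance descriptor of a feature map.

    features: (H, W, D) array, e.g. patch features from a pretrained encoder.
    Returns a (D, D) symmetric positive definite matrix.
    """
    x = features.reshape(-1, features.shape[-1])   # (H*W, D) observations
    x = x - x.mean(axis=0, keepdims=True)          # center each feature dimension
    cov = (x.T @ x) / max(x.shape[0] - 1, 1)       # sample covariance
    return cov + eps * np.eye(cov.shape[0])        # regularize to guarantee SPD

# Toy feature map standing in for encoder output (e.g. DINOv2 patch tokens).
desc = covariance_descriptor(np.random.randn(14, 14, 32))
print(desc.shape, bool(np.all(np.linalg.eigvalsh(desc) > 0)))
```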
University of California
Why we think this paper is great for you:
You will find this paper interesting as it delves into enhancing medical image segmentation, a key area within image processing that aligns with your expertise.
Abstract
Medical image segmentation has been significantly advanced by deep learning
architectures, notably U-Net variants. However, existing models struggle to
achieve efficient global context modeling and long-range dependency reasoning
under practical computational budgets simultaneously. In this work, we propose
a novel hybrid architecture utilizing U-Mamba with Heat Conduction Equation.
Our model combines Mamba-based state-space modules for efficient long-range
reasoning with Heat Conduction Operators (HCOs) in the bottleneck layers,
simulating frequency-domain thermal diffusion for enhanced semantic
abstraction. Experimental results on multimodal abdominal CT and MRI datasets
demonstrate that the proposed model consistently outperforms strong baselines,
validating its effectiveness and generalizability. This suggests that blending
state-space dynamics with heat-based global diffusion offers a scalable and
interpretable solution for medical segmentation tasks.
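The heat-conduction idea can be sketched generically (this is textbook spectral heat diffusion, not necessarily the paper's HCO formulation): transform a feature map to the frequency domain, decay each mode by the heat kernel exp(-t|k|^2), and transform back, which smooths features with a globally coupled, interpretable operator.

```python
import torch

def heat_diffuse(x: torch.Tensor, t: float = 0.1) -> torch.Tensor:
    """Frequency-domain heat diffusion of a feature map.

    x: (B, C, H, W) real-valued features; t controls diffusion strength.
    Solves u_t = laplacian(u) spectrally: each Fourier mode decays as exp(-t * |k|^2).
    """
    b, c, h, w = x.shape
    freq = torch.fft.rfft2(x, norm="ortho")                        # (B, C, H, W//2 + 1)
    ky = torch.fft.fftfreq(h, device=x.device) * 2 * torch.pi      # vertical frequencies
    kx = torch.fft.rfftfreq(w, device=x.device) * 2 * torch.pi     # horizontal frequencies
    decay = torch.exp(-t * (ky[:, None] ** 2 + kx[None, :] ** 2))  # heat kernel per mode
    return torch.fft.irfft2(freq * decay, s=(h, w), norm="ortho")

# Toy usage: smooth a random bottleneck feature map.
print(heat_diffuse(torch.randn(1, 8, 32, 32), t=0.5).shape)  # torch.Size([1, 8, 32, 32])
```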
Chulalongkorn University
Why we think this paper is great for you:
This work on satellite image inpainting using diffusion models presents a strong match for your interest in advanced image processing and analysis techniques.
Abstract
Satellite image inpainting is a crucial task in remote sensing, where
accurately restoring missing or occluded regions is essential for robust image
analysis. In this paper, we propose KAO, a novel framework that utilizes
Kernel-Adaptive Optimization within diffusion models for satellite image
inpainting. KAO is specifically designed to address the challenges posed by
very high-resolution (VHR) satellite datasets, such as DeepGlobe and the
Massachusetts Roads Dataset. Unlike existing methods that rely on
preconditioned models requiring extensive retraining or postconditioned models
with significant computational overhead, KAO introduces a Latent Space
Conditioning approach, optimizing a compact latent space to achieve efficient
and accurate inpainting. Furthermore, we incorporate Explicit Propagation into
the diffusion process, facilitating forward-backward fusion, which improves the
stability and precision of the method. Experimental results demonstrate that
KAO sets a new benchmark for VHR satellite image restoration, providing a
scalable, high-performance solution that balances the efficiency of
preconditioned models with the flexibility of postconditioned models.
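For background on how diffusion models are commonly conditioned for inpainting in general (this is a generic RePaint-style blending step, not KAO's Kernel-Adaptive Optimization or its Latent Space Conditioning), one standard trick is to re-noise the known region to the current timestep and paste it over the model's prediction, so observed content stays fixed while masked regions are synthesized:

```python
import torch

def blend_known_region(x_pred: torch.Tensor, x_known: torch.Tensor,
                       mask: torch.Tensor, alpha_bar_t: float) -> torch.Tensor:
    """One masked-conditioning step of diffusion inpainting (generic, illustrative).

    x_pred:      (B, C, H, W) sample produced by the current reverse-diffusion step.
    x_known:     (B, C, H, W) clean observed content (image or latent).
    mask:        (B, 1, H, W) 1 where content is known, 0 where it must be inpainted.
    alpha_bar_t: cumulative noise-schedule value at the current timestep.
    """
    noise = torch.randn_like(x_known)
    # Re-noise the known content so its noise level matches x_pred at this timestep.
    x_known_t = (alpha_bar_t ** 0.5) * x_known + ((1 - alpha_bar_t) ** 0.5) * noise
    # Keep observed regions from the (re-noised) observation, masked regions from the model.
    return mask * x_known_t + (1 - mask) * x_pred
```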