Hi!

Your personalized paper recommendations for 01–05 December 2025.
fusion models
The Hong Kong Polytechnic University
Abstract
Integrating LiDAR and camera information in the bird's eye view (BEV) representation has demonstrated its effectiveness in 3D object detection. However, because of the fundamental disparity in geometric accuracy between these sensors, indiscriminate fusion in previous methods often leads to degraded performance. In this paper, we propose BEVDilation, a novel LiDAR-centric framework that prioritizes LiDAR information in the fusion. By formulating image BEV features as implicit guidance rather than naive concatenation, our strategy effectively alleviates the spatial misalignment caused by image depth estimation errors. Furthermore, the image guidance can effectively help the LiDAR-centric paradigm to address the sparsity and semantic limitations of point clouds. Specifically, we propose a Sparse Voxel Dilation Block that mitigates the inherent point sparsity by densifying foreground voxels through image priors. Moreover, we introduce a Semantic-Guided BEV Dilation Block to enhance the LiDAR feature diffusion processing with image semantic guidance and long-range context capture. On the challenging nuScenes benchmark, BEVDilation achieves better performance than state-of-the-art methods while maintaining competitive computational efficiency. Importantly, our LiDAR-centric strategy demonstrates greater robustness to depth noise compared to naive fusion. The source code is available at https://github.com/gwenzhang/BEVDilation.
AI Summary
  • The paper presents BEVDilation, a LiDAR-centric framework for 3D object detection that fuses LiDAR and camera information in the bird's eye view (BEV) representation. [2]
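The abstract's core move, using image BEV features as implicit guidance rather than naively concatenating them with LiDAR features, can be pictured with a small gating module. The PyTorch sketch below is a generic illustration of that idea under assumed names and shapes; it is not BEVDilation's actual Sparse Voxel Dilation or Semantic-Guided BEV Dilation block.

```python
import torch
import torch.nn as nn

class ImageGuidedBEVFusion(nn.Module):
    """Illustrative LiDAR-centric fusion: image BEV features modulate the
    LiDAR BEV features through a learned gate instead of being concatenated."""
    def __init__(self, lidar_ch, img_ch):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(img_ch, lidar_ch, kernel_size=3, padding=1),
            nn.Sigmoid(),  # per-location, per-channel guidance in [0, 1]
        )

    def forward(self, lidar_bev, img_bev):
        # LiDAR features stay primary; the image branch only rescales them,
        # so image depth errors cannot overwrite the LiDAR geometry.
        return lidar_bev + lidar_bev * self.gate(img_bev)
```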
Osaka University
Abstract
Optimum laser configurations are presented to achieve high illumination uniformity with directly driven inertial confinement fusion targets. Assuming an axisymmetric absorption pattern for individual laser beams, theoretical models are reviewed in terms of the number of laser beams, system imperfection, and laser beam patterns. Utilizing a self-organizing system of charged particles on a sphere, a simple numerical model is provided to give an optimal configuration for an arbitrary number of laser beams. As a result, new configurations such as M48 and M60 are found to show substantially higher illumination uniformity than any existing direct-drive system. A new polar direct-drive scheme is proposed with the laser axes kept off the target center, which can be applied to laser configurations designed for indirectly driven inertial fusion.
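The self-organizing system of charged particles mentioned above is essentially the Thomson problem: spread N points over a sphere by letting them repel each other until they settle into a near-uniform arrangement. A minimal NumPy sketch under that reading follows; the beam count, step size, and iteration budget are arbitrary choices for illustration, not values from the paper.

```python
import numpy as np

def beam_directions(n_beams, steps=5000, lr=0.01, seed=0):
    """Spread n_beams unit vectors over a sphere by Coulomb-like repulsion."""
    rng = np.random.default_rng(seed)
    p = rng.normal(size=(n_beams, 3))
    p /= np.linalg.norm(p, axis=1, keepdims=True)
    for _ in range(steps):
        diff = p[:, None, :] - p[None, :, :]                 # pairwise differences
        dist = np.linalg.norm(diff, axis=-1) + np.eye(n_beams)
        force = (diff / dist[..., None] ** 3).sum(axis=1)    # 1/r^2 repulsion
        p += lr * force
        p /= np.linalg.norm(p, axis=1, keepdims=True)        # project back onto the sphere
    return p

directions = beam_directions(48)   # e.g. a 48-beam arrangement
```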
multimodal models
University of Zurich
Abstract
This paper describes the UZH-CL system submitted to the FAME2026 Challenge. The challenge focuses on cross-modal verification under unique multilingual conditions, specifically unseen and unheard languages. Our approach investigates two distinct architectures, consisting of a baseline dual-encoder system trained from scratch using contrastive and orthogonal projection losses, and a foundation model approach leveraging ImageBind with LoRA. To address the data scarcity and language constraints of the challenge, we curated an external Arabic dataset from VoxBlink. Our best-performing system, ImageBind-LoRA, demonstrates remarkable cross-lingual generalization: despite being fine-tuned exclusively on Arabic audio, it achieved an EER of 24.73% on the evaluation set (English and German), securing 2nd place in the competition.
AI Summary
  • Z-score normalization: A score fusion strategy that combines scores from multiple sources using Z-scores, which are normalized values with a mean of 0 and a standard deviation of 1 (see the sketch after this summary). [3]
  • The proposed ImageBind-LoRA system achieves remarkable cross-lingual generalization by fine-tuning a multimodal foundation model on Arabic data and evaluating it on English, German, and Urdu test sets. [2]
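Z-score fusion is a standard recipe: each system's scores are standardized to zero mean and unit variance before being combined, so systems with different score ranges contribute comparably. A small NumPy sketch of that recipe follows; the equal weighting is an assumption for illustration, not necessarily the submission's configuration.

```python
import numpy as np

def zscore_fuse(score_lists, weights=None):
    """Fuse verification scores from several systems after z-normalization."""
    z = [(s - s.mean()) / s.std() for s in map(np.asarray, score_lists)]  # mean 0, std 1 per system
    weights = weights if weights is not None else [1.0 / len(z)] * len(z)
    return sum(w * s for w, s in zip(weights, z))

fused = zscore_fuse([[0.2, 0.8, 0.5], [12.0, 35.0, 20.0]])
```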
Xi'an Jiaotong-Liverpool University
Abstract
Understanding sentiment in complex textual expressions remains a fundamental challenge in affective computing. To address this, we propose a Dynamic Fusion Learning Model (DyFuLM), a multimodal framework designed to capture both hierarchical semantic representations and fine-grained emotional nuances. DyFuLM introduces two key modules: a Hierarchical Dynamic Fusion module that adaptively integrates multi-level features, and a Gated Feature Aggregation module that regulates cross-layer information flow to achieve balanced representation learning. Comprehensive experiments on multi-task sentiment datasets demonstrate that DyFuLM achieves 82.64% coarse-grained and 68.48% fine-grained accuracy, yielding the lowest regression errors (MAE = 0.0674, MSE = 0.0082) and the highest R^2 coefficient of determination (R^2 = 0.6903). Furthermore, the ablation study validates the effectiveness of each module in DyFuLM. When all modules are removed, the accuracy drops by 0.91% for coarse-grained and 0.68% for fine-grained tasks. Keeping only the gated fusion module causes decreases of 0.75% and 0.55%, while removing the dynamic loss mechanism results in drops of 0.78% and 0.26% for coarse-grained and fine-grained sentiment classification, respectively. These results demonstrate that each module contributes significantly to feature interaction and task balance. Overall, the experimental findings further validate that DyFuLM enhances sentiment representation and overall performance through effective hierarchical feature fusion.
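Gated aggregation of two feature streams, as in the Gated Feature Aggregation module described above, is commonly implemented with a learned sigmoid gate that decides how much of each stream passes through. The PyTorch sketch below is a generic version of that idea, not DyFuLM's exact module; the class name and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class GatedAggregation(nn.Module):
    """Blend two same-sized feature vectors with a learned, input-dependent gate."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, low_level, high_level):
        g = self.gate(torch.cat([low_level, high_level], dim=-1))
        return g * low_level + (1.0 - g) * high_level  # convex per-dimension blend
```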
Image Processing
Imperial College London
Abstract
Intrinsic image decomposition aims to separate images into physical components such as albedo, depth, normals, and illumination. While recent diffusion- and transformer-based models benefit from paired supervision from synthetic datasets, their generalization to diverse, real-world scenarios remains challenging. We propose ReasonX, a novel framework that leverages a multimodal large language model (MLLM) as a perceptual judge providing relative intrinsic comparisons, and uses these comparisons as GRPO rewards for fine-tuning intrinsic decomposition models on unlabeled, in-the-wild images. Unlike RL methods for generative models, our framework aligns conditional intrinsic predictors by rewarding agreement between the judge's relational assessments and analytically derived relations from the model's outputs. ReasonX is model-agnostic and can be applied to different intrinsic predictors. Across multiple base architectures and modalities, ReasonX yields significant improvements, including 9-25% WHDR reduction on IIW albedo and up to 46% depth accuracy gains on ETH3D, highlighting the promise of MLLM-guided comparative supervision to bridge low- and high-level vision reasoning.
AI Summary
  • The paper presents a new framework called ReasonX for improving the performance of intrinsic image decomposition models. [3]
  • The framework uses a combination of analytical and perceptual rewards to guide the optimization process (a schematic of this scoring appears after this summary). [3]
  • Analytical Rewards: Rewards based on mathematical equations that describe the relationships between different modalities. [3]
  • Perceptual Rewards: Rewards based on human perception of the quality of the decomposition. [3]
  • The framework's ability to incorporate both analytical and perceptual rewards allows it to better capture the complexities of real-world images. [3]
  • The paper assumes that the pre-trained model distribution is known, which may not be the case in practice. [3]
  • The KL regularization term used to prevent reward hacking and overly aggressive updates may not always work as intended. [3]
  • Previous works on intrinsic image decomposition have focused on developing new models or techniques for improving performance. [3]
  • ReasonX is a new framework for improving intrinsic image decomposition models that uses a combination of analytical and perceptual rewards to guide the optimization process. [3]
  • Imagine you're trying to decompose a picture into its constituent parts, like the color of an object and how much light is shining on it. [3]
  • The ReasonX framework helps improve this process by using both mathematical equations and human perception to guide the optimization. [3]
  • This results in more accurate and robust decomposition models that can handle real-world images. [3]
  • Intrinsic Image Decomposition (IID): A technique that decomposes an image into its constituent materials, such as albedo (reflectance) and irradiance (illumination). [2]
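One way to read the combination of analytical and perceptual rewards with KL regularization described above is as a single scalar score per sample: agreement with the MLLM judge, plus analytically checkable relations among the model's outputs, minus a penalty for drifting from the pre-trained predictor. The sketch below is only that schematic, with hypothetical weights and inputs; it is not the paper's GRPO objective.

```python
def reasonx_style_reward(perceptual_agreement, analytical_consistency,
                         kl_to_pretrained, w_p=1.0, w_a=1.0, beta=0.1):
    """Scalar reward: judge agreement + analytic consistency - KL penalty.

    perceptual_agreement: fraction of MLLM pairwise judgments the prediction matches
    analytical_consistency: score for relations derived from the model's own outputs
    kl_to_pretrained: divergence of the fine-tuned model from the frozen predictor
    """
    return (w_p * perceptual_agreement
            + w_a * analytical_consistency
            - beta * kl_to_pretrained)

reward = reasonx_style_reward(0.8, 0.9, 0.05)
```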
IIIT
Abstract
This thesis presents novel contributions in two primary areas: advancing the efficiency of generative models, particularly normalizing flows, and applying generative models to solve real-world computer vision challenges. The first part introduces significant improvements to normalizing flow architectures through six key innovations: 1) invertible 3x3 convolution layers with mathematically proven necessary and sufficient conditions for invertibility; 2) a more efficient Quad-coupling layer; 3) a fast and efficient parallel inversion algorithm for kxk convolutional layers; 4) a fast and efficient backpropagation algorithm for the inverse of convolution; 5) Inverse-Flow, which uses the inverse of convolution for the forward pass and trains it with the proposed backpropagation algorithm; and 6) Affine-StableSR, a compact and efficient super-resolution model that leverages pre-trained weights and normalizing flow layers to reduce parameter count while maintaining performance. The second part presents: 1) an automated quality assessment system for agricultural produce using Conditional GANs to address class imbalance, data scarcity, and annotation challenges, achieving good accuracy in seed purity testing; 2) an unsupervised geological mapping framework utilizing stacked autoencoders for dimensionality reduction, showing improved feature extraction compared to conventional methods; 3) a privacy-preserving method for autonomous driving datasets based on face detection and image inpainting; 4) Stable Diffusion-based image inpainting for replacing detected faces and license plates, advancing privacy-preserving techniques and ethical considerations in the field; and 5) an adapted diffusion model for art restoration that effectively handles multiple types of degradation through unified fine-tuning.
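For the "inverse of convolution" theme in the first part, a useful mental model is the circular-padding special case: there, a convolution layer is invertible exactly when no DFT coefficient of its zero-padded kernel is zero, and the inverse is a division in the frequency domain. The NumPy sketch below illustrates only that textbook case; it does not reproduce the thesis's invertibility conditions or its parallel inversion and backpropagation algorithms for general kxk layers.

```python
import numpy as np

def circular_conv2d(x, k):
    """2-D circular convolution, computed in the frequency domain."""
    K = np.fft.fft2(k, s=x.shape)               # kernel zero-padded to the image size
    return np.real(np.fft.ifft2(np.fft.fft2(x) * K))

def invert_circular_conv2d(y, k, eps=1e-8):
    """Recover x from y = circular_conv2d(x, k) when the kernel is invertible."""
    K = np.fft.fft2(k, s=y.shape)
    if np.min(np.abs(K)) < eps:                 # a zero frequency response means no inverse
        raise ValueError("kernel is not invertible under circular padding")
    return np.real(np.fft.ifft2(np.fft.fft2(y) / K))
```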
convolution
Zhejiang University
Abstract
Image convolution with complex kernels is a fundamental operation in photography, scientific imaging, and animation effects, yet direct dense convolution is computationally prohibitive on resource-limited devices. Existing approximations, such as simulated annealing or low-rank decompositions, either lack efficiency or fail to capture non-convex kernels. We introduce a differentiable kernel decomposition framework that represents a target spatially-variant, dense, complex kernel using a set of sparse kernel samples. Our approach features (i) a decomposition that enables differentiable optimization of sparse kernels, (ii) a dedicated initialization strategy for non-convex shapes to avoid poor local minima, and (iii) a kernel-space interpolation scheme that extends single-kernel filtering to spatially varying filtering without retraining or additional runtime overhead. Experiments on Gaussian and non-convex kernels show that our method achieves higher fidelity than simulated annealing and significantly lower cost than low-rank decompositions. Our approach provides a practical solution for mobile imaging and real-time rendering, while remaining fully differentiable for integration into broader learning pipelines.
AI Summary
  • The framework enables filter-space interpolation, allowing for complex, spatially-varying effects with minimal per-pixel overhead. [3]
  • Differentiable: A function or model is differentiable when the rate of change of its output with respect to each input can be computed, which allows gradient-based optimization of its parameters. [3]
  • Convolution kernels: Mathematical functions used in signal processing and image analysis to describe how signals or images are transformed under the convolution operation. [3]
  • Its fully differentiable nature allows it to serve as a trainable layer within modern deep learning pipelines. [3]
  • Prior methods may not have been able to handle such complex kernels efficiently or accurately. [3]
  • The paper introduces a differentiable framework that recasts the challenging problem of approximating large, complex convolution kernels as an end-to-end optimization task. [2]
  • This approach robustly handles a wide variety of kernels—from simple Gaussians to complex, non-convex forms—and converges to high-fidelity solutions far more efficiently than prior methods. [1]
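The abstract's core idea, representing a dense kernel with a small set of sparse, differentiable samples, can be prototyped by splatting a few weighted taps onto the kernel grid with bilinear weights and optimizing tap positions and weights by gradient descent. The PyTorch sketch below is one such prototype under assumed choices (bilinear splatting, Adam, random initialization); it is not the paper's decomposition, initialization strategy, or kernel-space interpolation scheme.

```python
import torch

def render_sparse_kernel(pos, w, size):
    """Splat taps with continuous 2-D positions `pos` and weights `w` onto a
    size x size grid; bilinear weights keep the rendering differentiable."""
    grid = torch.zeros(size, size, dtype=w.dtype)
    x = pos[:, 0].clamp(0, size - 1 - 1e-4)
    y = pos[:, 1].clamp(0, size - 1 - 1e-4)
    x0, y0 = x.floor().long(), y.floor().long()
    fx, fy = x - x0, y - y0
    for dx, wx in ((0, 1 - fx), (1, fx)):
        for dy, wy in ((0, 1 - fy), (1, fy)):
            grid = grid.index_put((y0 + dy, x0 + dx), w * wx * wy, accumulate=True)
    return grid

def fit_sparse_kernel(target, num_taps=16, steps=2000, lr=0.05):
    """Fit tap positions and weights so the splatted kernel matches `target`."""
    size = target.shape[0]
    pos = (torch.rand(num_taps, 2) * (size - 1)).requires_grad_(True)
    w = torch.full((num_taps,), target.sum().item() / num_taps, requires_grad=True)
    opt = torch.optim.Adam([pos, w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((render_sparse_kernel(pos, w, size) - target) ** 2).mean()
        loss.backward()
        opt.step()
    return pos.detach(), w.detach()
```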
Vietnam National University
Abstract
Handwritten Text Recognition remains challenging due to limited data, high writing-style variance, and scripts with complex diacritics. Existing approaches, though they partially address these issues, often struggle to generalize without massive synthetic data. To address these challenges, we propose HTR-ConvText, a model designed to capture fine-grained, stroke-level local features while preserving global contextual dependencies. In the feature extraction stage, we integrate a residual Convolutional Neural Network backbone with a MobileViT with Positional Encoding block. This enables the model to both capture structural patterns and learn subtle writing details. We then introduce the ConvText encoder, a hybrid architecture combining global context and local features within a hierarchical structure that reduces sequence length for improved efficiency. Additionally, an auxiliary module injects textual context to mitigate the weakness of Connectionist Temporal Classification. Evaluations on IAM, READ2016, LAM and HANDS-VNOnDB demonstrate that our approach achieves improved performance and better generalization compared to existing methods, especially in scenarios with limited training samples and high handwriting diversity.
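The Connectionist Temporal Classification loss that the auxiliary module is meant to complement is a standard component available directly in PyTorch. The sketch below shows the usual way it is applied to per-frame character probabilities; the tensor shapes, alphabet size, and label length are placeholder values, not HTR-ConvText's.

```python
import torch
import torch.nn as nn

T, N, C = 120, 4, 100          # frames per line image, batch size, charset size incl. blank
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)   # per-frame log-probabilities from the encoder
targets = torch.randint(1, C, (N, 25))                 # ground-truth character indices (0 = blank)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 25, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```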
Image Recognition
Uppsala University
Abstract
Unet and its variations have been standard in semantic image segmentation, especially for computer assisted radiology. Current Unet architectures iteratively downsample spatial resolution while increasing channel dimensions to preserve information content. Such a structure demands a large memory footprint, limiting training batch sizes and increasing inference latency. Channel pruning compresses Unet architecture without accuracy loss, but requires lengthy optimization and may not generalize across tasks and datasets. By investigating Unet pruning, we hypothesize that the final structure is the crucial factor, not the channel selection strategy of pruning. Based on our observations, we propose a lean Unet architecture (LUnet) with a compact, flat hierarchy where channels are not doubled as resolution is halved. We evaluate on a public MRI dataset allowing comparable reporting, as well as on two internal CT datasets. We show that a state-of-the-art pruning solution (STAMP) mainly prunes from the layers with the highest number of channels. Comparatively, simply eliminating a random channel at the pruning-identified layer or at the largest layer achieves similar or better performance. Our proposed LUnet with fixed architectures and over 30 times fewer parameters achieves performance comparable to both conventional Unet counterparts and data-adaptively pruned networks. The proposed lean Unet with constant channel count across layers requires far fewer parameters while achieving performance superior to standard Unet for the same total number of parameters. Skip connections allow Unet bottleneck channels to be largely reduced, unlike standard encoder-decoder architectures requiring increased bottleneck channels for information propagation.
AI Summary
  • The paper presents Lean Unet (LUnet), a compact U-Net architecture with a flat channel hierarchy that matches the performance of conventional U-Nets and data-adaptively pruned networks on medical image segmentation while using over 30 times fewer parameters. [3]
  • The authors analyze STAMP, a state-of-the-art method that combines simultaneous training and model pruning, and show that it mainly prunes from the layers with the most channels; simply removing a random channel at those layers performs similarly or better. [3]
  • The authors also discuss the limitations of their approach, including the need for careful tuning of hyperparameters and the potential for overfitting. [3]
  • STAMP: Simultaneous Training and Model Pruning for low-data regimes in medical image segmentation, a state-of-the-art pruning method used here as a baseline. Lean Unet (LUnet): the proposed fixed architecture with a constant channel count across layers, which achieves performance comparable to both conventional U-Nets and pruned networks with over 30 times fewer parameters. [3]
  • LUnet is evaluated on three medical image segmentation tasks: submandibular gland (SG), tracheal tree (TT), and hippocampus (Hippocampal Segmentation). [2]
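A quick way to see why a flat channel hierarchy shrinks the model so much is to count the 3x3 convolution parameters of an encoder that doubles its channels at every level against one that keeps them constant. The sketch below is a back-of-the-envelope comparison with made-up depth and width, not the authors' exact architectures.

```python
def conv_params(c_in, c_out, k=3):
    return c_in * c_out * k * k + c_out               # weights + bias of one conv layer

def encoder_params(depth, base, constant_width=False, in_ch=1):
    total, c_prev = 0, in_ch
    for level in range(depth):
        c = base if constant_width else base * 2 ** level
        total += conv_params(c_prev, c) + conv_params(c, c)  # two convs per resolution level
        c_prev = c
    return total

standard = encoder_params(depth=5, base=64)                    # channels 64, 128, ..., 1024
lean = encoder_params(depth=5, base=64, constant_width=True)   # 64 channels at every level
print(standard, lean, round(standard / lean, 1))               # the doubling encoder is far larger
```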
KAIST
Abstract
Rich feature representations derived from CLIP-ViT have been widely utilized in AI-generated image detection. While most existing methods primarily leverage features from the final layer, we systematically analyze the contributions of layer-wise features to this task. Our study reveals that earlier layers provide more localized and generalizable features, often surpassing the performance of final-layer features in detection tasks. Moreover, we find that different layers capture distinct aspects of the data, each contributing uniquely to AI-generated image detection. Motivated by these findings, we introduce a novel adaptive method, termed MoLD, which dynamically integrates features from multiple ViT layers using a gating-based mechanism. Extensive experiments on both GAN- and diffusion-generated images demonstrate that MoLD significantly improves detection performance, enhances generalization across diverse generative models, and exhibits robustness in real-world scenarios. Finally, we illustrate the scalability and versatility of our approach by successfully applying it to other pre-trained ViTs, such as DINOv2.
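The gating-based mechanism for integrating features from multiple ViT layers can be pictured as a small network that produces per-layer weights and takes a weighted sum of the layer embeddings before classification. The PyTorch sketch below is a generic version of that idea, not the MoLD implementation; the pooling choice, dimensions, and two-class head are assumptions.

```python
import torch
import torch.nn as nn

class GatedLayerFusion(nn.Module):
    """Adaptively mix per-layer ViT features for real-vs-generated image detection."""
    def __init__(self, num_layers, dim):
        super().__init__()
        self.gate = nn.Linear(num_layers * dim, num_layers)  # one mixing logit per layer
        self.head = nn.Linear(dim, 2)                        # real vs AI-generated

    def forward(self, layer_feats):
        # layer_feats: (batch, num_layers, dim), e.g. the CLS token from each ViT block
        b, l, d = layer_feats.shape
        weights = self.gate(layer_feats.reshape(b, l * d)).softmax(dim=-1)
        fused = (weights.unsqueeze(-1) * layer_feats).sum(dim=1)
        return self.head(fused)
```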