π― Top Personalized Recommendations
AI Summary - DDC achieves performance improvement in all models in the other five noise scenarios, with a significant increase in accuracy for EfficientNet-B0 model. [3]
- The performance improvement is more stable and obvious for lightweight models such as EfficientNet-B0. [3]
- DDC has different performance in improving the anti-interference performance of different models, but has stable noise adaptability as a whole. [3]
- YOLOv8 performs best after replacing traditional convolution with DDC, with significant improvements in mAP@0.5 and mAP@0.5: 0.95 under six noise scenarios. [3]
- DDC: Deep Domain Confusion mAP@0.5: mean Average Precision at IoU=0.5 mAP@0.5 : 0.95: mean Average Precision at IoU=0.5 and IoU=0.95 DDC has stable anti-interference ability for most noise types, with significant performance improvement in image classification and object detection tasks. [3]
- The nonlinear interaction mechanism of DDC can better make up for the shortcomings of traditional convolution framework in feature selection, especially for lightweight models. [3]
- YOLOv8 is highly compatible with the nonlinear interaction mechanism of DDC, resulting in significant performance improvements in object detection task. [3]
Abstract
In real-world scenarios of image recognition, there exists substantial noise interference. Existing works primarily focus on methods such as adjusting networks or training strategies to address noisy image recognition, and the anti-noise performance has reached a bottleneck. However, little is known about the exploration of anti-interference solutions from a neuronal perspective.This paper proposes an anti-noise neuronal convolution. This convolution mimics the dendritic structure of neurons, integrates the neighborhood interaction computation logic of dendrites into the underlying design of convolutional operations, and simulates the XOR logic preprocessing function of biological dendrites through nonlinear interactions between input features, thereby fundamentally reconstructing the mathematical paradigm of feature extraction. Unlike traditional convolution where noise directly interferes with feature extraction and exerts a significant impact, DDC mitigates the influence of noise by focusing on the interaction of neighborhood information. Experimental results demonstrate that in image classification tasks (using YOLOv11-cls, VGG16, and EfficientNet-B0) and object detection tasks (using YOLOv11, YOLOv8, and YOLOv5), after replacing traditional convolution with the dendritic convolution, the accuracy of the EfficientNet-B0 model on noisy datasets is relatively improved by 11.23%, and the mean Average Precision (mAP) of YOLOv8 is increased by 19.80%. The consistency between the computation method of this convolution and the dendrites of biological neurons enables it to perform significantly better than traditional convolution in complex noisy environments.
Why we think this paper is great for you:
This paper directly addresses your interest in convolution by introducing a novel dendritic convolution method. It specifically applies to image recognition tasks, particularly in challenging noisy environments, which aligns well with your focus on image processing.
Abstract
Merging models fine-tuned for different tasks into a single unified model has become an increasingly important direction for building versatile, efficient multi-task systems. Existing approaches predominantly rely on parameter interpolation in weight space, which we show introduces significant distribution shift in the feature space and undermines task-specific knowledge. In this paper, we propose OTMF (Optimal Transport-based Masked Fusion), a novel model merging framework rooted in optimal transport theory to address the distribution shift that arises from naive parameter interpolation. Instead of directly aggregating features or weights, OTMF aligns the semantic geometry of task-specific models by discovering common masks applied to task vectors through optimal transport plans. These masks selectively extract transferable and task-agnostic components while preserving the unique structural identities of each task. To ensure scalability in real-world settings, OTMF further supports a continual fusion paradigm that incrementally integrates each new task vector without revisiting previous ones, maintaining a bounded memory footprint and enabling efficient fusion across a growing number of tasks. We conduct comprehensive experiments on multiple vision and language benchmarks, and results show that OTMF achieves state-of-the-art performance in terms of both accuracy and efficiency. These findings highlight the practical and theoretical value of our approach to model merging.
Why we think this paper is great for you:
You will find this paper highly relevant as it explores the continual fusion of task-specific models, a core concept within fusion models. It offers insights into building versatile multi-task systems, which is key to your interest in multimodal approaches.
AI Summary - Vidi2 is a large multimodal model that achieves state-of-the-art spatio-temporal grounding performance while maintaining strong video question answering capabilities. [2]
Abstract
Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. Vidi models continue to evolve toward next-generation video creation and have achieved state-of-the-art performance in multimodal temporal retrieval (TR). In its second release, Vidi2 advances video understanding with fine-grained spatio-temporal grounding (STG) and extends its capability to video question answering (Video QA), enabling comprehensive multimodal reasoning. Given a text query, Vidi2 can identify not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. This end-to-end spatio-temporal grounding capability enables potential applications in complex editing scenarios, such as plot or character understanding, automatic multi-view switching, and intelligent, composition-aware reframing and cropping. To enable comprehensive evaluation of STG in practical settings, we introduce a new benchmark, VUE-STG, which offers four key improvements over existing STG datasets: 1) Video duration: spans from roughly 10s to 30 mins, enabling long-context reasoning; 2) Query format: queries are mostly converted into noun phrases while preserving sentence-level expressiveness; 3) Annotation quality: all ground-truth time ranges and bounding boxes are manually annotated with high accuracy; 4) Evaluation metric: a refined vIoU/tIoU/vIoU-Intersection scheme. In addition, we upgrade the previous VUE-TR benchmark to VUE-TR-V2, achieving a more balanced video-length distribution and more user-style queries. Remarkably, the Vidi2 model substantially outperforms leading proprietary systems, such as Gemini 3 Pro (Preview) and GPT-5, on both VUE-TR-V2 and VUE-STG, while achieving competitive results with popular open-source models with similar scale on video QA benchmarks.
Why we think this paper is great for you:
This paper is an excellent match for your interest in multimodal models, specifically focusing on their application to video. It delves into state-of-the-art performance in video understanding and creation, which is a key area of image processing.
Abstract
Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation. Code and data are available at https://github.com/PKU-YuanGroup/UniSandBox
Why we think this paper is great for you:
This paper directly investigates the fundamental aspects of unified multimodal models, a central theme in your interests. It provides an analytical framework to understand the interplay between understanding and generation in these complex systems.
Abstract
Learned image compression (LIC) has recently benefited from Transformer based and state space model (SSM) based architectures. Convolutional neural networks (CNNs) effectively capture local high frequency details, whereas Transformers and SSMs provide strong long range modeling capabilities but may cause structural information loss or ignore frequency characteristics that are crucial for compression. In this work we propose HCFSSNet, a Hybrid Convolution and Frequency State Space Network for LIC. HCFSSNet uses CNNs to extract local high frequency structures and introduces a Vision Frequency State Space (VFSS) block that models long range low frequency information. The VFSS block combines an Omni directional Neighborhood State Space (VONSS) module, which scans features horizontally, vertically and diagonally, with an Adaptive Frequency Modulation Module (AFMM) that applies content adaptive weighting of discrete cosine transform frequency components for more efficient bit allocation. To further reduce redundancy in the entropy model, we integrate AFMM with a Swin Transformer to form a Frequency Swin Transformer Attention Module (FSTAM) for frequency aware side information modeling. Experiments on the Kodak, Tecnick and CLIC Professional Validation datasets show that HCFSSNet achieves competitive rate distortion performance compared with recent SSM based codecs such as MambaIC, while using significantly fewer parameters. On Kodak, Tecnick and CLIC, HCFSSNet reduces BD rate over the VTM anchor by 18.06, 24.56 and 22.44 percent, respectively, providing an efficient and interpretable hybrid architecture for future learned image compression systems.
Why we think this paper is great for you:
You will appreciate this paper's exploration of hybrid architectures, combining convolutional neural networks with other models. Its application to image compression directly aligns with your interest in advanced image processing techniques.
Abstract
Image classification is a well-studied task in computer vision, and yet it remains challenging under high-uncertainty conditions, such as when input images are corrupted or training data are limited. Conventional classification approaches typically train models to directly predict class labels from input images, but this might lead to suboptimal performance in such scenarios. To address this issue, we propose Discrete Diffusion Classification Modeling (DiDiCM), a novel framework that leverages a diffusion-based procedure to model the posterior distribution of class labels conditioned on the input image. DiDiCM supports diffusion-based predictions either on class probabilities or on discrete class labels, providing flexibility in computation and memory trade-offs. We conduct a comprehensive empirical study demonstrating the superior performance of DiDiCM over standard classifiers, showing that a few diffusion iterations achieve higher classification accuracy on the ImageNet dataset compared to baselines, with accuracy gains increasing as the task becomes more challenging. We release our code at https://github.com/omerb01/didicm .
Why we think this paper is great for you:
This paper is highly relevant to your interest in image recognition, specifically addressing image classification under challenging conditions. It introduces a novel diffusion-based approach to improve classification performance.
AI Summary - The model is trained using a combination of diffusion-based solvers and guidance scales, which are carefully tuned for optimal performance. [3]
- Diffusion Models: A type of generative model that learns to represent data as a sequence of noise vectors, where each vector is obtained by adding noise to the previous one. [3]
- PixelDiT is a novel diffusion-based image-to-image translation model that combines the strengths of both diffusion models and transformers. [2]
Abstract
Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. Our analysis reveals that effective pixel-level token modeling is essential to the success of pixel diffusion. PixelDiT achieves 1.61 FID on ImageNet 256x256, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024x1024 resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.
Why we think this paper is great for you:
This paper is a strong match for your interest in image processing, focusing on advanced image generation techniques. It introduces a direct pixel-space approach for diffusion transformers, offering insights into modern generative models.