Hi!

Your personalized paper recommendations for 24–28 November 2025.
🎯 Top Personalized Recommendations
AI Summary
  • DDC improves performance for all models across the other five noise scenarios, with a particularly large accuracy gain for the EfficientNet-B0 model. [3]
  • The improvement is more stable and pronounced for lightweight models such as EfficientNet-B0. [3]
  • DDC's gains in noise robustness vary across models, but its noise adaptability is stable overall. [3]
  • YOLOv8 performs best after replacing traditional convolution with DDC, with significant improvements in mAP@0.5 and mAP@0.5:0.95 under six noise scenarios. [3]
  • DDC: the dendritic convolution proposed in the paper. mAP@0.5: mean Average Precision at an IoU threshold of 0.5. mAP@0.5:0.95: mean Average Precision averaged over IoU thresholds from 0.5 to 0.95. [3]
  • DDC shows stable anti-interference ability for most noise types, with significant performance gains in both image classification and object detection tasks. [3]
  • DDC's nonlinear interaction mechanism compensates for the weaknesses of the traditional convolution framework in feature selection, especially for lightweight models. [3]
  • YOLOv8 is highly compatible with DDC's nonlinear interaction mechanism, yielding significant performance improvements in the object detection task. [3]
Abstract
In real-world scenarios of image recognition, there exists substantial noise interference. Existing works primarily focus on methods such as adjusting networks or training strategies to address noisy image recognition, and the anti-noise performance has reached a bottleneck. However, little is known about the exploration of anti-interference solutions from a neuronal perspective. This paper proposes an anti-noise neuronal convolution. This convolution mimics the dendritic structure of neurons, integrates the neighborhood interaction computation logic of dendrites into the underlying design of convolutional operations, and simulates the XOR logic preprocessing function of biological dendrites through nonlinear interactions between input features, thereby fundamentally reconstructing the mathematical paradigm of feature extraction. Unlike traditional convolution, where noise directly interferes with feature extraction and exerts a significant impact, DDC mitigates the influence of noise by focusing on the interaction of neighborhood information. Experimental results demonstrate that in image classification tasks (using YOLOv11-cls, VGG16, and EfficientNet-B0) and object detection tasks (using YOLOv11, YOLOv8, and YOLOv5), after replacing traditional convolution with the dendritic convolution, the accuracy of the EfficientNet-B0 model on noisy datasets is relatively improved by 11.23%, and the mean Average Precision (mAP) of YOLOv8 is increased by 19.80%. The consistency between the computation method of this convolution and the dendrites of biological neurons enables it to perform significantly better than traditional convolution in complex noisy environments.
Why we think this paper is great for you:
This paper directly addresses your interest in convolution by introducing a novel dendritic convolution method. It specifically applies to image recognition tasks, particularly in challenging noisy environments, which aligns well with your focus on image processing.
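To make the mechanism in the abstract concrete, here is a minimal PyTorch sketch of a convolution whose patch response adds multiplicative interactions between neighboring inputs, loosely mimicking the dendritic XOR-style preprocessing described above. The layer name, the rolled-pair interaction, and the two-projection structure are illustrative assumptions, not the paper's exact DDC formulation.

```python
import torch
import torch.nn as nn

class DendriticConv2d(nn.Module):
    """Sketch: standard patch projection plus a nonlinear term built from
    products of neighboring patch entries (assumes an odd kernel size)."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.unfold = nn.Unfold(kernel_size=k, padding=k // 2)
        n = in_ch * k * k
        self.linear = nn.Linear(n, out_ch)  # classic convolution term
        self.pair = nn.Linear(n, out_ch)    # neighborhood-interaction term

    def forward(self, x):
        b, _, h, w = x.shape
        patches = self.unfold(x).transpose(1, 2)        # (B, H*W, C*k*k)
        # Multiply each patch entry with a shifted neighbor to get a
        # simple second-order (XOR-like) interaction feature.
        inter = patches * torch.roll(patches, shifts=1, dims=-1)
        out = self.linear(patches) + self.pair(inter)   # (B, H*W, out_ch)
        return out.transpose(1, 2).reshape(b, -1, h, w)
```

The second projection is the only addition over a plain convolution: input features enter it through products with their neighbors, which is one plausible reading of the neighborhood-interaction idea.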
Abstract
Merging models fine-tuned for different tasks into a single unified model has become an increasingly important direction for building versatile, efficient multi-task systems. Existing approaches predominantly rely on parameter interpolation in weight space, which we show introduces significant distribution shift in the feature space and undermines task-specific knowledge. In this paper, we propose OTMF (Optimal Transport-based Masked Fusion), a novel model merging framework rooted in optimal transport theory to address the distribution shift that arises from naive parameter interpolation. Instead of directly aggregating features or weights, OTMF aligns the semantic geometry of task-specific models by discovering common masks applied to task vectors through optimal transport plans. These masks selectively extract transferable and task-agnostic components while preserving the unique structural identities of each task. To ensure scalability in real-world settings, OTMF further supports a continual fusion paradigm that incrementally integrates each new task vector without revisiting previous ones, maintaining a bounded memory footprint and enabling efficient fusion across a growing number of tasks. We conduct comprehensive experiments on multiple vision and language benchmarks, and results show that OTMF achieves state-of-the-art performance in terms of both accuracy and efficiency. These findings highlight the practical and theoretical value of our approach to model merging.
Why we think this paper is great for you:
You will find this paper highly relevant as it explores the continual fusion of task-specific models, a core concept within fusion models. It offers insights into building versatile multi-task systems, which is key to your interest in multimodal approaches.
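As a rough illustration of the task-vector masking-and-merging pattern described in the abstract, here is a short PyTorch sketch. OTMF's optimal-transport-derived common mask is the paper's contribution and is not reproduced here; the sign-agreement mask below is an explicitly labeled stand-in (TIES-style), not the OT mask.

```python
import torch

def merge_with_masks(base_state, finetuned_states, mask_fn):
    """Merge fine-tuned models by masking their task vectors
    (theta_task - theta_base) and adding the averaged masked deltas
    back onto the base weights. Assumes float parameters."""
    merged = {}
    for name, base_w in base_state.items():
        deltas = [ft[name] - base_w for ft in finetuned_states]
        masked = [mask_fn(d, deltas) * d for d in deltas]
        merged[name] = base_w + torch.stack(masked).mean(0)
    return merged

def sign_agreement_mask(delta, all_deltas):
    # Placeholder mask: keep coordinates whose sign matches the
    # majority across tasks. This is NOT OTMF's OT-derived mask.
    majority = torch.sign(torch.stack(all_deltas).sum(0))
    return (torch.sign(delta) == majority).float()
```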
AI Summary
  • Vidi2 is a large multimodal model that achieves state-of-the-art spatio-temporal grounding performance while maintaining strong video question answering capabilities. [2]
Abstract
Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. Vidi models continue to evolve toward next-generation video creation and have achieved state-of-the-art performance in multimodal temporal retrieval (TR). In its second release, Vidi2 advances video understanding with fine-grained spatio-temporal grounding (STG) and extends its capability to video question answering (Video QA), enabling comprehensive multimodal reasoning. Given a text query, Vidi2 can identify not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. This end-to-end spatio-temporal grounding capability enables potential applications in complex editing scenarios, such as plot or character understanding, automatic multi-view switching, and intelligent, composition-aware reframing and cropping. To enable comprehensive evaluation of STG in practical settings, we introduce a new benchmark, VUE-STG, which offers four key improvements over existing STG datasets: 1) Video duration: spans from roughly 10s to 30 mins, enabling long-context reasoning; 2) Query format: queries are mostly converted into noun phrases while preserving sentence-level expressiveness; 3) Annotation quality: all ground-truth time ranges and bounding boxes are manually annotated with high accuracy; 4) Evaluation metric: a refined vIoU/tIoU/vIoU-Intersection scheme. In addition, we upgrade the previous VUE-TR benchmark to VUE-TR-V2, achieving a more balanced video-length distribution and more user-style queries. Remarkably, the Vidi2 model substantially outperforms leading proprietary systems, such as Gemini 3 Pro (Preview) and GPT-5, on both VUE-TR-V2 and VUE-STG, while achieving competitive results with popular open-source models with similar scale on video QA benchmarks.
Why we think this paper is great for you:
This paper is an excellent match for your interest in multimodal models, specifically focusing on their application to video. It delves into state-of-the-art performance in video understanding and creation, which is a key area of image processing.
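For reference, the sketch below gives the commonly used definitions of the temporal and spatio-temporal IoU metrics named in the abstract; VUE-STG's refined vIoU/tIoU/vIoU-Intersection scheme may differ in details the abstract does not spell out.

```python
def t_iou(pred, gt):
    """Temporal IoU between two (start, end) time ranges in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """Spatial IoU between (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def v_iou(pred_boxes, gt_boxes, gt_frames):
    """One common vIoU variant: mean spatial IoU over ground-truth
    frames, counting frames without a prediction as zero."""
    return sum(box_iou(pred_boxes[f], gt_boxes[f])
               for f in gt_frames if f in pred_boxes) / len(gt_frames)
```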
Abstract
Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation. Code and data are available at https://github.com/PKU-YuanGroup/UniSandBox
Why we think this paper is great for you:
This paper directly investigates the fundamental aspects of unified multimodal models, a central theme in your interests. It provides an analytical framework to understand the interplay between understanding and generation in these complex systems.
Abstract
Learned image compression (LIC) has recently benefited from Transformer based and state space model (SSM) based architectures. Convolutional neural networks (CNNs) effectively capture local high frequency details, whereas Transformers and SSMs provide strong long range modeling capabilities but may cause structural information loss or ignore frequency characteristics that are crucial for compression. In this work we propose HCFSSNet, a Hybrid Convolution and Frequency State Space Network for LIC. HCFSSNet uses CNNs to extract local high frequency structures and introduces a Vision Frequency State Space (VFSS) block that models long range low frequency information. The VFSS block combines an Omni directional Neighborhood State Space (VONSS) module, which scans features horizontally, vertically and diagonally, with an Adaptive Frequency Modulation Module (AFMM) that applies content adaptive weighting of discrete cosine transform frequency components for more efficient bit allocation. To further reduce redundancy in the entropy model, we integrate AFMM with a Swin Transformer to form a Frequency Swin Transformer Attention Module (FSTAM) for frequency aware side information modeling. Experiments on the Kodak, Tecnick and CLIC Professional Validation datasets show that HCFSSNet achieves competitive rate distortion performance compared with recent SSM based codecs such as MambaIC, while using significantly fewer parameters. On Kodak, Tecnick and CLIC, HCFSSNet reduces BD rate over the VTM anchor by 18.06, 24.56 and 22.44 percent, respectively, providing an efficient and interpretable hybrid architecture for future learned image compression systems.
Why we think this paper is great for you:
You will appreciate this paper's exploration of hybrid architectures, combining convolutional neural networks with other models. Its application to image compression directly aligns with your interest in advanced image processing techniques.
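A minimal sketch of what content-adaptive frequency weighting in the spirit of AFMM could look like. It substitutes PyTorch's FFT for the discrete cosine transform named in the abstract, and the weight-predictor design is an assumption rather than the paper's module.

```python
import torch
import torch.nn as nn

class FrequencyModulationSketch(nn.Module):
    """Transform features to the frequency domain, predict per-frequency
    weights from the content, modulate, and transform back. FFT is used
    here as a stand-in for the DCT described in the abstract."""
    def __init__(self, channels):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.AdaptiveAvgPool2d(8),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),  # weights in (0, 1) per coarse frequency bin
        )

    def forward(self, x):
        freq = torch.fft.rfft2(x, norm="ortho")
        w = self.weight_net(x)                      # (B, C, 8, 8)
        w = nn.functional.interpolate(w, size=freq.shape[-2:], mode="bilinear")
        return torch.fft.irfft2(freq * w, s=x.shape[-2:], norm="ortho")
```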
Abstract
Image classification is a well-studied task in computer vision, and yet it remains challenging under high-uncertainty conditions, such as when input images are corrupted or training data are limited. Conventional classification approaches typically train models to directly predict class labels from input images, but this might lead to suboptimal performance in such scenarios. To address this issue, we propose Discrete Diffusion Classification Modeling (DiDiCM), a novel framework that leverages a diffusion-based procedure to model the posterior distribution of class labels conditioned on the input image. DiDiCM supports diffusion-based predictions either on class probabilities or on discrete class labels, providing flexibility in computation and memory trade-offs. We conduct a comprehensive empirical study demonstrating the superior performance of DiDiCM over standard classifiers, showing that a few diffusion iterations achieve higher classification accuracy on the ImageNet dataset compared to baselines, with accuracy gains increasing as the task becomes more challenging. We release our code at https://github.com/omerb01/didicm.
Why we think this paper is great for you:
This paper is highly relevant to your interest in image recognition, specifically addressing image classification under challenging conditions. It introduces a novel diffusion-based approach to improve classification performance.
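The outer loop of such a diffusion-based classifier might look like the sketch below: start from an uninformative label distribution and iteratively refine it conditioned on the image. The model signature, schedule, and transition behavior are assumptions; the authors' released code is the reference implementation.

```python
import torch

@torch.no_grad()
def diffusion_classify(model, images, num_classes, steps=8):
    """Hypothetical sampling loop: `model(images, probs, t)` is assumed
    to return denoised class logits given the current label estimate."""
    probs = torch.full((images.shape[0], num_classes), 1.0 / num_classes)
    for t in reversed(range(steps)):
        logits = model(images, probs, t)       # one denoising step
        probs = torch.softmax(logits, dim=-1)  # refined posterior estimate
    return probs.argmax(dim=-1)
```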
AI Summary
  • Sampling uses diffusion solvers and guidance scales that are carefully tuned for optimal performance. [3]
  • Diffusion Models: generative models that gradually corrupt data by adding noise and learn to reverse the process to generate samples. [3]
  • PixelDiT is a pixel-space diffusion transformer for image generation that combines the strengths of diffusion models and transformers. [2]
Abstract
Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. Our analysis reveals that effective pixel-level token modeling is essential to the success of pixel diffusion. PixelDiT achieves 1.61 FID on ImageNet 256x256, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024x1024 resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.
Why we think this paper is great for you:
This paper is a strong match for your interest in image processing, focusing on advanced image generation techniques. It introduces a direct pixel-space approach for diffusion transformers, offering insights into modern generative models.
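A skeleton of the dual-level design the abstract describes: a patch-level transformer captures global semantics and hands per-patch context to a pixel-level transformer that refines details inside each patch. Dimensions, depths, and the token routing are assumptions, not the paper's configuration.

```python
import torch.nn as nn

class DualLevelDiTSketch(nn.Module):
    """Patch-level encoder over coarse tokens, then a pixel-level encoder
    over the pixels of each patch, conditioned on the patch context.
    Assumes H and W are divisible by the patch size."""
    def __init__(self, patch=16, dim=512, pix_dim=128):
        super().__init__()
        self.patch = patch
        self.patch_embed = nn.Conv2d(3, dim, patch, stride=patch)
        self.patch_dit = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 8, batch_first=True), 6)
        self.to_pix = nn.Linear(dim, pix_dim)
        self.pix_embed = nn.Linear(3, pix_dim)
        self.pixel_dit = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(pix_dim, 4, batch_first=True), 2)
        self.head = nn.Linear(pix_dim, 3)  # per-pixel noise/velocity output

    def forward(self, x):
        b, _, h, w = x.shape
        p = self.patch
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        ctx = self.to_pix(self.patch_dit(tokens))                # (B, N, pix_dim)
        pix = x.unfold(2, p, p).unfold(3, p, p)                  # (B, 3, H/p, W/p, p, p)
        pix = pix.permute(0, 2, 3, 4, 5, 1).reshape(b, -1, p * p, 3)
        pix = self.pix_embed(pix) + ctx.unsqueeze(2)             # add patch context
        pix = self.pixel_dit(pix.reshape(-1, p * p, pix.shape[-1]))
        return self.head(pix)  # (B*N, p*p, 3), one prediction per pixel
```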
Fusion Models
Abstract
Stellarators are a prospective class of fusion-based power plants that confine a hot plasma with three-dimensional magnetic fields. Typically framed as a PDE-constrained optimization problem, stellarator design is a time-consuming process that can take hours to solve on a computing cluster. Developing fast methods for designing stellarators is crucial for advancing fusion research. Given the recent development of large datasets of optimized stellarators, machine learning approaches have emerged as a potential candidate. Motivated by this, we present an open inverse problem to the machine learning community: to rapidly generate high-quality stellarator designs which have a set of desirable characteristics. As a case study in the problem space, we train a conditional diffusion model on data from the QUASR database to generate quasisymmetric stellarator designs with desirable characteristics (aspect ratio and mean rotational transform). The diffusion model is applied to design stellarators with characteristics not seen during training. We provide evaluation protocols and show that many of the generated stellarators exhibit solid performance: less than 5% deviation from quasisymmetry and the target characteristics. The modest deviation from quasisymmetry highlights an opportunity to reach the sub 1% target. Beyond the case study, we share multiple promising avenues for generative modeling to advance stellarator design.
AI Summary
  • The model embeds input features with a sinusoidal embedding head before passing them through standard feed-forward layers. [3]
  • The model is trained with 250 epochs, a batch size of 4096, and an evaluation batch size of 128. [3]
  • Prior to training the diffusion model, the dimensionality of the raw training data is reduced with principal component analysis (PCA) to eliminate noise and high-frequency oscillations (see the pipeline sketch after this list). [3]
  • cι: the conditioning value for the mean rotational transform ι. [3]
  • cA: the conditioning value for the aspect ratio A. [3]
  • The model is limited in what it can learn due to missing important information in the training set. [3]
  • The model is conditioned on features such as iota, A, nfp, and N. [2]
  • A diffusion model is trained to generate stellarator designs that meet specific performance criteria. [1]
  • JQS: Quasisymmetry objective that computes the relative distance between the quasisymmetric and non-quasisymmetric components of the field strength. [0]
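As noted in the PCA bullet above, the overall pipeline shape might look like the sketch below: compress the raw boundary data with PCA, run a conditional diffusion model in the reduced space, then invert the PCA. Only the PCA calls use a real API (scikit-learn); `sample_diffusion` is a hypothetical placeholder for the trained model's reverse process.

```python
from sklearn.decomposition import PCA

def fit_pca(raw_boundaries, n_components=64):
    """Reduce raw boundary coefficients to a low-dimensional space,
    discarding noise and high-frequency oscillations."""
    pca = PCA(n_components=n_components)
    z = pca.fit_transform(raw_boundaries)
    return pca, z

def generate_designs(model, pca, conds, sample_diffusion):
    """conds holds per-sample conditioning features, e.g. (iota, A, nfp, N).
    `sample_diffusion` (hypothetical) runs the reverse diffusion in PCA space."""
    z = sample_diffusion(model, conds)
    return pca.inverse_transform(z)  # back to boundary coefficients
```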
Image Processing
Abstract
We present a comprehensive comparative study of three generative modeling paradigms: Denoising Diffusion Probabilistic Models (DDPM), Conditional Flow Matching (CFM), and MeanFlow. While DDPM and CFM require iterative sampling, MeanFlow enables direct one-step generation by modeling the average velocity over time intervals. We implement all three methods using a unified TinyUNet architecture (<1.5M parameters) on CIFAR-10, demonstrating that CFM achieves an FID of 24.15 with 50 steps, significantly outperforming DDPM (FID 402.98). MeanFlow achieves FID 29.15 with single-step sampling -- a 50X reduction in inference time. We further extend CFM to image inpainting, implementing mask-guided sampling with four mask types (center, random bbox, irregular, half). Our fine-tuned inpainting model achieves substantial improvements: PSNR increases from 4.95 to 8.57 dB on center masks (+73%), and SSIM improves from 0.289 to 0.418 (+45%), demonstrating the effectiveness of inpainting-aware training.
AI Summary
  • CFM significantly outperforms DDPM (FID 24.15 vs 402.98) with the same architecture. [3]
  • MeanFlow enables one-step generation (FID 29.15) with a 50× inference speedup (see the sampling sketch after this list). [3]
  • Fine-tuning dramatically improves inpainting: +73% PSNR, +45% SSIM. [3]
  • CFM: Conditional Flow Matching; DDPM: Denoising Diffusion Probabilistic Models; MeanFlow: one-step generation via mean velocities. [3]
  • FID: Fréchet Inception Distance; PSNR: Peak Signal-to-Noise Ratio; SSIM: Structural Similarity Index Measure. [3]
  • The paper presents a comprehensive comparison of DDPM, CFM, and MeanFlow on CIFAR-10, with an extension to image inpainting. [2]
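Here is the one-step sampling sketch referenced in the summary: because a MeanFlow network predicts the average velocity over a time interval, a single evaluation over [0, 1] maps noise directly to a sample. The signature u(x, r, t) is an assumption about the interface.

```python
import torch

@torch.no_grad()
def meanflow_sample(model, shape, device="cpu"):
    """One-step generation: x1 = x0 + u(x0, r=0, t=1) * (t - r),
    where u is the learned average velocity over [r, t]."""
    x0 = torch.randn(shape, device=device)    # start from Gaussian noise
    r = torch.zeros(shape[0], device=device)  # interval start
    t = torch.ones(shape[0], device=device)   # interval end
    return x0 + model(x0, r, t) * (t - r).view(-1, 1, 1, 1)
```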
Image Recognition
Abstract
We propose a smooth regularization technique that instills a strong temporal inductive bias in video recognition models, particularly benefiting lightweight architectures. Our method encourages smoothness in the intermediate-layer embeddings of consecutive frames by modeling their changes as a Gaussian Random Walk (GRW). This penalizes abrupt representational shifts, thereby promoting low-acceleration solutions that better align with the natural temporal coherence inherent in videos. By leveraging this enforced smoothness, lightweight models can more effectively capture complex temporal dynamics. Applied to such models, our technique yields a 3.8% to 6.4% accuracy improvement on Kinetics-600. Notably, the MoViNets model family trained with our smooth regularization improves the current state of the art by 3.8% to 6.1% within their respective FLOP constraints, while MobileNetV3 and the MoViNets-Stream family achieve gains of 4.9% to 6.4% over prior state-of-the-art models with comparable memory footprints. Our code and models are available at https://github.com/gilgoldm/grw-smoothing.
AI Summary
  • The paper presents a new method for smoothing sequences, called GRW-smoothing. [3]
  • The paper focuses on the scaling behavior of GRW-smoothing and provides a proof for the optimal solution in the one-dimensional case. [3]
  • GRW-smoothing: A method for smoothing sequences based on velocity and acceleration vectors. [3]
  • L(Z): the loss function for GRW-smoothing, the sum of the velocity and acceleration terms: L(Z) = Lv(Z) + La(Z). [3]
  • Lv(Z): the velocity term, the sum of the squared frame-to-frame velocities. [3]
  • La(Z): the acceleration term, the Gaussian Random Walk negative log-likelihood, proportional to the sum of the squared accelerations (see the sketch after this list). [3]
  • R(Z): The range of the configuration Z, which is defined as R(Z) := zT. [3]
  • The authors prove that the optimal solution for a smoothing window of size T lies within a ball of radius bounded by O(T√lnT). [2]
  • This method is based on the concept of velocity and acceleration vectors in a sequence. [1]
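A minimal sketch of a GRW-style smoothness penalty on per-frame embeddings, matching the velocity and acceleration terms listed above; the exact weighting and normalization used in the paper may differ.

```python
import torch

def grw_smoothing_penalty(z, lambda_v=1.0, lambda_a=1.0):
    """z: intermediate-layer embeddings of shape (B, T, D) for T frames.
    Penalizes squared frame-to-frame velocities and accelerations, which
    corresponds (up to constants) to a Gaussian Random Walk likelihood."""
    v = z[:, 1:] - z[:, :-1]   # discrete velocities, (B, T-1, D)
    a = v[:, 1:] - v[:, :-1]   # discrete accelerations, (B, T-2, D)
    return lambda_v * (v ** 2).sum(-1).mean() + lambda_a * (a ** 2).sum(-1).mean()
```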