Hi!

Your personalized paper recommendations for 02 to 06 February 2026.
University of Trento
AI Insights
  • Model merging: The process of combining multiple LLMs to create a new model that can perform tasks more effectively. (ML: 0.96)
  • Future work could involve exploring other methods for model merging and evaluating their performance on different tasks. (ML: 0.95)
  • Multimodal large language models (LLMs): Models that can process multiple types of data, such as text, images, and audio. (ML: 0.94)
  • Model merging is a promising approach for developing more robust and effective LLMs. (ML: 0.94)
  • The paper discusses the concept of model merging for multimodal large language models (LLMs). (ML: 0.91)
  • The paper demonstrates the effectiveness of AdamMS in merging heterogeneous multimodal LLMs and improving performance on various benchmarks. (ML: 0.83)
  • The authors propose a method called AdamMS, which uses unsupervised coefficient optimization to merge heterogeneous multimodal LLMs. (ML: 0.83)
Abstract
Selecting the best data mixture is critical for successful Supervised Fine-Tuning (SFT) of Multimodal Large Language Models. However, determining the optimal mixture weights across multiple domain-specific datasets remains a significant bottleneck due to the combinatorial search space and the high cost associated with even a single training run. This is the so-called Data Mixture Optimization (DMO) problem. On the other hand, model merging unifies domain-specific experts through parameter interpolation. This strategy is efficient, as it only requires a single training run per domain, yet oftentimes leads to suboptimal models. In this work, we take the best of both worlds, studying model merging as an efficient strategy for estimating the performance of different data mixtures. We train domain-specific multimodal experts and evaluate their weighted parameter-space combinations to estimate the efficacy of corresponding data mixtures. We conduct extensive experiments on 14 multimodal benchmarks, and empirically demonstrate that the merged proxy models exhibit a high rank correlation with models trained on actual data mixtures. This decouples the search for optimal mixtures from the resource-intensive training process, thereby providing a scalable and efficient strategy for navigating the complex landscape of mixture weights. Code is publicly available at https://github.com/BerasiDavide/mLLMs_merging_4_DMO.
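To make the proxy idea concrete, here is a minimal sketch of weighted parameter-space merging: domain experts sharing one architecture are averaged under candidate mixture weights, and the merged model stands in for a model trained on that mixture. The toy nn.Linear experts and the merge_experts helper are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

def merge_experts(experts, weights):
    """Return a state dict that is the weighted average of expert state dicts."""
    assert abs(sum(weights) - 1.0) < 1e-6, "mixture weights should sum to 1"
    merged = {}
    for key in experts[0].state_dict():
        merged[key] = sum(w * e.state_dict()[key].float()
                          for w, e in zip(weights, experts))
    return merged

# Toy domain experts sharing one architecture (a requirement for merging).
experts = [nn.Linear(8, 2) for _ in range(3)]

# One candidate mixture; in the paper's setting each candidate would be scored
# by evaluating the merged proxy model on held-out benchmarks instead of
# running a full SFT training for that mixture.
candidate = [0.5, 0.3, 0.2]
proxy = nn.Linear(8, 2)
proxy.load_state_dict(merge_experts(experts, candidate))
```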
Why are we recommending this paper?
Due to your interest in multimodal models

This paper explores optimizing multimodal data mixtures, a key area for fusion models. Given your interest in multimodal models and fusion, this work directly addresses a critical challenge in combining different data sources.
University of Toronto
AI Insights
  • Diffusion models: A type of probabilistic model that can generate new samples by iteratively refining an initial input. (ML: 0.93)
  • Generative adversarial networks (GANs): A class of deep learning algorithms used for generative modeling, which can be applied to medical image segmentation tasks. (ML: 0.91)
  • EfficientNet: A scalable neural network architecture designed for computer vision tasks, which has achieved state-of-the-art results in various applications. (ML: 0.88)
  • The use of diffusion models and GANs in medical image segmentation has shown promising results, but there is still a need for further research to improve the accuracy and robustness of these methods. (ML: 0.88)
  • The use of diffusion models in medical image segmentation has gained significant attention in recent years. (ML: 0.87)
  • U-Net: A popular architecture for biomedical image segmentation that uses a combination of convolutional and transposed convolutional layers. (ML: 0.82)
  • U-Net and its variants remain popular architectures for biomedical image segmentation tasks due to their simplicity and effectiveness. (ML: 0.80)
Abstract
Boundary detection of irregular and translucent objects is an important problem with applications in medical imaging, environmental monitoring and manufacturing, many of which are plagued by scarce labeled data and low in situ computational resources. While recent image segmentation studies focus on segmentation mask alignment with ground truth, the task of boundary detection remains understudied, especially in the low-data regime. In this work, we present a lightweight discrete diffusion contour refinement pipeline for robust boundary detection in the low-data regime. We use a Convolutional Neural Network (CNN) architecture with self-attention layers as the core of our pipeline, condition on a segmentation mask, and iteratively denoise a sparse contour representation. We introduce multiple novel adaptations for improved low-data efficacy and inference efficiency, including a simplified diffusion process, a customized model architecture, and minimal post-processing to produce a dense, isolated contour from a training set of fewer than 500 images. Our method outperforms several SOTA baselines on the medical imaging dataset KVASIR, is competitive on HAM10K and our custom wildfire dataset, Smoke, and improves inference framerate by 3.5x.
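The paper's pipeline uses a simplified discrete diffusion process on a sparse contour representation; as a rough structural analogue only, the sketch below shows a standard continuous DDPM-style reverse loop conditioned on a segmentation mask. The schedule values and the placeholder eps_model are assumptions, and the paper's discrete formulation differs.

```python
import torch

T = 50
betas = torch.linspace(1e-4, 0.02, T)          # illustrative noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def eps_model(x, mask, t):                     # placeholder denoiser standing in
    return torch.zeros_like(x)                 # for the CNN + self-attention net

@torch.no_grad()
def refine_contour(eps_model, mask, shape):
    x = torch.randn(shape)                     # start from pure noise
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps = eps_model(x, mask, t)            # mask-conditioned noise prediction
        coef = (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt()
        # Standard DDPM ancestral update with sigma_t = sqrt(beta_t).
        x = (x - coef * eps) / alphas[t].sqrt() + betas[t].sqrt() * z
    return x                                   # refined contour estimate

mask = torch.zeros(1, 1, 64, 64)               # segmentation mask (placeholder)
contour = refine_contour(eps_model, mask, shape=(1, 1, 64, 64))
```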
Why are we recommending this paper?
Due to your interest in Image Processing

Focusing on boundary detection with limited data aligns with your interest in image processing techniques. The use of diffusion models is particularly relevant to your interest in multimodal models.
ETH Zurich
AI Insights
  • PSNR: Peak Signal-to-Noise Ratio, a metric used to measure the quality of an image or video. (ML: 0.91)
  • SSIM: Structural Similarity Index Measure, a metric used to compare the similarity between two images or videos; both metrics are computed in the sketch after this list. (ML: 0.88)
  • Video fusion: The process of combining multiple input videos into a single output video that contains information from all input sources. (ML: 0.86)
  • The bidirectional state space model used in MambaVF enables the learning of temporal dependencies between frames and the fusion of complementary features into high-quality output videos. (ML: 0.79)
  • The paper also provides additional qualitative fusion comparison results to further demonstrate the performance of MambaVF. (ML: 0.76)
  • MambaVF achieves state-of-the-art performance on several benchmark datasets, outperforming existing methods in terms of PSNR and SSIM metrics. (ML: 0.74)
  • The paper presents a state space model for efficient video fusion, called MambaVF. (ML: 0.68)
  • MambaVF is designed to handle various video fusion tasks, including multi-exposure, multi-focus, infrared-visible, and medical video fusion. (ML: 0.58)
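Since the insights above lean on PSNR and SSIM, here is a quick sketch of how both are computed, assuming scikit-image is available; the toy images are random arrays for illustration.

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(ref, test, data_range=1.0):
    """Peak Signal-to-Noise Ratio in dB; higher means closer to the reference."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(data_range ** 2 / mse)

ref = np.random.rand(64, 64)                          # toy reference frame
test = np.clip(ref + 0.05 * np.random.randn(64, 64), 0, 1)  # noisy version
print("PSNR:", psnr(ref, test))
print("SSIM:", structural_similarity(ref, test, data_range=1.0))
```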
Abstract
Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods heavily rely on optical flow estimation and feature warping, resulting in severe computational overhead and limited scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF proposes a lightweight SSM-based fusion module that replaces conventional flow-guided alignment via a spatio-temporal bidirectional scanning mechanism. This module enables efficient information aggregation across frames. Extensive experiments across multiple benchmarks demonstrate that MambaVF achieves state-of-the-art performance in multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks. We highlight that MambaVF enjoys high efficiency, reducing parameters by up to 92.25% and computational FLOPs by 88.79% while achieving a 2.1x speedup over existing methods. Project page: https://mambavf.github.io
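As a minimal sketch of the flow-free temporal modeling described above, the snippet below runs a fixed-parameter linear state-space recurrence over per-frame features in both directions and averages the two passes. A real Mamba-style block uses input-dependent (selective) parameters and learned projections; everything here is an illustrative assumption.

```python
import numpy as np

def ssm_scan(x, A, B):                # x: (T, d) per-frame features
    h = np.zeros(A.shape[0])
    out = np.empty((x.shape[0], A.shape[0]))
    for t in range(x.shape[0]):       # linear in sequence length T
        h = A @ h + B @ x[t]          # state update h_t = A h_{t-1} + B x_t
        out[t] = h
    return out

def bidirectional_fuse(x, A, B):
    fwd = ssm_scan(x, A, B)
    bwd = ssm_scan(x[::-1], A, B)[::-1]
    return 0.5 * (fwd + bwd)          # aggregate context from both directions

T, d = 16, 8
x = np.random.randn(T, d)             # toy frame features
A = 0.9 * np.eye(d)                   # stable (spectral radius < 1) transition
B = np.eye(d)
fused = bidirectional_fuse(x, A, B)
```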
Why are we recommending this paper?
Due to your interest in fusion models

Given your interest in fusion models and convolution, this paper's focus on efficient video fusion using state space models is highly relevant. The work addresses the challenges of processing video data, a common application of convolutional techniques.
University of California, Berkeley
Paper visualization
AI Insights
  • The paper provides a theoretical framework for understanding diffusion models and proposes an experiment to validate it. (ML: 0.96)
  • The authors propose a minimal image experiment using MNIST images to validate their theoretical framework. (ML: 0.86)
  • The results of the experiment are expected to provide insights into the behavior of diffusion models. (ML: 0.83)
  • MNIST Synchronization Experiment: An experiment designed to validate the theoretical framework using MNIST images. (ML: 0.82)
  • The paper discusses a theoretical framework for understanding the behavior of diffusion models in terms of two distinct transitions: speciation and collapse. (ML: 0.81)
  • Synchronization gap: A period where the global structure is already decided while modality-specific discrepancies remain unstable. (ML: 0.81)
  • The experiment involves training an unconditional ε-prediction diffusion model on a two-channel state, with the goal of observing mode ordering during the reverse process. (ML: 0.80)
  • Speciation: The transition from regime I (high noise) to regime II (clustered structure). (ML: 0.76)
  • Collapse: The transition from regime II to regime III (condensation). (ML: 0.75)
Abstract
Diffusion-based generative models have achieved unprecedented fidelity in synthesizing high-dimensional data, yet the theoretical mechanisms governing multimodal generation remain poorly understood. Here, we present a theoretical framework for coupled diffusion models, using coupled Ornstein-Uhlenbeck processes as a tractable model. By using the nonequilibrium statistical physics of dynamical phase transitions, we demonstrate that multimodal generation is governed by a spectral hierarchy of interaction timescales rather than simultaneous resolution. A key prediction is the "synchronization gap", a temporal window during the reverse generative process where distinct eigenmodes stabilize at different rates, providing a theoretical explanation for common desynchronization artifacts. We derive analytical conditions for speciation and collapse times under both symmetric and anisotropic coupling regimes, establishing strict bounds for coupling strength to avoid unstable symmetry breaking. We show that the coupling strength acts as a spectral filter that enforces a tunable temporal hierarchy on generation. We support these predictions through controlled experiments with diffusion models trained on MNIST datasets and exact score samplers. These results motivate time-dependent coupling schedules that target mode-specific timescales, offering a potential alternative to ad hoc guidance tuning.
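A toy simulation can make the spectral-hierarchy claim tangible. For a symmetric two-mode coupled Ornstein-Uhlenbeck process with drift matrix [[1, -c], [-c, 1]], the eigenmodes x1+x2 and x1-x2 relax at rates 1-c and 1+c, so one mode stabilizes long before the other. The constants below are illustrative and not taken from the paper, and only the forward relaxation is simulated; the paper analyzes the reverse process, where the same rates set when each mode stabilizes.

```python
import numpy as np

rng = np.random.default_rng(0)
c, dt, steps = 0.8, 1e-3, 5000
Theta = np.array([[1.0, -c], [-c, 1.0]])   # symmetric coupling matrix

x = np.array([3.0, 1.0])                   # start away from equilibrium
traj = np.empty((steps, 2))
for i in range(steps):                     # Euler-Maruyama: dx = -Theta x dt + sqrt(2) dW
    x = x - (Theta @ x) * dt + np.sqrt(2 * dt) * rng.standard_normal(2)
    traj[i] = x

slow = traj[:, 0] + traj[:, 1]             # eigenmode relaxing at rate 1 - c
fast = traj[:, 0] - traj[:, 1]             # eigenmode relaxing at rate 1 + c
# The fast mode has long since equilibrated while the slow one is still drifting:
print("slow mode |mean|, last half:", abs(slow[steps // 2:].mean()))
print("fast mode |mean|, last half:", abs(fast[steps // 2:].mean()))
```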
Why are we recommending this paper?
Due to your interest in multimodal models

This paper investigates diffusion models, a rapidly growing area within multimodal generation. Understanding the theoretical foundations of these models is valuable given your interests in multimodal and fusion models.
Florida State University
AI Insights
  • The selection of informative classes using the core-set approach may not always result in optimal performance. (ML: 0.98)
  • Active learning: A machine learning paradigm that involves selecting a subset of the training data to be labeled by an oracle or human annotator. (ML: 0.96)
  • Exemplar-based method: A type of active learning approach that uses a set of exemplars, which are representative instances of each class, to adapt the model. (ML: 0.94)
  • The paper proposes a novel approach to class-incremental learning by combining active learning and exemplar-based methods. (ML: 0.94)
  • The proposed method requires a large number of exemplars to be stored in memory, which may not be feasible for large-scale datasets. (ML: 0.93)
  • Class-incremental learning: A type of transfer learning where the model is trained on multiple classes incrementally, with each new class being added to the existing knowledge. (ML: 0.93)
  • ACIL achieves state-of-the-art performance in class-incremental learning by effectively selecting informative classes and adapting the model using exemplars. (ML: 0.93)
  • The proposed method, called Active Class-Incremental Learning (ACIL), selects the most informative classes for training and uses exemplars from these classes to adapt the model. (ML: 0.93)
  • The proposed method has the potential to be applied to various real-world applications, such as image classification, object detection, and segmentation. (ML: 0.82)
  • ACIL is evaluated on several benchmark datasets, including CIFAR-10, CIFAR-100, and Tiny ImageNet, and outperforms state-of-the-art methods in terms of accuracy and efficiency. (ML: 0.81)
Abstract
Continual learning (or class incremental learning) is a realistic learning scenario for computer vision systems, where deep neural networks are trained on episodic data, and the data from previous episodes are generally inaccessible to the model. Existing research in this domain has primarily focused on avoiding catastrophic forgetting, which occurs due to the continuously changing class distributions in each episode and the inaccessibility of the data from previous episodes. However, these methods assume that all the training samples in every episode are annotated; this not only incurs a huge annotation cost, but also wastes annotation effort, since most of the samples in a given episode will not be accessible to the model in subsequent episodes. Active learning algorithms identify the salient and informative samples from large amounts of unlabeled data and are instrumental in reducing the human annotation effort needed to train a deep neural network. In this paper, we propose ACIL, a novel active learning framework for class incremental learning settings. We exploit a criterion based on uncertainty and diversity to identify the exemplar samples that need to be annotated in each episode and appended to the data in the next episode. Such a framework can drastically reduce annotation cost and can also avoid catastrophic forgetting. Our extensive empirical analyses on several vision datasets corroborate the promise and potential of our framework against relevant baselines.
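As a sketch of an uncertainty-plus-diversity exemplar criterion in the spirit of the abstract, the snippet below ranks unlabeled samples by predictive entropy and then greedily picks a diverse subset (k-center style) in feature space. The exact scoring and pooling choices are assumptions, not the authors' formulation.

```python
import numpy as np

def entropy(probs):                                   # probs: (N, C) softmax outputs
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def select_exemplars(feats, probs, budget, pool_factor=5):
    # Uncertainty: restrict attention to the highest-entropy candidates.
    pool = np.argsort(entropy(probs))[-budget * pool_factor:]
    chosen = [pool[0]]
    # Diversity: greedy k-center, always adding the candidate farthest
    # (in feature space) from everything already chosen.
    for _ in range(budget - 1):
        d = np.min(np.linalg.norm(feats[pool, None] - feats[chosen][None], axis=2), axis=1)
        chosen.append(pool[int(np.argmax(d))])
    return chosen

feats = np.random.randn(200, 16)                      # toy penultimate-layer features
probs = np.random.dirichlet(np.ones(5), size=200)     # toy predictive distributions
to_annotate = select_exemplars(feats, probs, budget=10)
```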
Why are we recommending this paper?
Due to your interest in Image Recognition

The paper's focus on active class-incremental learning directly relates to continual learning, a core challenge in image recognition. It addresses the need for models that can adapt to evolving data distributions.
Karlsruhe Institute of Technology
AI Insights
  • Treating depth as a prior rather than a fused feature.
  • Depth-induced heteroscedasticity creates systematic training bias toward nearby objects.
  • Depth-informed supervision exploits geometric priors without architectural modifications.
  • Components: Depth-Based Loss Weighting (DLW), Depth-Based Loss Stratification (DLS), and Depth-Aware Confidence Thresholding (DCT).
  • DepthPrior is a complete depth-aware detection system that maximizes performance through coordinated training and inference interventions.
  • Limitations: robustness to hyperparameters is not extensively explored, and there is no extensive comparison with other state-of-the-art methods. (ML: 0.84)
Abstract
Detecting small and distant objects remains challenging for object detectors due to scale variation, low resolution, and background clutter. Safety-critical applications require reliable detection of these objects for safe planning. Depth information can improve detection, but existing approaches require complex, model-specific architectural modifications. We provide a theoretical analysis followed by an empirical investigation of the depth-detection relationship. Together, they explain how depth causes systematic performance degradation and why depth-informed supervision mitigates it. We introduce DepthPrior, a framework that uses depth as prior knowledge rather than as a fused feature, providing comparable benefits without modifying detector architectures. DepthPrior consists of Depth-Based Loss Weighting (DLW) and Depth-Based Loss Stratification (DLS) during training, and Depth-Aware Confidence Thresholding (DCT) during inference. The only overhead is the initial cost of depth estimation. Experiments across four benchmarks (KITTI, MS COCO, VisDrone, SUN RGB-D) and two detectors (YOLOv11, EfficientDet) demonstrate the effectiveness of DepthPrior, achieving up to +9% mAP$_S$ and +7% mAR$_S$ for small objects, with inference recovery rates as high as 95:1 (true vs. false detections). DepthPrior offers these benefits without additional sensors, architectural changes, or performance costs. Code is available at https://github.com/mos-ks/DepthPrior.
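To illustrate the loss-weighting component in spirit, the sketch below up-weights the per-object loss as a function of estimated depth, so distant (typically small) objects contribute more to training. The weighting function, d_ref, and gamma are assumptions for illustration; the paper defines its own DLW, DLS, and DCT.

```python
import torch

def depth_weighted_loss(per_obj_loss, depths, d_ref=10.0, gamma=0.5):
    # Weight >= 1 that grows with distance, counteracting the bias toward
    # nearby objects; the functional form here is an illustrative assumption.
    w = (depths / d_ref).clamp(min=1.0) ** gamma
    return (w * per_obj_loss).mean()

per_obj_loss = torch.rand(6)                              # toy per-object losses
depths = torch.tensor([4.0, 8.0, 15.0, 30.0, 50.0, 80.0]) # estimated depths (m)
loss = depth_weighted_loss(per_obj_loss, depths)
```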
Why are we recommending this paper?
Due to your interest in Image Recognition
Technische Universität Dortmund
AI Insights
  • Graph signal: A function that assigns a value to each node in a graph. (ML: 0.95)
  • Filter: A function that takes a graph signal as input and produces another graph signal by applying a transformation. (ML: 0.94)
  • Differentiable classifiers have shown improved scalability and performance for graph classification tasks, especially when large amounts of data are available. (ML: 0.93)
  • Graph classification is a crucial task in machine learning, with applications in various domains such as social network analysis, bioinformatics, and computer vision. (ML: 0.93)
  • Graph convolution: An operation on two functions (a graph signal and a filter) that produces a third function by shifting one of the functions. (ML: 0.91)
  • Differentiable classifiers have emerged as a promising method for graph classification, offering improved scalability and performance compared to traditional methods like graph kernels. (ML: 0.90)
  • Permutation invariant function: A function that operates on multisets of node representations, making it equivalent to operating on graphs with different structural representations. (ML: 0.90)
  • The graph convolution operation is a key component of differentiable classifiers, enabling the extraction of structural information from graphs and interactions between node attributes. (ML: 0.90)
  • Further research is needed to improve the expressivity of differentiable classifiers and address challenges such as handling continuous node attributes and improving computational efficiency. (ML: 0.89)
Abstract
While message-passing neural networks (MPNNs) have shown promising results, their real-world impact remains limited. Although various limitations have been identified, their theoretical foundations remain poorly understood, leading to fragmented research efforts. In this thesis, we provide an in-depth theoretical analysis and identify several key properties limiting the performance of MPNNs. Building on these findings, we propose several frameworks that address these shortcomings. We identify two properties exhibited by many MPNNs: shared component amplification (SCA), where each message-passing iteration amplifies the same components across all feature channels, and component dominance (CD), where a single component gets increasingly amplified as more message-passing steps are applied. These properties lead to the observable phenomenon of rank collapse of node representations, which generalizes the established over-smoothing phenomenon. By generalizing and decomposing over-smoothing, we enable a deeper understanding of MPNNs, more targeted solutions, and more precise communication within the field. To avoid SCA, we show that utilizing multiple computational graphs or edge relations is necessary. Our multi-relational split (MRS) framework transforms any existing MPNN into one that leverages multiple edge relations. Additionally, we introduce the spectral graph convolution for multiple feature channels (MIMO-GC), which naturally uses multiple computational graphs. A localized variant, LMGC, approximates the MIMO-GC while inheriting its beneficial properties. To address CD, we demonstrate a close connection between MPNNs and the PageRank algorithm. Based on personalized PageRank, we propose a variant of MPNNs that allows for infinitely many message-passing iterations, while preserving initial node features. Collectively, these results deepen the theoretical understanding of MPNNs.
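The personalized-PageRank variant mentioned at the end can be sketched with the familiar propagation H <- (1 - alpha) * A_hat @ H + alpha * H0: the teleport term keeps the initial node features in every iterate, so many message-passing steps can be applied without a single dominant component taking over. The toy graph and alpha below are illustrative assumptions, not the thesis's exact construction.

```python
import numpy as np

def ppr_propagate(A_hat, H0, alpha=0.1, iters=50):
    """Personalized-PageRank-style propagation that preserves initial features H0."""
    H = H0.copy()
    for _ in range(iters):
        H = (1 - alpha) * A_hat @ H + alpha * H0   # teleport back to H0 each step
    return H

# Symmetrically normalized adjacency with self-loops for a 4-node toy graph.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float) + np.eye(4)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt

H = ppr_propagate(A_hat, np.random.randn(4, 3))    # 3 feature channels per node
```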
Why are we recommending this paper?
Due to your interest in convolution
National University of Defense Technology
AI Insights
  • Larger kernel sizes do not obviously improve detection performance for any type of convolution, while causing higher computational costs. (ML: 0.89)
  • Functions that map filter parameters into a zero-centered interval, such as tanh(·), are essential for DHiF. (ML: 0.73)
  • DHiF achieves superior detection performance compared to other state-of-the-art convolutions, including CDC, WTConv, and PConv. (ML: 0.69)
  • The optimal structure of the DHiF-Res block involves implementing DHiF first followed by standard convolution. (ML: 0.66)
  • The proposed Dynamic High-Frequency (DHiF) convolutional layer is designed to capture local high-frequency information in images. (ML: 0.65)
  • A zero-centered distribution is necessary and sufficient to make DHiF sensitive to high-frequency components. (ML: 0.59)
  • Abbreviations: DHiF (Dynamic High-Frequency convolutional layer); HFCs (High-Frequency Components); GELU (Gaussian Error Linear Unit); LeakyReLU (Leaky Rectified Linear Unit). (ML: 0.51)
Abstract
Infrared small targets are typically tiny and locally salient, which belong to high-frequency components (HFCs) in images. Single-frame infrared small target (SIRST) detection is challenging, since there are many HFCs along with targets, such as bright corners, broken clouds, and other clutters. Current learning-based methods rely on the powerful capabilities of deep networks, but neglect explicit modeling and discriminative representation learning of various HFCs, which is important to distinguish targets from other HFCs. To address the aforementioned issues, we propose a dynamic high-frequency convolution (DHiF) to translate the discriminative modeling process into the generation of a dynamic local filter bank. Especially, DHiF is sensitive to HFCs, owing to the dynamic parameters of its generated filters being symmetrically adjusted within a zero-centered range according to Fourier transformation properties. Combining with standard convolution operations, DHiF can adaptively and dynamically process different HFC regions and capture their distinctive grayscale variation characteristics for discriminative representation learning. DHiF functions as a drop-in replacement for standard convolution and can be used in arbitrary SIRST detection networks without significant decrease in computational efficiency. To validate the effectiveness of our DHiF, we conducted extensive experiments across different SIRST detection networks on real-scene datasets. Compared to other state-of-the-art convolution operations, DHiF exhibits superior detection performance with promising improvement. Codes are available at https://github.com/TinaLRJ/DHiF.
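The zero-centered property highlighted in the insights can be sketched directly: map raw filter parameters through tanh into a zero-centered interval and subtract the kernel mean, giving a zero-sum kernel that ignores the DC component and responds only to local grayscale variation. The dynamic, per-region filter generation of the actual DHiF layer is omitted; this static version is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def high_frequency_kernel(raw):
    """Map raw params to a zero-centered, zero-sum (high-pass) kernel."""
    k = torch.tanh(raw)        # squash into a zero-centered value range
    return k - k.mean()        # zero-sum kernel: constant regions give 0 response

raw = torch.randn(3, 3)        # raw (learnable) filter parameters
kern = high_frequency_kernel(raw).view(1, 1, 3, 3)
img = torch.rand(1, 1, 32, 32)
response = F.conv2d(img, kern, padding=1)  # flat areas ~0, edges/targets stand out
```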
Why are we recommending this paper?
Due to your interest in convolution
Hefei University of Technology
Paper visualization
Abstract
Image inpainting has seen substantial progress, owing to the encoder-decoder pipeline, which benefits from Convolutional Neural Networks (CNNs): convolutional downsampling in the encoder inpaints the masked regions semantically from the known regions, coupled with an upsampling process in the decoder for the final inpainting output. Recent studies intuitively identify the high-frequency structure and low-frequency texture extracted by CNNs in the encoder as the basis for a desirable upsampling recovery. However, existing methods inevitably overlook the information loss suffered by both structure and texture feature maps during convolutional downsampling, and hence produce non-ideal upsampling output. In this paper, we systematically answer whether and how the structure and texture feature maps can mutually help alleviate this information loss. Given the structure and texture feature maps, we adopt a statistical normalization and denormalization strategy to guide reconstruction during convolutional downsampling. Extensive experimental results validate our advantages over the state of the art on images from low to high resolutions, including 256*256 and 512*512, especially when all encoders are substituted with ours. Our code is available at https://github.com/htyjers/ConvInpaint-TSGL
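One way to read the normalization and denormalization strategy is as an AdaIN-style statistics swap: whiten one branch's feature map with its own channel statistics, then re-scale it with the statistics of the other branch so each guides the other's downsampled reconstruction. The sketch below renders that reading; it is an assumption, not the authors' exact formulation.

```python
import torch

def denormalize_with(source, guide, eps=1e-5):
    """Normalize `source` by its own channel stats, denormalize with `guide`'s."""
    # source, guide: (B, C, H, W) feature maps from the two branches
    mu_s = source.mean(dim=(2, 3), keepdim=True)
    sd_s = source.std(dim=(2, 3), keepdim=True)
    mu_g = guide.mean(dim=(2, 3), keepdim=True)
    sd_g = guide.std(dim=(2, 3), keepdim=True)
    return (source - mu_s) / (sd_s + eps) * sd_g + mu_g

structure = torch.randn(1, 8, 16, 16)   # toy structure feature map
texture = torch.randn(1, 8, 16, 16)     # toy texture feature map
guided = denormalize_with(structure, texture)
```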
Why are we recommending this paper?
Due to your interest in Image Processing
North American House Finch
Paper visualization
AI Insights
  • Adaptive evidence weighting can improve performance by selectively incorporating contextual evidence only when it is informative. (ML: 0.99)
  • This ensures that the model cannot be forced to rely on misleading contextual information. (ML: 0.98)
  • The assumption of conditional independence may not hold universally in practice, and the model may not be robust to violations of this assumption. (ML: 0.98)
  • Gated Fusion Mechanism: A decision-theoretically safe mechanism for combining audio-only and contextual evidence, which adapts to the reliability of each predictor based on input-dependent weights. (ML: 0.96)
  • Previous work has shown that combining audio-only and spatiotemporal predictors can improve performance for certain tasks. (ML: 0.96)
  • This approach preserves a safe fallback to the audio-only classifier. (ML: 0.92)
  • Conditional Independence: The assumption that audio features and spatiotemporal contexts are conditionally independent given the class label. (ML: 0.89)
  • However, existing approaches often rely on fixed-weight fusion or joint models, which can be brittle and require paired supervision. (ML: 0.89)
  • Imagine you're trying to identify a bird species based on its song and the location where it was recorded. (ML: 0.88)
  • The paper proposes a way to combine information from both sources in a more flexible and adaptive way, so that we can take advantage of the strengths of each source while minimizing their weaknesses. (ML: 0.88)
  • Log-Linear Fusion: A fusion rule that combines independently trained predictors by taking a weighted sum of their log probabilities. (ML: 0.83)
  • The proposed gated fusion mechanism is decision-theoretically safe in the sense of risk containment, admitting the audio-only classifier as a recoverable special case. (ML: 0.74)
Abstract
Many machine learning systems have access to multiple sources of evidence for the same prediction target, yet these sources often differ in reliability and informativeness across inputs. In bioacoustic classification, species identity may be inferred both from the acoustic signal and from spatiotemporal context such as location and season; while Bayesian inference motivates multiplicative evidence combination, in practice we typically only have access to discriminative predictors rather than calibrated generative models. We introduce Fusion under INdependent Conditional Hypotheses (FINCH), an adaptive log-linear evidence fusion framework that integrates a pre-trained audio classifier with a structured spatiotemporal predictor. FINCH learns a per-sample gating function that estimates the reliability of contextual information from uncertainty and informativeness statistics. The resulting fusion family contains the audio-only classifier as a special case and explicitly bounds the influence of contextual evidence, yielding a risk-contained hypothesis class with an interpretable audio-only fallback. Across benchmarks, FINCH consistently outperforms fixed-weight fusion and audio-only baselines, improving robustness and error trade-offs even when contextual information is weak in isolation. We achieve state-of-the-art performance on CBI and competitive or improved performance on several subsets of BirdSet using a lightweight, interpretable, evidence-based approach. Code is available at https://anonymous.4open.science/r/birdnoise-85CD/README.md
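A minimal sketch of the gated log-linear fusion family described above: combine class probabilities as log p_audio + g * log p_context with a per-sample gate g in [0, 1], so g = 0 exactly recovers the audio-only classifier (the safe fallback). The random inputs and fixed gate values are placeholders; FINCH learns the gate from uncertainty and informativeness statistics.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gated_fusion(p_audio, p_context, gate):
    """Log-linear fusion with a per-sample gate on the contextual evidence."""
    log_p = np.log(p_audio + 1e-12) + gate[:, None] * np.log(p_context + 1e-12)
    return softmax(log_p)                     # renormalize the fused evidence

p_audio = softmax(np.random.randn(4, 10))     # (batch, classes) audio predictor
p_ctx   = softmax(np.random.randn(4, 10))     # spatiotemporal predictor
gate    = np.array([0.0, 0.3, 0.7, 1.0])      # per-sample reliability weights
fused   = gated_fusion(p_audio, p_ctx, gate)

assert np.allclose(fused[0], p_audio[0])      # g = 0 -> audio-only fallback
```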
Why are we recommending this paper?
Due to your interest in fusion models