🎯 Top Personalized Recommendations
Why we think this paper is great for you:
This paper directly addresses the evaluation of Large Multimodal Models, which is a core area of your interest. It explores how these models can perform critique, enhancing their capabilities in visual and language tasks.
Abstract
The ability to critique is vital for models to self-improve and to serve as reliable AI assistants. While extensively studied in language-only settings, the multimodal critique ability of Large Multimodal Models (LMMs) remains underexplored despite their growing capabilities in tasks like captioning and visual reasoning. In this work, we introduce MM-CRITIC, a holistic benchmark for evaluating the critique ability of LMMs across multiple dimensions: basic, correction, and comparison. Covering 8 main task types and over 500 tasks, MM-CRITIC collects responses from various LMMs of different model sizes and comprises 4471 samples. To enhance evaluation reliability, we integrate expert-informed ground answers into scoring rubrics that guide GPT-4o in annotating responses and generating reference critiques, which serve as anchors for trustworthy judgments. Extensive experiments validate the effectiveness of MM-CRITIC and provide a comprehensive assessment of leading LMMs' critique capabilities across multiple dimensions. Further analysis reveals key insights, including the correlation between response quality and critique scores, and the varying difficulty of critique across evaluation dimensions. Our code is available at https://github.com/MichealZeng0420/MM-Critic.
AI Summary
- A clear scaling law is observed in LMM critique capabilities, where models within the same series consistently demonstrate improved performance with increased parameter size, validating the benchmark's robustness. [3]
- There is an inherent relationship between response quality and critique scores, with medium-quality responses being the most challenging to critique, often receiving the lowest scores. [3]
- A potential judgment bias exists where judge models (e.g., GPT-4.1) tend to assign higher scores to longer, more elaborate textual critiques, suggesting a correlation between critique length and perceived quality. [3]
- MM-CRITIC provides a holistic benchmark for LMM critique, evaluating across basic, correction, and comparison dimensions, encompassing 8 main task types and 4471 samples. [2]
- The benchmark enhances evaluation reliability by integrating expert-informed ground answers into scoring rubrics, guiding GPT-4o to generate reference critiques that serve as trustworthy judgment anchors. [2]
- Correction critique is generally more challenging for LMMs than basic critique, and comparative critique proves particularly difficult when distinguishing between medium and high-quality responses. [2]
- The use of reference critiques, anchored at a human-expert level (score 8), significantly improves the reliability of judge model assessments by providing a comparative baseline for evaluating LMM-generated critiques. [2]
- MM-CRITIC: A holistic benchmark designed to comprehensively and reliably measure the critique capability of Large Multimodal Models (LMMs) across multiple dimensions. [2]
- Basic Critique: An evaluation dimension assessing an LMM's ability to judge the correctness of a single response and provide textual feedback. [2]
- Correction Critique: An evaluation dimension assessing an LMM's ability to identify and correct errors in a given response, reflecting its self-improvement potential. [2]
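To make the reference-anchored judging described in the abstract and summary above more concrete, here is a minimal sketch of how a rubric prompt with an expert-level anchor critique might be assembled and scored. The prompt template, the fixed anchor score of 8, and the helper names are illustrative assumptions drawn from this digest, not the benchmark's actual implementation.

```python
import re

ANCHOR_SCORE = 8  # per the summary above, reference critiques are anchored at a human-expert level


def build_judge_prompt(task: str, response: str, reference_critique: str, candidate_critique: str) -> str:
    """Assemble a rubric-style prompt asking a judge model to score a candidate
    critique relative to an expert-level reference critique (hypothetical template)."""
    return (
        "You are evaluating the quality of a critique of a model response.\n"
        f"Task: {task}\n"
        f"Model response: {response}\n"
        f"Reference critique (quality anchor, score {ANCHOR_SCORE}/10): {reference_critique}\n"
        f"Candidate critique: {candidate_critique}\n"
        "Compare the candidate critique against the reference anchor and reply with a "
        "single line of the form 'Score: <1-10>'."
    )


def parse_score(judge_reply: str) -> int | None:
    """Extract the integer score from the judge model's reply, if present."""
    match = re.search(r"Score:\s*(\d+)", judge_reply)
    return int(match.group(1)) if match else None


# Usage sketch: the prompt would be sent to a judge model such as GPT-4o,
# and the returned text passed to parse_score() to obtain the critique score.
```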
Why we think this paper is great for you:
You will find this paper highly relevant as it focuses on advancing multimodal learning, leveraging diverse data streams including vision. It explores how to build more robust systems in complex, real-world scenarios.
Abstract
The rapid evolution of machine learning has propelled neural networks to unprecedented success across diverse domains. In particular, multimodal learning has emerged as a transformative paradigm, leveraging complementary information from heterogeneous data streams (e.g., text, vision, audio) to advance contextual reasoning and intelligent decision-making. Despite these advancements, current neural network-based models often fall short in open-world environments characterized by inherent unpredictability, where unpredictable environmental composition dynamics, incomplete modality inputs, and spurious distributional relations critically undermine system reliability. While humans naturally adapt to such dynamic, ambiguous scenarios, artificial intelligence systems exhibit stark limitations in robustness, particularly when processing multimodal signals under real-world complexity. This study investigates the fundamental challenge of multimodal learning robustness in open-world settings, aiming to bridge the gap between controlled experimental performance and practical deployment requirements.
Why we think this paper is great for you:
This paper directly investigates improving object recognition, a fundamental task you are interested in, by exploring artificial systems inspired by biological vision. It offers insights into enhancing recognition capabilities.
Abstract
Object recognition plays a fundamental role in how biological organisms perceive and interact with their environment. While the human visual system performs this task with remarkable efficiency, reproducing similar capabilities in artificial systems remains challenging. This study investigates VisNet, a biologically inspired neural network model, and several enhanced variants incorporating radial basis function neurons, Mahalanobis distance-based learning, and retina-like preprocessing for both general object recognition and symmetry classification. By leveraging principles of Hebbian learning and temporal continuity, associating temporally adjacent views to build invariant representations, VisNet and its extensions capture robust, transformation-invariant features. Experimental results across multiple datasets, including MNIST, CIFAR-10, and custom symmetric object sets, show that these enhanced VisNet variants substantially improve recognition accuracy compared with the baseline model. These findings underscore the adaptability and biological relevance of VisNet-inspired architectures, offering a powerful and interpretable framework for visual recognition in both neuroscience and artificial intelligence.
Keywords: VisNet, Object Recognition, Symmetry Detection, Hebbian Learning, RBF Neurons, Mahalanobis Distance, Biologically Inspired Models, Invariant Representations
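As a rough illustration of the trace-learning idea behind VisNet (Hebbian updates that associate temporally adjacent views), the snippet below implements the classic trace rule on toy data. The learning rate, trace decay, array shapes, and row-normalization step are assumptions for illustration rather than the paper's exact configuration.

```python
import numpy as np


def trace_rule_update(weights, inputs, eta=0.2, alpha=0.01):
    """One pass of a Hebbian trace-learning rule over a temporal sequence of inputs.

    weights: (n_outputs, n_inputs) synaptic weights
    inputs:  (timesteps, n_inputs) activations of the previous layer over time
    eta:     trace decay, mixing the current output with its running trace
    alpha:   Hebbian learning rate
    """
    trace = np.zeros(weights.shape[0])
    for x in inputs:
        y = weights @ x                                   # feedforward activation
        trace = (1 - eta) * y + eta * trace               # temporal trace of the output
        weights += alpha * np.outer(trace, x)             # Hebbian update driven by the trace
        weights /= np.linalg.norm(weights, axis=1, keepdims=True)  # keep weights bounded
    return weights


# Toy usage: 5 output neurons, 16-dimensional inputs, a sequence of 10 "views".
rng = np.random.default_rng(0)
w = rng.normal(size=(5, 16))
views = rng.normal(size=(10, 16))
w = trace_rule_update(w, views)
```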
Why we think this paper is great for you:
This paper is a great match because it delves into interpretable visual recognition, addressing how deep networks learn and how to make their decisions clearer. It aligns perfectly with your interest in understanding and improving image recognition systems.
Abstract
Deep neural networks typically learn spatially entangled representations that conflate discriminative foreground features with spurious background correlations, thereby undermining model interpretability and robustness. We propose a novel framework for understanding gradient-based attribution from an information-theoretic perspective. We prove that, under mild conditions, the Vector-Jacobian Products (VJPs) computed during backpropagation form minimal sufficient statistics of input features with respect to class labels. Motivated by this finding, we adopt an encoding-decoding perspective: forward propagation encodes inputs into class space, while the VJP in backpropagation decodes this encoding back to feature space. Building on this view, we propose the Spatial Information Bottleneck (S-IB) to spatially disentangle information flow. By maximizing mutual information between the foreground VJP and the inputs while minimizing mutual information in background regions, S-IB encourages networks to encode information only in class-relevant spatial regions. Since post-hoc explanation methods fundamentally derive from VJP computations, directly optimizing the VJP's spatial structure during training improves visualization quality across diverse explanation paradigms. Experiments on five benchmarks demonstrate universal improvements across six explanation methods, achieving better foreground concentration and background suppression without method-specific tuning, alongside consistent classification accuracy gains.
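To make the VJP idea concrete, below is a simplified PyTorch sketch that computes the input-space vector-Jacobian product for the labeled class and penalizes the fraction of its energy falling outside a foreground mask. The mask source and the simple energy-ratio penalty are illustrative stand-ins for the paper's mutual-information objectives, not the authors' actual S-IB loss.

```python
import torch


def sib_style_penalty(model, images, labels, fg_mask):
    """Penalize the fraction of input-space VJP energy outside the foreground mask
    (a simplified proxy for spatially disentangling attribution).

    images:  (B, C, H, W) input batch
    labels:  (B,) integer class indices
    fg_mask: (B, 1, H, W) binary foreground mask
    """
    images = images.clone().requires_grad_(True)
    logits = model(images)                                # forward pass: encode inputs into class space
    selected = logits.gather(1, labels[:, None]).sum()    # v^T f(x) with v the one-hot label vector
    vjp, = torch.autograd.grad(selected, images, create_graph=True)  # J^T v, the input-space VJP

    energy = vjp.pow(2).sum(dim=1, keepdim=True)          # per-pixel VJP energy
    bg_energy = (energy * (1 - fg_mask)).sum(dim=(1, 2, 3))
    total = energy.sum(dim=(1, 2, 3)) + 1e-8
    return (bg_energy / total).mean()                     # share of attribution mass in the background


# Training sketch (hypothetical weighting):
# loss = torch.nn.functional.cross_entropy(model(x), y) + lam * sib_style_penalty(model, x, y, mask)
```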
Why we think this paper is great for you:
This paper introduces a novel fusion technique within federated learning, directly aligning with your interest in fusion models. It addresses challenges in combining model partitions, which is crucial for distributed learning systems.
Abstract
Split Federated Learning is a system-efficient federated learning paradigm that leverages the rich computing resources at a central server to train model partitions. Data heterogeneity across silos, however, presents a major challenge that undermines the convergence speed and accuracy of the global model. This paper introduces Step-wise Momentum Fusion (SMoFi), an effective and lightweight framework that counteracts gradient divergence arising from data heterogeneity by synchronizing the momentum buffers across server-side optimizers. To control gradient divergence over the training process, we design a staleness-aware alignment mechanism that imposes constraints on the gradient updates of the server-side submodel at each optimization step. Extensive validation on multiple real-world datasets shows that SMoFi consistently improves global model accuracy (by up to 7.1%) and convergence speed (by up to 10.25×). Furthermore, SMoFi has a greater impact with more clients involved and with deeper models, making it particularly suitable for model training in resource-constrained contexts.
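The momentum-buffer synchronization described above can be pictured with the short sketch below. The uniform staleness-based weighting and the direct averaging of SGD momentum buffers are simplifying assumptions made for illustration; they are not the paper's exact SMoFi algorithm.

```python
import torch


def fuse_momentum_buffers(server_optimizers, staleness):
    """Average SGD momentum buffers across per-client server-side optimizers,
    down-weighting clients with staler gradients (illustrative sketch).

    server_optimizers: list of torch.optim.SGD instances (momentum > 0), one per client
                       submodel copy, all built over parameter lists of identical shapes
    staleness:         list of non-negative ints, steps since each client last synchronized
    """
    weights = torch.tensor([1.0 / (1 + s) for s in staleness])
    weights = weights / weights.sum()

    n_params = len(server_optimizers[0].param_groups[0]["params"])
    for idx in range(n_params):
        buffers = []
        for opt in server_optimizers:
            p = opt.param_groups[0]["params"][idx]
            state = opt.state[p]
            if "momentum_buffer" in state:
                buffers.append(state["momentum_buffer"])
        if len(buffers) != len(server_optimizers):
            continue  # skip until every optimizer has taken at least one step
        fused = sum(w * b for w, b in zip(weights, buffers))
        for buf in buffers:
            buf.copy_(fused)  # pull every client's momentum toward the fused buffer
```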
Why we think this paper is great for you:
This paper explores image manipulation and detection using CLIP-based models, touching upon multimodal alignment. It offers insights into how feature-space dynamics impact image processing tasks.
Abstract
The well-aligned attribute of CLIP-based models enables effective applications such as CLIPscore, a widely adopted image quality assessment metric. However, such a CLIP-based metric is vulnerable due to its delicate multimodal alignment. In this work, we propose FoCLIP, a feature-space misalignment framework for fooling CLIP-based image quality metrics. Based on stochastic gradient descent, FoCLIP integrates three key components to construct fooling examples: a feature-alignment core module that reduces the image-text modality gap, a score-distribution-balance module, and pixel-guard regularization, which collectively balance CLIPscore performance against image quality. The resulting images can be engineered to maximize CLIPscore predictions across diverse input prompts even though, from a human perceptual perspective, they are either visually unrecognizable or semantically incongruent with the corresponding adversarial prompts. Experiments on ten artistic masterpiece prompts and ImageNet subsets demonstrate that optimized images achieve significant improvement in CLIPscore while preserving high visual fidelity. In addition, we found that grayscale conversion induces significant feature degradation in fooling images, producing a noticeable CLIPscore reduction while preserving statistical consistency with the original images. Inspired by this phenomenon, we propose a color channel sensitivity-driven tampering detection mechanism that achieves 91% accuracy on standard benchmarks. In conclusion, this work establishes a practical pathway for feature misalignment in CLIP-based multimodal systems and a corresponding defense method.
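A stripped-down sketch of the kind of gradient-based fooling loop the abstract describes is given below. The tiny stand-in encoders, the loss weights, and the image size are placeholders so the example runs on its own; a real attack would plug in actual CLIP image and text encoders, and this is not the authors' FoCLIP implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in encoders (placeholders for real CLIP image/text encoders).
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
for p in image_encoder.parameters():
    p.requires_grad_(False)                    # only the image pixels are optimized
text_embedding = torch.randn(1, 64)            # pretend embedding of the target prompt

original = torch.rand(1, 3, 32, 32)            # the image we want to keep visually intact
image = original.clone().requires_grad_(True)
optimizer = torch.optim.SGD([image], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    img_feat = F.normalize(image_encoder(image), dim=-1)
    txt_feat = F.normalize(text_embedding, dim=-1)
    clip_score = (img_feat * txt_feat).sum()   # cosine similarity, the quantity to inflate
    pixel_guard = F.mse_loss(image, original)  # keeps the fooling image close to the original
    loss = -clip_score + 10.0 * pixel_guard    # maximize similarity while limiting visible change
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        image.clamp_(0.0, 1.0)                 # stay in the valid pixel range
```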
Why we think this paper is great for you:
This paper is relevant to your interest in image processing, specifically focusing on quality control for medical image segmentation. It addresses critical issues like hallucinations in deep learning models for anatomical analysis.
Abstract
Medical image segmentation using deep learning (DL) has enabled the development of automated analysis pipelines for large-scale population studies. However, state-of-the-art DL methods are prone to hallucinations, which can result in anatomically implausible segmentations. With manual correction impractical at scale, automated quality control (QC) techniques are needed to address this challenge. While promising, existing QC methods are organ-specific, limiting their generalizability and usability beyond their originally intended task. To overcome this limitation, we propose no-new Quality Control (nnQC), a robust QC framework based on a diffusion-generative paradigm that self-adapts to any input organ dataset. Central to nnQC is a novel Team of Experts (ToE) architecture, in which two specialized experts independently encode 3D spatial awareness, represented by the relative spatial position of an axial slice, and anatomical information derived from visual features of the original image. A weighted conditional module dynamically combines the pair of independent embeddings, or 'opinions', to condition the sampling mechanism within a diffusion process, enabling the generation of a spatially aware pseudo-ground truth for predicting QC scores. Within this framework, nnQC integrates fingerprint adaptation to ensure adaptability across organs, datasets, and imaging modalities. We evaluated nnQC on seven organs using twelve publicly available datasets. Our results demonstrate that nnQC consistently outperforms state-of-the-art methods across all experiments, including cases where segmentation masks are highly degraded or completely missing, confirming its versatility and effectiveness across different organs.
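The weighted combination of the two expert 'opinions' and a QC score against the pseudo-ground truth can be sketched roughly as below. The embedding sizes, the softmax gating, and the use of a Dice overlap as the QC score are assumptions made to illustrate the idea; they are not nnQC's exact modules.

```python
import torch
import torch.nn as nn


class WeightedOpinionFusion(nn.Module):
    """Combine a slice-position embedding and a visual-feature embedding into a single
    conditioning vector via learned softmax weights (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(2))  # one learnable weight per expert opinion

    def forward(self, position_emb: torch.Tensor, visual_emb: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.logits, dim=0)
        return w[0] * position_emb + w[1] * visual_emb  # conditioning vector for the diffusion sampler


def dice_qc_score(pred_mask: torch.Tensor, pseudo_gt: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Dice overlap between the segmentation under review and the generated pseudo-ground truth,
    used here as a stand-in QC score."""
    intersection = (pred_mask * pseudo_gt).sum()
    return (2 * intersection + eps) / (pred_mask.sum() + pseudo_gt.sum() + eps)


# Usage sketch: condition = WeightedOpinionFusion()(pos_emb, vis_emb) would feed the diffusion
# model generating the pseudo-ground truth; dice_qc_score(mask, pseudo_gt) then yields the QC score.
```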