🎯 Top Personalized Recommendations
MBZUAI
Why we think this paper is great for you:
This paper explores advancements in large multimodal models, directly aligning with your interest in multimodal architectures and their perception capabilities. You will find its focus on self-evolving LMMs particularly insightful.
Abstract
Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains up to ~3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.
AI Summary
- The framework introduces a novel continuous self-consistency reward mechanism for the Solver, which provides dense, bounded gradients based on multi-sample answer agreement, overcoming the instability and sparsity of discrete majority-vote rewards in multimodal settings (see the sketch after this summary). [3]
- EvoLMM achieves consistent absolute gains of 2-3% on challenging multimodal math and scientific reasoning benchmarks (e.g., ChartQA, MathVista) over strong baselines like Qwen2.5-VL-7B, using only raw training images. [3]
- Parameter-efficient fine-tuning methods like LoRA are crucial for stable self-evolution in EvoLMM, preserving pretrained multimodal grounding while allowing effective Proposer-Solver co-adaptation, unlike full fine-tuning which can lead to instability and performance degradation. [3]
- The self-evolving mechanism scales effectively with model size, with larger LMMs exhibiting stronger absolute gains, indicating that increased capacity allows for more refined internal reasoning when guided by continuous agreement feedback. [3]
- EvoLMM enables large multimodal models (LMMs) to self-improve their visual reasoning capabilities in a purely unsupervised manner, eliminating reliance on human-annotated data or external reward models. [2]
- The approach demonstrates strong transferability and architecture-agnosticism, yielding similar performance improvements across diverse LMM backbones (e.g., InternVL3, Gemma, Llama-3.2) without architectural modifications or specific dataset alignments. [2]
- EvoLMM: A self-evolving framework for Large Multimodal Models (LMMs) that enables unsupervised improvement of visual reasoning abilities by instantiating cooperative Proposer and Solver agents from a single backbone model, learning through continuous self-consistency rewards. [2]
- Proposer (πϕ(q|x)): An agent within EvoLMM responsible for generating diverse, visually grounded questions (q) from an input image (x). [2]
- An entropy-guided continuous Proposer reward dynamically encourages the generation of moderately difficult questions, creating an implicit curriculum that adapts to the Solver's evolving capabilities and prevents the Proposer from generating trivial or unsolvable tasks. [1]
- Solver (πθ(y|x, q)): An agent within EvoLMM that attempts to answer the questions (q) generated by the Proposer for a given image (x). [1]
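To make the agreement-based reward idea concrete, here is a minimal sketch of a dense, bounded reward computed from multiple sampled Solver answers. The function name and the string-matching notion of "agreement" are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of a continuous self-consistency reward, assuming the Solver
# returns K sampled answers per Proposer question (names are illustrative, not
# EvoLMM's actual code).
from collections import Counter

def self_consistency_reward(answers: list[str]) -> float:
    """Dense, bounded reward in [0, 1] based on multi-sample answer agreement.

    Unlike a discrete majority-vote reward (1 if a strict majority exists,
    else 0), the empirical frequency of the modal answer still gives a useful
    learning signal under partial agreement.
    """
    if not answers:
        return 0.0
    counts = Counter(a.strip().lower() for a in answers)
    modal_count = max(counts.values())
    return modal_count / len(answers)

# Example: 3 of 5 sampled answers agree -> reward 0.6 instead of a sparse 0/1.
print(self_consistency_reward(["12", "12", "12", "11", "13"]))  # 0.6
```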
Tencent Hunyuan
Why we think this paper is great for you:
This work on 3D multimodal large language models is a strong match, offering insights into unifying diverse 3D tasks with multimodal approaches. It addresses both multimodal concepts and 3D image understanding.
Abstract
We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing high-quality, structured plans, enabling state-of-the-art performance in grounded Q&A, compositional generation, and localized editing through one unified interface. Project page: https://chunshi.wang/Part-X-MLLM/
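To illustrate what a "single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands" could look like once decoded, here is a toy grammar and parser. The grammar, field names, and example plan are invented for illustration and are not Part-X-MLLM's actual output format:

```python
# Toy structured-plan grammar: "PART label x0 y0 z0 x1 y1 z1" rows for part
# boxes and 'EDIT op target "prompt"' rows for edit commands. Purely
# illustrative of the interface style described in the abstract.
from dataclasses import dataclass

@dataclass
class PartBox:
    label: str                 # semantic description of the part
    bbox: tuple[float, ...]    # axis-aligned 3D box (x0, y0, z0, x1, y1, z1)

@dataclass
class EditCommand:
    op: str                    # e.g. "replace", "remove", "add"
    target: str                # label of the part the edit applies to
    prompt: str                # natural-language description of the edit

def parse_plan(plan: str) -> tuple[list[PartBox], list[EditCommand]]:
    parts, edits = [], []
    for line in plan.strip().splitlines():
        kind, rest = line.split(maxsplit=1)
        if kind == "PART":
            label, *coords = rest.split()
            parts.append(PartBox(label, tuple(float(c) for c in coords)))
        elif kind == "EDIT":
            op, target, prompt = rest.split(maxsplit=2)
            edits.append(EditCommand(op, target, prompt.strip('"')))
    return parts, edits

plan = """PART seat 0.0 0.0 0.4 0.5 0.5 0.45
PART leg_front_left 0.0 0.0 0.0 0.05 0.05 0.4
EDIT replace seat "a woven rattan seat"
"""
print(parse_plan(plan))
```

A downstream geometry engine would consume the parsed boxes and edit commands, which is the decoupling of symbolic planning from geometric synthesis that the abstract emphasizes.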
University of Glasgow
Why we think this paper is great for you:
This paper directly addresses data fusion, a core area of your interest, by proposing a framework for integrating different types of spatio-temporal data. Its methodological approach to fusion will be highly relevant.
Abstract
We propose a spatio-temporal data-fusion framework for point data and gridded data with variables observed on different spatial supports. A latent Gaussian field with a Matérn-SPDE prior provides a continuous space representation, while source-specific observation operators map observations to both point measurements and gridded averages, addressing change-of-support and covariate misalignment. Additionally incorporating temporal dependence enables prediction at unknown locations and time points. Inference and prediction are performed using the Integrated Nested Laplace Approximation and the Stochastic Partial Differential Equations approach, which delivers fast computation with uncertainty quantification. Our contributions are: a hierarchical model that jointly fuses multiple data sources of the same variable under different spatial and temporal resolutions and measurement errors, and a practical implementation that incorporates misaligned covariates via the same data fusion framework allowing differing covariate supports. We demonstrate the utility of this framework via simulations calibrated to realistic sensor densities and spatial coverage. Using the simulation framework, we explore the stability and performance of the approach with respect to the number of time points and data/covariate availability, demonstrating gains over single-source models through point and gridded data fusion. We apply our framework to soil moisture mapping in the Elliot Water catchment (Angus, Scotland). We fuse in-situ sensor data with aligned and misaligned covariates, satellite data and elevation data to produce daily high resolution maps with uncertainty.
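A compact numpy sketch of the change-of-support idea: one latent field, with one observation-operator row per point sensor and one averaging row per grid cell. In the paper this is handled through the SPDE/INLA projector machinery; the nearest-node projection and uniform cell averaging below are simplifying assumptions:

```python
# One latent field, two observation operators: point measurements map via a
# (nearest-node) projection row, gridded averages map via a row that averages
# the mesh nodes falling inside a cell. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
nodes = rng.uniform(0, 10, size=(200, 2))     # mesh node locations
field = rng.normal(size=200)                  # latent field values at the nodes

def point_row(location, nodes):
    """Observation row for a point measurement (nearest-node projection)."""
    row = np.zeros(len(nodes))
    row[np.argmin(np.linalg.norm(nodes - location, axis=1))] = 1.0
    return row

def grid_row(cell_min, cell_max, nodes):
    """Observation row for a gridded average over one cell."""
    inside = np.all((nodes >= cell_min) & (nodes <= cell_max), axis=1)
    return np.where(inside, 1.0 / max(inside.sum(), 1), 0.0)

A = np.vstack([
    point_row(np.array([2.5, 7.1]), nodes),                  # in-situ sensor
    grid_row(np.array([0, 0]), np.array([5, 5]), nodes),     # satellite pixel
])
y = A @ field + rng.normal(scale=[0.05, 0.2])                # source-specific noise
print(y)
```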
ETH Zurich
Why we think this paper is great for you:
You will find this paper highly relevant as it presents a robust method for multi-fidelity data fusion, focusing on integrating diverse data sources. Its practical application of fusion techniques is a strong point.
Abstract
We propose a robust multi-fidelity Gaussian process for integrating sparse, high-quality reference monitors with dense but noisy citizen-science sensors. The approach replaces the Gaussian log-likelihood in the high-fidelity channel with a global Huber loss applied to precision-weighted residuals, yielding bounded influence on all parameters, including the cross-fidelity coupling, while retaining the flexibility of co-kriging. We establish attenuation and unbounded influence of the Gaussian maximum likelihood estimator under low-fidelity contamination and derive explicit finite bounds for the proposed estimator that clarify how whitening and mean-shift sensitivity determine robustness. Monte Carlo experiments with controlled contamination show that the robust estimator maintains stable MAE and RMSE as anomaly magnitude and frequency increase, whereas the Gaussian MLE deteriorates rapidly. In an empirical study of PM2.5 concentrations in Hamburg, combining UBA monitors with openSenseMap data, the method consistently improves cross-validated predictive accuracy and yields coherent uncertainty maps without relying on auxiliary covariates. The framework remains computationally scalable through diagonal or low-rank whitening and is fully reproducible with publicly available code.
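The core robustification can be sketched in a few lines: whiten the residuals with the model covariance, then score them with a Huber loss so that outlying observations have bounded influence. The Cholesky whitening and the threshold value below are illustrative choices, not the authors' exact estimator:

```python
# Huber loss on precision-weighted (whitened) residuals: quadratic near zero,
# linear in the tails, so a single contaminated observation cannot dominate
# the fit the way it does under a Gaussian log-likelihood. Sketch only.
import numpy as np

def huber(r: np.ndarray, delta: float = 1.345) -> np.ndarray:
    """Huber loss: quadratic for |r| <= delta, linear beyond (bounded influence)."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def robust_objective(y, mean, cov, delta=1.345):
    """Sum of Huber losses on whitened residuals L^{-1}(y - mean)."""
    L = np.linalg.cholesky(cov)
    white = np.linalg.solve(L, y - mean)     # residuals with unit variance
    return huber(white, delta).sum()

# Example: the third observation is grossly contaminated; its contribution
# grows only linearly here, versus quadratically under squared error.
y = np.array([1.0, 1.1, 8.0])
mu = np.ones(3)
cov = 0.2 * np.eye(3)
print(robust_objective(y, mu, cov))
```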
METU
Why we think this paper is great for you:
This paper on unsupervised image classification and clustering is highly pertinent, offering methods for grouping unlabeled images into meaningful categories. It directly contributes to your interest in image recognition.
Abstract
Unsupervised image classification, or image clustering, aims to group unlabeled images into semantically meaningful categories. Early methods integrated representation learning and clustering within an iterative framework. However, the rise of foundational models has recently shifted focus solely to clustering, bypassing the representation learning step. In this work, we build upon a recent multi-head clustering approach by introducing adaptive nearest neighbor selection and cluster ensembling strategies to improve clustering performance. Our method, "Image Clustering through Cluster Ensembles" (ICCE), begins with a clustering stage, where we train multiple clustering heads on a frozen backbone, producing diverse image clusterings. We then employ a cluster ensembling technique to consolidate these potentially conflicting results into a unified consensus clustering. Finally, we train an image classifier using the consensus clustering result as pseudo-labels. ICCE achieves state-of-the-art performance on ten image classification benchmarks, achieving 99.3% accuracy on CIFAR10, 89% on CIFAR100, and 70.4% on ImageNet datasets, narrowing the performance gap with supervised methods. To the best of our knowledge, ICCE is the first fully unsupervised image classification method to exceed 70% accuracy on ImageNet.
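As a rough picture of how multiple clustering heads can be consolidated into pseudo-labels, here is a generic consensus-clustering sketch (Hungarian alignment of each head to a reference head, then a per-image majority vote). ICCE's actual ensembling strategy may differ; the function names and toy labelings are assumptions:

```python
# Generic consensus clustering over several heads that share the same images
# but use arbitrary cluster IDs: align label spaces, then majority-vote.
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_to_reference(labels, reference, n_clusters):
    """Relabel `labels` so its clusters best overlap the reference clustering."""
    overlap = np.zeros((n_clusters, n_clusters), dtype=int)
    for a, b in zip(labels, reference):
        overlap[a, b] += 1
    rows, cols = linear_sum_assignment(-overlap)   # maximize total overlap
    mapping = dict(zip(rows, cols))
    return np.array([mapping[a] for a in labels])

def consensus(heads: list[np.ndarray], n_clusters: int) -> np.ndarray:
    """Majority vote over aligned clustering heads -> consensus pseudo-labels."""
    ref = heads[0]
    aligned = np.stack([align_to_reference(h, ref, n_clusters) for h in heads])
    votes = np.apply_along_axis(np.bincount, 0, aligned, minlength=n_clusters)
    return votes.argmax(axis=0)

heads = [np.array([0, 0, 1, 1, 2]),
         np.array([2, 2, 0, 0, 1]),
         np.array([0, 0, 1, 2, 2])]
print(consensus(heads, n_clusters=3))   # -> [0 0 1 1 2]
```

The consensus labels would then serve as pseudo-labels for training the final image classifier, as the abstract describes.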
Shanghai JiaoTong University
Why we think this paper is great for you:
This paper presents a unified framework for image manipulation detection and localization, aligning well with your interest in image analysis and recognition tasks. It addresses critical challenges in discerning manipulated content.
Abstract
With the rapid advancement of generative models, powerful image editing methods now enable diverse and highly realistic image manipulations that far surpass traditional deepfake techniques, posing new challenges for manipulation detection. Existing image manipulation detection and localization (IMDL) benchmarks suffer from limited content diversity, narrow generative-model coverage, and insufficient interpretability, which hinders the generalization and explanation capabilities of current manipulation detection methods. To address these limitations, we introduce ManipBench, a large-scale benchmark for image manipulation detection and localization focusing on AI-edited images. ManipBench contains over 450K manipulated images produced by 25 state-of-the-art image editing models across 12 manipulation categories, among which 100K images are further annotated with bounding boxes, judgment cues, and textual explanations to support interpretable detection. Building upon ManipBench, we propose ManipShield, an all-in-one model based on a Multimodal Large Language Model (MLLM) that leverages contrastive LoRA fine-tuning and task-specific decoders to achieve unified image manipulation detection, localization, and explanation. Extensive experiments on ManipBench and several public datasets demonstrate that ManipShield achieves state-of-the-art performance and exhibits strong generality to unseen manipulation models. Both ManipBench and ManipShield will be released upon publication.
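Since the benchmark has not yet been released, the record layout below is purely hypothetical; it only illustrates the kind of annotation the abstract describes (bounding boxes, judgment cues, textual explanations). All field names are assumptions:

```python
# Hypothetical annotation record for a ManipBench-style sample; the released
# schema may differ in names, structure, and coordinate convention.
from dataclasses import dataclass, field

@dataclass
class ManipBenchRecord:
    image_path: str                       # path to the AI-edited image
    editing_model: str                    # which of the 25 editing models produced it
    category: str                         # one of the 12 manipulation categories
    bboxes: list[tuple[int, int, int, int]] = field(default_factory=list)  # edited regions (x, y, w, h)
    judgment_cues: list[str] = field(default_factory=list)   # visual cues supporting the verdict
    explanation: str = ""                 # free-text rationale for interpretable detection

sample = ManipBenchRecord(
    image_path="images/000123.png",
    editing_model="some-diffusion-editor",
    category="object_removal",
    bboxes=[(120, 64, 96, 80)],
    judgment_cues=["inconsistent shadow direction"],
    explanation="The shadow of the removed object remains on the pavement.",
)
print(sample.category, sample.bboxes)
```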
NUIST
Why we think this paper is great for you:
This paper investigates AI-generated image detection, a key aspect of image recognition, and explores how noise can be leveraged in this domain. It offers unique insights into the robustness of detection methods.
Abstract
The rapid advancement of generative models has made real and synthetic images increasingly indistinguishable. Although extensive efforts have been devoted to detecting AI-generated images, out-of-distribution generalization remains a persistent challenge. We trace this weakness to spurious shortcuts exploited during training and we also observe that small feature-space perturbations can mitigate shortcut dominance. To address this problem in a more controllable manner, we propose the Positive-Incentive Noise for CLIP (PiN-CLIP), which jointly trains a noise generator and a detection network under a variational positive-incentive principle. Specifically, we construct positive-incentive noise in the feature space via cross-attention fusion of visual and categorical semantic features. During optimization, the noise is injected into the feature space to fine-tune the visual encoder, suppressing shortcut-sensitive directions while amplifying stable forensic cues, thereby enabling the extraction of more robust and generalized artifact representations. Comparative experiments are conducted on an open-world dataset comprising synthetic images generated by 42 distinct generative models. Our method achieves new state-of-the-art performance, with notable improvements of 5.4 in average accuracy over existing approaches.
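A minimal PyTorch sketch of the stated idea: feature-space noise produced by cross-attention fusion of visual and categorical semantic features, injected before detection. Dimensions, module names, and the two-class text prompts are assumptions, not the authors' architecture:

```python
# Sketch of feature-space positive-incentive noise: a generator cross-attends
# visual features to category-text features and the resulting noise is added
# to the visual features before the real/fake head. Illustrative only.
import torch
import torch.nn as nn

class NoiseGenerator(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_noise = nn.Linear(dim, dim)

    def forward(self, visual: torch.Tensor, category: torch.Tensor) -> torch.Tensor:
        # visual: (B, 1, D) image features; category: (B, K, D) class-text features
        fused, _ = self.cross_attn(query=visual, key=category, value=category)
        return self.to_noise(fused)          # noise lives in the same feature space

class Detector(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.noise_gen = NoiseGenerator(dim)
        self.head = nn.Linear(dim, 2)        # real vs. AI-generated

    def forward(self, visual, category):
        noisy = visual + self.noise_gen(visual, category)   # inject noise in feature space
        return self.head(noisy.squeeze(1))

vis = torch.randn(4, 1, 512)        # e.g. CLIP-style image embeddings
txt = torch.randn(4, 2, 512)        # e.g. "real photo" / "AI-generated image" text embeddings
print(Detector()(vis, txt).shape)   # torch.Size([4, 2])
```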