Papers from 29 September to 3 October, 2025

Here are the personalized paper recommendations, sorted by relevance.
Image Recognition
Xi'an Jiaotong University
Abstract
Automated textual description of remote sensing images is crucial for unlocking their full potential in diverse applications, from environmental monitoring to urban planning and disaster management. However, existing studies in remote sensing image captioning primarily focus on the image level, lacking object-level fine-grained interpretation, which prevents the full utilization and transformation of the rich semantic and structural information contained in remote sensing images. To address this limitation, we propose Geo-DLC, a novel task of object-level fine-grained image captioning for remote sensing. To support this task, we construct DE-Dataset, a large-scale dataset containing 25 categories and 261,806 annotated instances with detailed descriptions of object attributes, relationships, and contexts. Furthermore, we introduce DE-Benchmark, an LLM-assisted, question-answering-based evaluation suite designed to systematically measure model capabilities on the Geo-DLC task. We also present DescribeEarth, a Multi-modal Large Language Model (MLLM) architecture explicitly designed for Geo-DLC, which integrates a scale-adaptive focal strategy and a domain-guided fusion module leveraging remote sensing vision-language model features to encode high-resolution details and remote sensing category priors while maintaining global context. Our DescribeEarth model consistently outperforms state-of-the-art general MLLMs on DE-Benchmark, demonstrating superior factual accuracy, descriptive richness, and grammatical soundness, particularly in capturing intrinsic object features and surrounding environmental attributes across simple, complex, and even out-of-distribution remote sensing scenarios. All data, code and weights are released at https://github.com/earth-insights/DescribeEarth.
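The abstract does not spell out the focal strategy or the fusion module, but the general pattern, encoding a scale-adaptive crop around the target object alongside the full scene and fusing the two token streams before the language model, can be sketched as below. The margin rule, encoders, and attention-based fusion here are illustrative assumptions, not the released DescribeEarth code (see the repository linked above for the actual implementation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalFusionSketch(nn.Module):
    """Illustrative sketch: encode a scale-adaptive object crop plus the full
    scene, then fuse both token sets for a downstream language model."""
    def __init__(self, vision_encoder: nn.Module, dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder  # any ViT-style encoder returning (B, N, dim)
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    @staticmethod
    def focal_crop(image, box, margin=0.5, out_size=336):
        """Crop around the object box with a context margin proportional to the
        object's size (the 'scale-adaptive' assumption in this sketch)."""
        x1, y1, x2, y2 = box
        w, h = x2 - x1, y2 - y1
        x1, y1 = max(0, int(x1 - margin * w)), max(0, int(y1 - margin * h))
        x2, y2 = int(x2 + margin * w), int(y2 + margin * h)
        crop = image[..., y1:y2, x1:x2]
        return F.interpolate(crop, size=(out_size, out_size), mode="bilinear", align_corners=False)

    def forward(self, image, box):
        global_tokens = self.vision_encoder(image)                        # (B, Ng, dim)
        local_tokens = self.vision_encoder(self.focal_crop(image, box))   # (B, Nl, dim)
        fused, _ = self.fuse(local_tokens, global_tokens, global_tokens)  # local queries global context
        return torch.cat([global_tokens, fused], dim=1)  # visual tokens handed to the MLLM
```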
Institute of Mathematical
Abstract
The problem of classification in machine learning has often been approached in terms of function approximation. In this paper, we propose an alternative approach for classification in arbitrary compact metric spaces which, in theory, yields both the number of classes and a perfect classification using a minimal number of queried labels. Our approach uses localized trigonometric polynomial kernels initially developed for the point source signal separation problem in signal processing. Rather than point sources, we argue that the various classes come from different probability distributions. The localized kernel technique developed for separating point sources is then shown to separate the supports of these distributions. This is done in a hierarchical manner in our MASC algorithm to accommodate touching or overlapping class boundaries. We illustrate our theory on several simulated and real-life datasets, including the Salinas and Indian Pines hyperspectral datasets and a document dataset.
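The abstract leaves the kernel itself abstract; a minimal NumPy sketch of one localized trigonometric kernel of this flavor, a low-pass-filtered cosine sum evaluated at pairwise distances and used to vote from a handful of queried labels, is given below. The filter h, the distance scaling, and the voting rule are assumptions for illustration, and the hierarchical MASC procedure is not reproduced here.

```python
import numpy as np

def lowpass(t):
    """Smooth filter: 1 on [0, 1/2], cosine taper down to 0 at 1 (an assumed choice)."""
    t = np.asarray(t, dtype=float)
    taper = 0.5 * (1.0 + np.cos(2.0 * np.pi * (t - 0.5)))
    return np.where(t <= 0.5, 1.0, np.where(t >= 1.0, 0.0, taper))

def localized_kernel(dist, n):
    """K_n(d) = sum_{k<=n} h(k/n) cos(k d); the mass concentrates near d = 0 as n grows."""
    K = np.zeros_like(dist)
    for k in range(n + 1):
        K += float(lowpass(k / n)) * np.cos(k * dist)
    return K

def kernel_classify(X, X_labeled, y_labeled, n=32):
    """Assign each point the class whose few queried labels carry the largest
    localized-kernel weight (a crude stand-in for the full MASC hierarchy)."""
    d = np.linalg.norm(X[:, None, :] - X_labeled[None, :, :], axis=-1)
    d = np.pi * d / d.max()                     # rescale distances into [0, pi]
    K = localized_kernel(d, n)                  # (num_points, num_labeled)
    classes = np.unique(y_labeled)
    scores = np.stack([K[:, y_labeled == c].sum(axis=1) for c in classes], axis=1)
    return classes[scores.argmax(axis=1)]
```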
multimodal models
Utrecht University
Abstract
State-of-the-art vision-and-language models consist of many parameters and learn from enormous datasets, surpassing the amounts of linguistic data that children are exposed to as they acquire a language. This paper presents our approach to the multimodal track of the BabyLM challenge addressing this discrepancy. We develop language-only and multimodal models in low-resource settings using developmentally plausible datasets, with our multimodal models outperforming previous BabyLM baselines. One finding in the multimodal language model literature is that these models tend to underperform in language-only tasks. Therefore, we focus on maintaining language-only abilities in multimodal models. To this end, we experiment with model merging, where we fuse the parameters of multimodal models with those of language-only models using weighted linear interpolation. Our results corroborate the findings that multimodal models underperform in language-only benchmarks that focus on grammar, and model merging with text-only models can help alleviate this problem to some extent, while maintaining multimodal performance.
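Model merging by weighted linear interpolation is simple to state precisely; a minimal sketch, assuming the multimodal and text-only models share an architecture and parameter names (the exact checkpoints and interpolation weight used in the paper are not given in the abstract), looks like this:

```python
import torch

def merge_state_dicts(multimodal_sd, text_only_sd, alpha=0.5):
    """Weighted linear interpolation of all parameters present in both models.
    alpha = 1.0 keeps the multimodal weights, alpha = 0.0 the text-only weights;
    parameters found in only one model (e.g. a vision tower) are left untouched."""
    merged = dict(multimodal_sd)
    for name, w_mm in multimodal_sd.items():
        w_txt = text_only_sd.get(name)
        if w_txt is not None and w_txt.shape == w_mm.shape:
            merged[name] = alpha * w_mm + (1.0 - alpha) * w_txt
    return merged

# Hypothetical usage:
# merged = merge_state_dicts(multimodal_model.state_dict(), text_model.state_dict(), alpha=0.6)
# multimodal_model.load_state_dict(merged)
```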
Renmin University of China
Abstract
Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of and interaction with the physical world, but often exhibit a shallow and incoherent integration when acquiring information (Perception) and conducting reasoning (Cognition). This disconnect leads to a spectrum of reasoning failures, with hallucination being the most prominent. Collectively, these issues expose a fundamental challenge: the ability to process pixels does not yet confer the ability to construct a coherent, credible internal world model. To systematically dissect and address this challenge, this survey introduces a novel and unified analytical framework: "From Perception to Cognition." We deconstruct the complex process of vision-language interactive understanding into two interdependent layers: Perception, the foundational ability to accurately extract visual information and achieve fine-grained alignment with textual instructions; and Cognition, the higher-order capability for proactive, multi-step, goal-oriented reasoning built upon this perceptual foundation, the core of which is the formation of a dynamic observe-think-verify reasoning loop. Guided by this framework, this paper systematically analyzes the key bottlenecks of current MLLMs at both layers. It surveys the landscape of cutting-edge methods designed to address these challenges, spanning from techniques that enhance low-level visual representations to those that improve high-level reasoning paradigms. Furthermore, we review critical benchmarks and delineate future research directions. This survey aims to provide the research community with a clear, structured perspective for understanding the intrinsic limitations of current MLLMs and to illuminate the path toward building next-generation models capable of deep reasoning and a genuine understanding of the world.
convolution
Jiseong Kim
Abstract
In this paper, we study averages of shifted sums for general multiplicative functions. As applications, we prove non-trivial upper bounds for weighted averages of shifted convolutions involving $GL(2)$ and $GL(3)$ Fourier coefficients without smoothing. Our argument combines square-root cancellation on average over short intervals for $GL(2)$ Fourier coefficients with the standard Hardy-Littlewood circle method.
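For orientation, a shifted convolution sum in this setting typically has the shape $\sum_{n \le X} a(n)\, b(n+h)$ for arithmetic weights $a, b$ (e.g. $GL(2)$ or $GL(3)$ Fourier coefficients) and a fixed shift $h$; the results concern weighted averages of such sums over the shift $h$, bounded without smoothing the $n$-sum. This is only the standard shape of the object, as the paper's exact weights and ranges are not stated in the abstract.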
ISTI, CNR, Via Giuseppe M
Abstract
Understanding the inner workings of deep learning models is crucial for advancing artificial intelligence, particularly in high-stakes fields such as healthcare, where accurate explanations are as vital as precision. This paper introduces Batch-CAM, a novel training paradigm that fuses a batch implementation of the Grad-CAM algorithm with a prototypical reconstruction loss. This combination guides the model to focus on salient image features, thereby enhancing its performance across classification tasks. Our results demonstrate that Batch-CAM achieves a simultaneous improvement in accuracy and image reconstruction quality while reducing training and inference times. By ensuring models learn from evidence-relevant information, this approach makes a relevant contribution to building more transparent, explainable, and trustworthy AI systems.
AI Insights
  • Batch‑CAM introduces a loss that penalizes misleading Grad‑CAM maps, forcing the network to align predictions with true salient regions.
  • The combined cross‑entropy plus explanation loss yields a 2–3% boost on MNIST and ImageNet‑subset benchmarks while cutting inference time by ~15%.
  • Batch‑CAM’s prototypical reconstruction term encourages feature maps to reconstruct the input, revealing a dual role as both classifier and auto‑encoder.
  • Despite its gains, the method’s extra Grad‑CAM forward passes can inflate GPU memory usage, limiting scalability to very deep models.
  • For deeper dives, see the original Grad‑CAM paper and the open‑source explainerdashboard repository for interactive visualizations.
  • Key definitions: Explainability = clear, accurate reasoning for predictions; Interpretability = human‑understandable internal mechanics.
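Neither the abstract nor the notes above pin down the exact loss, but the main ingredients, batched Grad-CAM maps together with a cross-entropy and a reconstruction term, can be sketched in PyTorch as follows. The decoder, the weight lam, and the way the CAM gates the reconstruction are illustrative assumptions rather than the paper's formulation.

```python
import torch
import torch.nn.functional as F

def batch_grad_cam(features, logits, targets):
    """Grad-CAM for a whole batch: gradients of each sample's target logit w.r.t.
    its feature maps give channel weights; the ReLU'd weighted sum is the map."""
    score = logits.gather(1, targets[:, None]).sum()
    grads, = torch.autograd.grad(score, features, create_graph=True)
    weights = grads.mean(dim=(2, 3), keepdim=True)               # (B, C, 1, 1)
    cam = F.relu((weights * features).sum(dim=1, keepdim=True))  # (B, 1, H, W)
    return cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-8)

def batch_cam_step(backbone, head, decoder, images, targets, lam=0.1):
    """One training step combining classification with a reconstruction term
    computed from CAM-gated features (decoder and lam are assumptions)."""
    features = backbone(images)          # (B, C, H, W)
    logits = head(features)
    cam = batch_grad_cam(features, logits, targets)
    recon = decoder(features * cam)      # reconstruct from the salient evidence
    loss = F.cross_entropy(logits, targets) + lam * F.mse_loss(recon, images)
    return loss, cam
```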
Image Processing
Shuochen Chang
Abstract
Text-to-image generation has many practical applications, yet fine-grained styles cannot be precisely described and controlled in natural language, and the guidance provided by a stylized reference image is difficult to align directly with the textual conditions used in conventional text-guided generation. This study focuses on maximizing the generative capability of a pretrained generative model by extracting fine-grained stylistic representations from a single stylistic reference image and injecting them into the generation process without changing the structural framework of the downstream generative model, thereby achieving fine-grained, controlled stylized image generation. We propose a three-stage, style-extraction-based image generation method that uses a style encoder and a style projection layer to align style representations with textual representations, enabling fine-grained, text-prompted style-guided generation. In addition, we construct the Style30k-captions dataset, whose samples are triads of images, style labels, and text descriptions, to train the style encoder and the style projection layer.
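The abstract describes the mechanism only at a high level; the recurring pattern, a style encoder whose output is projected into the text-embedding space and appended as extra conditioning tokens, can be sketched as below. The encoder choice, the number of style tokens, and the injection point are assumptions here and do not reproduce the paper's three-stage training pipeline.

```python
import torch
import torch.nn as nn

class StyleInjectorSketch(nn.Module):
    """Extract a style embedding from one reference image and project it into the
    text-embedding space as extra conditioning tokens (illustrative sketch)."""
    def __init__(self, image_encoder: nn.Module, img_dim: int, txt_dim: int, n_tokens: int = 4):
        super().__init__()
        self.image_encoder = image_encoder                     # e.g. a frozen CLIP-style image encoder
        self.style_proj = nn.Linear(img_dim, n_tokens * txt_dim)
        self.n_tokens, self.txt_dim = n_tokens, txt_dim

    def forward(self, style_image, text_embeddings):
        style = self.image_encoder(style_image)                # (B, img_dim)
        tokens = self.style_proj(style).view(-1, self.n_tokens, self.txt_dim)
        # The downstream generator is left unchanged; it simply cross-attends
        # to the extended conditioning sequence instead of text tokens alone.
        return torch.cat([text_embeddings, tokens], dim=1)     # (B, T + n_tokens, txt_dim)
```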
Abstract
The problem of image registration is finding a transformation that aligns two images such that corresponding points end up in the same location. This paper introduces a simple, end-to-end trainable algorithm that can be implemented in a few lines of Python code. The approach is shown to work with very little training data and training time, while achieving accurate results in some settings. An example application to stereo vision was trained from 74 images on a 19x15 input window. With just a dozen lines of Python code, the algorithm excels in brevity and may serve as a good starting point in related scenarios with limited training data, training time, or code complexity.
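The abstract emphasizes brevity without listing the code; a minimal sketch in the same spirit, here fitting a 2-D affine warp by gradient descent on a photometric loss, is shown below. This is an assumed formulation for illustration, not necessarily the one used in the paper.

```python
import torch
import torch.nn.functional as F

def register_affine(moving, fixed, steps=300, lr=1e-2):
    """Fit a 2x3 affine warp that aligns `moving` to `fixed` (both of shape
    (1, 1, H, W)) by minimizing the mean squared photometric error."""
    theta = torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        grid = F.affine_grid(theta[None], fixed.shape, align_corners=False)
        warped = F.grid_sample(moving, grid, align_corners=False)
        loss = F.mse_loss(warped, fixed)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return theta.detach(), warped.detach()
```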
fusion models
College of Computer Sci
Abstract
Accurate animal counting is essential for smart farming but remains difficult in crowded scenes due to occlusions and limited camera views. To address this, we propose a tri-plane-based multi-view chicken counting model (TP-MVCC), which leverages geometric projection and tri-plane fusion to integrate features from multiple cameras onto a unified ground plane. The framework extracts single-view features, aligns them via spatial transformation, and decodes a scene-level density map for precise chicken counting. In addition, we construct the first multi-view dataset of silkie chickens under real farming conditions. Experiments show that TP-MVCC significantly outperforms single-view and conventional fusion baselines, achieving 95.1% accuracy and strong robustness in dense, occluded scenarios, demonstrating its practical potential for intelligent agriculture.
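The abstract outlines the pipeline without implementation details; the projection-and-fusion pattern, warping each view's feature map onto a shared ground plane with a per-camera homography, averaging, and decoding a density map whose sum is the count, might be sketched as follows. The homography warp and the simple average stand in for the paper's geometric projection and tri-plane fusion, and are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def warp_to_ground(feat, H_ground_to_img, out_hw):
    """Sample one view's feature map (B, C, h, w) at ground-plane locations, using
    a 3x3 homography mapping normalized ground coords to normalized image coords."""
    Hg, Wg = out_hw
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, Hg), torch.linspace(-1, 1, Wg), indexing="ij")
    ground = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3)  # (Hg*Wg, 3)
    img = ground @ H_ground_to_img.T
    img = img[:, :2] / img[:, 2:].clamp(min=1e-6)                               # perspective divide
    grid = img.reshape(1, Hg, Wg, 2).expand(feat.size(0), -1, -1, -1)
    return F.grid_sample(feat, grid, align_corners=False)

def fused_count(views, homographies, encoder, decoder, out_hw=(128, 128)):
    """Encode each camera view, project features to the ground plane, average them,
    and decode a scene-level density map; its sum is the count (illustrative)."""
    ground_feats = [warp_to_ground(encoder(v), H, out_hw) for v, H in zip(views, homographies)]
    fused = torch.stack(ground_feats).mean(dim=0)
    density = decoder(fused)                 # (B, 1, Hg, Wg), assumed non-negative
    return density.sum(dim=(1, 2, 3)), density
```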
Federico Oliva, Tom Shaked
Abstract
Efficient observer design and accurate sensor fusion are key in state estimation. This work proposes an optimization-based methodology, termed Trajectory Based Optimization Design (TBOD), allowing the user to easily design observers for general nonlinear systems and multi-sensor setups. Starting from parametrized observer dynamics, the proposed method considers a finite set of pre-recorded measurement trajectories from the nominal plant and exploits them to tune the observer parameters through numerical optimization. This research hinges on classic observer theory and the Moving Horizon Estimation methodology. Optimization is exploited to ease the observer's design, providing the user with a lightweight, general-purpose sensor fusion methodology. TBOD's main characteristics are its capability to handle general sensors efficiently and in a modular way and, most importantly, its straightforward tuning procedure. TBOD's performance is tested on a terrestrial rover localization problem, combining an IMU and ranging sensors provided by Ultra Wide Band antennas, and validated through a motion-capture system. A comparison with an Extended Kalman Filter is also provided: TBOD matches its position estimation accuracy and significantly improves orientation estimation.
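The method is easiest to picture as "fit the observer gains to recorded data"; a minimal sketch, assuming a Luenberger-style correction term and a least-squares cost over the recorded trajectories (the paper's parametrized observer class and cost are more general), could be:

```python
import numpy as np
from scipy.optimize import minimize

def tune_observer(f, h, trajectories, x0, n_x, n_y, dt):
    """Tune a constant gain L (n_x x n_y) by minimizing the output prediction error
    over pre-recorded (u_seq, y_seq) trajectories from the nominal plant.
    f(x, u): nominal dynamics; h(x): measurement model (both user supplied)."""
    def cost(L_flat):
        L = L_flat.reshape(n_x, n_y)
        err = 0.0
        for u_seq, y_seq in trajectories:
            x_hat = np.array(x0, dtype=float)
            for u, y in zip(u_seq, y_seq):
                innovation = np.asarray(y) - h(x_hat)
                err += float(innovation @ innovation)
                # Euler step: model prediction plus gain-weighted correction
                x_hat = x_hat + dt * (f(x_hat, u) + L @ innovation)
        return err
    res = minimize(cost, np.zeros(n_x * n_y), method="Nelder-Mead")
    return res.x.reshape(n_x, n_y)
```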