Xi'an Jiaotong University
Abstract
Automated textual description of remote sensing images is crucial for
unlocking their full potential in diverse applications, from environmental
monitoring to urban planning and disaster management. However, existing studies
in remote sensing image captioning primarily focus on the image level, lacking
object-level fine-grained interpretation, which prevents the full utilization
and transformation of the rich semantic and structural information contained in
remote sensing images. To address this limitation, we propose Geo-DLC, a novel
task of object-level fine-grained image captioning for remote sensing. To
support this task, we construct DE-Dataset, a large-scale dataset containing 25
categories and 261,806 annotated instances with detailed descriptions of object
attributes, relationships, and contexts. Furthermore, we introduce
DE-Benchmark, an LLM-assisted, question-answering-based evaluation suite designed
to systematically measure model capabilities on the Geo-DLC task. We also
present DescribeEarth, a Multi-modal Large Language Model (MLLM) architecture
explicitly designed for Geo-DLC, which integrates a scale-adaptive focal
strategy and a domain-guided fusion module leveraging remote sensing
vision-language model features to encode high-resolution details and remote
sensing category priors while maintaining global context. Our DescribeEarth
model consistently outperforms state-of-the-art general MLLMs on DE-Benchmark,
demonstrating superior factual accuracy, descriptive richness, and grammatical
soundness, particularly in capturing intrinsic object features and surrounding
environmental attributes across simple, complex, and even out-of-distribution
remote sensing scenarios. All data, code, and weights are released at
https://github.com/earth-insights/DescribeEarth.
Institute of Mathematical
Abstract
The problem of classification in machine learning has often been approached
in terms of function approximation. In this paper, we propose an alternative
approach for classification in arbitrary compact metric spaces which, in
theory, yields both the number of classes and a perfect classification using a
minimal number of queried labels. Our approach uses localized trigonometric
polynomial kernels initially developed for the point source signal separation
problem in signal processing. Rather than point sources, we argue that the
various classes come from different probability distributions. The localized
kernel technique developed for separating point sources is then shown to
separate the supports of these distributions. This is done in a hierarchical
manner in our MASC algorithm to accommodate touching/overlapping class
boundaries. We illustrate our theory on several simulated and real-life
datasets, including the Salinas and Indian Pines hyperspectral datasets and a
document dataset.
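
As a rough illustration of the localized-kernel idea described above, the following minimal sketch works on the unit circle rather than a general compact metric space. It is not the MASC algorithm: it only shows how a localized trigonometric polynomial kernel, averaged over samples, can separate the supports of two distributions. The filter shape, the use of the squared kernel, the degree n = 32, the relative threshold 0.2, and all function names are assumptions made for illustration.

```python
import numpy as np

def smooth_filter(t):
    """Low-pass filter (assumed shape): 1 on [0, 1/2], smooth decay to 0 at t = 1."""
    t = np.abs(np.asarray(t, dtype=float))
    out = np.zeros_like(t)
    out[t <= 0.5] = 1.0
    mid = (t > 0.5) & (t < 1.0)
    s = (t[mid] - 0.5) / 0.5            # rescale the transition band to (0, 1)
    a = np.exp(-1.0 / (1.0 - s))
    b = np.exp(-1.0 / s)
    out[mid] = a / (a + b)              # infinitely smooth step from 1 down to 0
    return out

def localized_kernel(x, n):
    """Phi_n(x) = sum over |k| < n of H(|k|/n) exp(ikx), written with cosines."""
    k = np.arange(1, n)
    weights = smooth_filter(k / n)
    x = np.asarray(x, dtype=float)
    return 1.0 + 2.0 * np.cos(np.multiply.outer(x, k)) @ weights

def support_mask(grid, samples, n, rel_threshold):
    """Mark grid points where the averaged squared kernel is large.

    Squaring the kernel and thresholding relative to the maximum are
    illustrative choices, not taken from the paper.
    """
    diffs = grid[:, None] - samples[None, :]
    score = (localized_kernel(diffs, n) ** 2).mean(axis=1)
    return score >= rel_threshold * score.max()

# Hypothetical data: two classes supported on well-separated arcs of the circle.
samples = np.concatenate([np.linspace(0.8, 1.2, 100), np.linspace(3.8, 4.2, 100)])
grid = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
mask = support_mask(grid, samples, n=32, rel_threshold=0.2)

# Count connected components of the mask on the circle (rising edges, with wraparound).
estimated_classes = int(np.sum(mask & ~np.roll(mask, 1)))
print("estimated number of classes:", estimated_classes)
```

In this toy setting, each connected component of the thresholded region plays the role of one class, so a single label query per component would suffice to label every sample. Handling touching or overlapping supports, which the abstract attributes to the hierarchical MASC procedure, is outside the scope of this sketch.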