Abstract
Vision-language models (VLMs) have advanced rapidly, yet their capacity for
image-grounded geolocation in open-world conditions, a challenging task with
clear real-world demand, has not been comprehensively evaluated. We present
EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates
visual recognition, step-by-step reasoning, and evidence use. EarthWhere
comprises 810 globally distributed images across two complementary geolocation
scales: WhereCountry (500 multiple-choice questions over panoramas with
country-level answers) and WhereStreet (310 fine-grained
street-level identification tasks requiring multi-step reasoning with optional
web search). For evaluation, we adopt final-prediction metrics: location
accuracy within k km (Acc@k) for predicted coordinates and hierarchical path
scores for textual localization.
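As a concrete illustration, Acc@k can be read as the fraction of predicted coordinates whose great-circle distance to the ground truth is at most k km. The minimal Python sketch below assumes (latitude, longitude) pairs in degrees and uses the haversine formula; the coordinates and thresholds are illustrative, not taken from the benchmark.

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    # Great-circle distance in kilometers between two (lat, lon) points in degrees.
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi, dlam = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

def acc_at_k(preds, truths, k_km):
    # Fraction of predictions within k_km of their ground-truth coordinates.
    hits = sum(haversine_km(plat, plon, glat, glon) <= k_km
               for (plat, plon), (glat, glon) in zip(preds, truths))
    return hits / len(preds)

# Example usage with two made-up predictions: Acc@1 and Acc@25.
preds = [(48.8566, 2.3522), (40.7128, -74.0060)]
truths = [(48.8606, 2.3376), (34.0522, -118.2437)]
print(acc_at_k(preds, truths, 1), acc_at_k(preds, truths, 25))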
Beyond this, we propose to explicitly score intermediate reasoning chains using
human-verified key visual clues and a Shapley-reweighted thinking score that
attributes credit to each clue's marginal contribution.
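One way to picture Shapley reweighting over clues is sketched below; this is a hypothetical illustration, not the paper's exact formulation. Each human-verified clue is weighted by its average marginal contribution to a subset-level value function, and the thinking score sums the weights of clues the model's reasoning actually uses; the clue names and the value function are assumptions made for the example.

from itertools import combinations
from math import factorial

def shapley_weights(clues, value):
    # Exact Shapley value of each clue under a set-valued scoring function `value`.
    n = len(clues)
    weights = {c: 0.0 for c in clues}
    for c in clues:
        others = [x for x in clues if x != c]
        for r in range(len(others) + 1):
            for subset in combinations(others, r):
                s = frozenset(subset)
                coef = factorial(r) * factorial(n - r - 1) / factorial(n)
                weights[c] += coef * (value(s | {c}) - value(s))
    return weights

# Hypothetical value function: fraction of key clues in the subset that a model's
# reasoning chain mentions (a stand-in for a human-verified clue matcher).
clues = ["license plate", "road signage", "vegetation"]
mentioned = {"license plate", "road signage"}
value = lambda s: len(s & mentioned) / len(clues)

weights = shapley_weights(clues, value)
thinking_score = sum(w for c, w in weights.items() if c in mentioned)
print(weights, thinking_score)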
We benchmark 13 state-of-the-art VLMs equipped with web-search tools on
EarthWhere and report their final-answer accuracies under each metric as well
as their calibrated thinking scores. Overall, Gemini-2.5-Pro achieves the best average
accuracy at 56.32%, while the strongest open-weight model, GLM-4.5V, reaches
34.71%. We reveal that web search and reasoning do not guarantee improved
performance when visual clues are limited, and models exhibit regional biases,
scoring up to 42.7% higher in some regions than in others. These findings
highlight not only the promise of current VLMs but also the persistent
challenges they face in mitigating bias and achieving robust, fine-grained localization. We
open-source our benchmark at https://github.com/UCSC-VLAA/EarthWhere.