EarthGPT-X: A Spatial MLLM for Multilevel Multisource Remote Sensing Imagery Understanding With Visual Prompting

  • Wei Zhang
  • Miaoxin Cai
  • Yaqian Ning
  • Tong Zhang
  • Yin Zhuang*
  • Shijian Lu*
  • He Chen
  • Jun Li
  • Xuerui Mao*
  • *Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Recent advances in natural-domain multimodal large language models (MLLMs) have demonstrated effective spatial reasoning through visual and textual prompting. However, their direct transfer to remote sensing (RS) is hindered by heterogeneous sensing physics, diverse modalities, and unique spatial scales. Existing RS MLLMs are mainly limited to optical imagery and plain language interaction, preventing flexible and scalable real-world applications. This article proposes EarthGPT-X, the first flexible spatial MLLM that unifies multisource RS imagery comprehension and accomplishes both coarse-grained and fine-grained visual tasks under diverse visual prompts in a single framework. Distinct from prior models, EarthGPT-X introduces the following: 1) a dual-prompt mechanism that combines text instructions with various visual prompts (i.e., point, box, and free-form) to mimic the versatility of referring in human life; 2) a comprehensive multisource multilevel prompting dataset, with which the model advances beyond holistic image understanding to support hierarchical spatial reasoning, including scene-level understanding and fine-grained analysis of object attributes and relations; and 3) a cross-domain one-stage fusion training strategy that enables efficient and consistent alignment across modalities and tasks. Extensive experiments demonstrate that EarthGPT-X substantially outperforms prior natural and RS MLLMs, establishing the first framework capable of multisource, multitask, and multilevel interpretation using visual prompting in RS scenarios. The code and dataset are available at https://github.com/wivizhang/EarthGPT-X.

Original language: English
Article number: 4709221
Journal: IEEE Transactions on Geoscience and Remote Sensing
Volume: 63
Publication status: Published - 2025

Keywords

  • Multimodal large language models (MLLMs)
  • multisource
  • remote sensing (RS)
  • spatial reasoning
