EarthGPT-X: A Spatial MLLM for Multilevel Multisource Remote Sensing Imagery Understanding With Visual Prompting

  • Beijing Institute of Technology
  • Nanyang Technological University
  • China University of Geosciences, Wuhan
  • State Key Laboratory of Explosion Science and Technology

Research output: Contribution to journal, article, peer-reviewed

Abstract

Recent advances in natural-domain multimodal large language models (MLLMs) have demonstrated effective spatial reasoning through visual and textual prompting. However, their direct transfer to remote sensing (RS) is hindered by heterogeneous sensing physics, diverse modalities, and unique spatial scales. Existing RS MLLMs are mainly limited to optical imagery and plain language interaction, preventing flexible and scalable real-world applications. This article proposes EarthGPT-X, the first flexible spatial MLLM that unifies multisource RS imagery comprehension and accomplishes both coarse-grained and fine-grained visual tasks under diverse visual prompts in a single framework. Distinct from prior models, EarthGPT-X introduces the following: 1) a dual-prompt mechanism combining text instructions with various visual prompts (i.e., point, box, and free-form) to mimic the versatility of referring in human life; 2) a comprehensive multisource multilevel prompting dataset, with which the model advances beyond holistic image understanding to support hierarchical spatial reasoning, including scene-level understanding and fine-grained analysis of object attributes and relations; and 3) a cross-domain one-stage fusion training strategy, enabling efficient and consistent alignment across modalities and tasks. Extensive experiments demonstrate that EarthGPT-X substantially outperforms prior natural and RS MLLMs, establishing the first framework capable of multisource, multitask, and multilevel interpretation using visual prompting in RS scenarios. The code and dataset are available at https://github.com/wivizhang/EarthGPT-X.

Original language: English
Article number: 4709221
Journal: IEEE Transactions on Geoscience and Remote Sensing
Volume: 63
DOI
Publication status: Published - 2025
