TY - JOUR
T1 - EarthGPT-X
T2 - A Spatial MLLM for Multilevel Multisource Remote Sensing Imagery Understanding With Visual Prompting
AU - Zhang, Wei
AU - Cai, Miaoxin
AU - Ning, Yaqian
AU - Zhang, Tong
AU - Zhuang, Yin
AU - Lu, Shijian
AU - Chen, He
AU - Li, Jun
AU - Mao, Xuerui
N1 - Publisher Copyright:
© 1980-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Recent advances in natural-domain multimodal large language models (MLLMs) have demonstrated effective spatial reasoning through visual and textual prompting. However, their direct transfer to remote sensing (RS) is hindered by heterogeneous sensing physics, diverse modalities, and unique spatial scales. Existing RS MLLMs are mainly limited to optical imagery and plain language interaction, preventing flexible and scalable real-world applications. In this article, EarthGPT-X is proposed, the first flexible spatial MLLM that unifies multisource RS imagery comprehension and accomplishes both coarse-grained and fine-grained visual tasks under diverse visual prompts in a single framework. Distinct from prior models, EarthGPT-X introduces the following: 1) a dual-prompt mechanism combining text instructions with various visual prompts (i.e., point, box, and free-form) to mimic the versatility of referring in human life; 2) a comprehensive multisource multilevel prompting dataset, with which the model advances beyond holistic image understanding to support hierarchical spatial reasoning, including scene-level understanding, fine-grained object attributes, and relational analysis; and 3) a cross-domain one-stage fusion training strategy, enabling efficient and consistent alignment across modalities and tasks. Extensive experiments demonstrate that EarthGPT-X substantially outperforms prior natural and RS MLLMs, establishing the first framework capable of multisource, multitask, and multilevel interpretation using visual prompting in RS scenarios. The code and dataset are available at https://github.com/wivizhang/EarthGPT-X.
AB - Recent advances in natural-domain multimodal large language models (MLLMs) have demonstrated effective spatial reasoning through visual and textual prompting. However, their direct transfer to remote sensing (RS) is hindered by heterogeneous sensing physics, diverse modalities, and unique spatial scales. Existing RS MLLMs are mainly limited to optical imagery and plain language interaction, preventing flexible and scalable real-world applications. In this article, EarthGPT-X is proposed, the first flexible spatial MLLM that unifies multisource RS imagery comprehension and accomplishes both coarse-grained and fine-grained visual tasks under diverse visual prompts in a single framework. Distinct from prior models, EarthGPT-X introduces the following: 1) a dual-prompt mechanism combining text instructions with various visual prompts (i.e., point, box, and free-form) to mimic the versatility of referring in human life; 2) a comprehensive multisource multilevel prompting dataset, with which the model advances beyond holistic image understanding to support hierarchical spatial reasoning, including scene-level understanding, fine-grained object attributes, and relational analysis; and 3) a cross-domain one-stage fusion training strategy, enabling efficient and consistent alignment across modalities and tasks. Extensive experiments demonstrate that EarthGPT-X substantially outperforms prior natural and RS MLLMs, establishing the first framework capable of multisource, multitask, and multilevel interpretation using visual prompting in RS scenarios. The code and dataset are available at https://github.com/wivizhang/EarthGPT-X.
KW - Multimodal large language models (MLLMs)
KW - multisource
KW - remote sensing (RS)
KW - spatial reasoning
UR - https://www.scopus.com/pages/publications/105020734937
U2 - 10.1109/TGRS.2025.3626941
DO - 10.1109/TGRS.2025.3626941
M3 - Article
AN - SCOPUS:105020734937
SN - 0196-2892
VL - 63
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
M1 - 4709221
ER -