Visual Instruction Tuning for Holistic and Regional Remote Sensing Imagery Comprehension

Wei Zhang*, Miaoxin Cai, Tong Zhang, Zhuang Yin, Xuerui Mao

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Recently, visual instruction multimodal large language models (MLLMs) have been extensively studied for natural scenes. However, current remote sensing (RS) MLLMs mainly focus on image-level understanding and typically allow interaction only through text instructions, which limits their accuracy and efficiency. To address these limitations, this article proposes a visual instruction model that extends the fine-grained perception ability of MLLMs by combining bounding boxes with language instructions, with the aim of achieving region-level visual understanding. To this end, a visual instruction dataset of multimodal box-based region-text pairs is constructed. The visual instructions and images are then encoded by separate encoders and fed into a large language model (LLM) together with the text instruction tokens for tuning. Experimental results demonstrate the proposed model's promising performance in region-level image comprehension.
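The input arrangement described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name RegionTextPair, the helper functions, the pixel-coordinate box format, the normalization step, and the ordering of token groups are all assumptions made for illustration, and the encoders themselves are replaced by placeholder token lists.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass(frozen=True)
class RegionTextPair:
    """Hypothetical record for one box-based region-text pair."""
    box: Box   # bounding box in pixel coordinates (assumed format)
    text: str  # language description or instruction tied to the region


def normalize_box(box: Box, width: int, height: int) -> Box:
    """Scale a pixel-space box to [0, 1] so it is independent of image size."""
    x0, y0, x1, y1 = box
    if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
        raise ValueError(f"box {box} does not fit a {width}x{height} image")
    return (x0 / width, y0 / height, x1 / width, y1 / height)


def build_llm_input(
    image_tokens: Sequence[str],
    region_tokens: Sequence[str],
    text_tokens: Sequence[str],
) -> List[str]:
    """Concatenate the three token groups into one sequence for the LLM.

    The ordering (image, then region, then text) is an assumption; the
    abstract only states that all three are fed to the LLM together.
    """
    return [*image_tokens, *region_tokens, *text_tokens]


# Example with placeholder tokens standing in for real encoder outputs.
pair = RegionTextPair(box=(100, 50, 300, 250), text="Describe this region.")
norm_box = normalize_box(pair.box, width=400, height=500)
sequence = build_llm_input(
    image_tokens=["<img_0>", "<img_1>"],
    region_tokens=["<box_0>"],
    text_tokens=pair.text.split(),
)
```

In a real system the placeholder strings would be embedding vectors produced by the image encoder and the visual instruction encoder, projected into the LLM's token space.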

Original language: English
Title of host publication: IEEE International Conference on Signal, Information and Data Processing, ICSIDP 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798331515669
DOIs
Publication status: Published - 2024
Event: 2nd IEEE International Conference on Signal, Information and Data Processing, ICSIDP 2024 - Zhuhai, China
Duration: 22 Nov 2024 to 24 Nov 2024

Publication series

Name: IEEE International Conference on Signal, Information and Data Processing, ICSIDP 2024

Conference

Conference: 2nd IEEE International Conference on Signal, Information and Data Processing, ICSIDP 2024
Country/Territory: China
City: Zhuhai
Period: 22/11/24 to 24/11/24

Keywords

  • Multimodal large language models
  • Remote sensing
  • Visual instruction
