TY - GEN
T1 - Visual Instruction Tuning for Holistic and Regional Remote Sensing Imagery Comprehension
AU - Zhang, Wei
AU - Cai, Miaoxin
AU - Zhang, Tong
AU - Yin, Zhuang
AU - Mao, Xuerui
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Recently, the visual instruction multimodal large language models (MLLMs) have been extensively studied in the nature scenario. However, current remote sensing (RS) MLLMs mainly focus on image-level understanding and typically allow interaction only through text instructions, resulting in limited accuracy and efficiency. To address those limitations, a visual instruction model is proposed in this article to extend MLLM's fine-grained perception ability by incorporating bounding boxes with language instruction, aiming at achieving region-level visual understanding. To achieve this goal, a visual instruction dataset featuring multi-modal box-based region-text pairs is constructed. Furthermore, the visual instructions and images are encoded by the different encoders and subsequently fed into a large language model (LLM) along with text instruction tokens for tuning. Finally, experimental results demonstrate the proposed model's promising performance in region-level image comprehension.
AB - Recently, the visual instruction multimodal large language models (MLLMs) have been extensively studied in the nature scenario. However, current remote sensing (RS) MLLMs mainly focus on image-level understanding and typically allow interaction only through text instructions, resulting in limited accuracy and efficiency. To address those limitations, a visual instruction model is proposed in this article to extend MLLM's fine-grained perception ability by incorporating bounding boxes with language instruction, aiming at achieving region-level visual understanding. To achieve this goal, a visual instruction dataset featuring multi-modal box-based region-text pairs is constructed. Furthermore, the visual instructions and images are encoded by the different encoders and subsequently fed into a large language model (LLM) along with text instruction tokens for tuning. Finally, experimental results demonstrate the proposed model's promising performance in region-level image comprehension.
KW - Multimodal large language models
KW - Remote sensing
KW - Visual instruction
UR - http://www.scopus.com/inward/record.url?scp=86000005220&partnerID=8YFLogxK
U2 - 10.1109/ICSIDP62679.2024.10868722
DO - 10.1109/ICSIDP62679.2024.10868722
M3 - Conference contribution
AN - SCOPUS:86000005220
T3 - IEEE International Conference on Signal, Information and Data Processing, ICSIDP 2024
BT - IEEE International Conference on Signal, Information and Data Processing, ICSIDP 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2nd IEEE International Conference on Signal, Information and Data Processing, ICSIDP 2024
Y2 - 22 November 2024 through 24 November 2024
ER -