Multimodal Prompt-Guided Bidirectional Fusion for Referring Remote Sensing Image Segmentation

Yingjie Li, Weiqi Jin*, Su Qiu, Qiyang Sun

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Multimodal feature alignment is a key challenge in referring remote sensing image segmentation (RRSIS). The complex spatial relationships and multi-scale targets in remote sensing images call for efficient cross-modal mapping and fine-grained feature alignment. Existing approaches typically rely on cross-attention for multimodal fusion, which increases model complexity. To address this, we introduce prompt learning into RRSIS and propose a parameter-efficient multimodal prompt-guided bidirectional fusion (MPBF) architecture. MPBF combines early and late fusion strategies. In the early fusion stage, it performs deep fusion of linguistic and visual features through cross-modal prompt coupling. In the late fusion stage, to handle the multi-scale nature of remote sensing targets, a scale refinement module captures diverse scale representations, and a vision–language alignment module establishes pixel-level multimodal semantic associations. Comparative experiments and ablation studies on a public dataset demonstrate that MPBF significantly outperforms existing state-of-the-art methods with relatively small computational overhead, highlighting its effectiveness and efficiency for RRSIS. Further application experiments on a custom dataset confirm the method’s practicality and robustness in real-world scenarios.
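
As a rough illustration of what "pixel-level multimodal semantic associations" can mean in practice, the sketch below scores every pixel feature against a single sentence embedding with cosine similarity and thresholds the result into a mask. This is a generic toy example and not the MPBF alignment module: the function names `cosine` and `align_pixels`, the threshold of 0.5, and the 2x2 feature values are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def align_pixels(pixel_features, text_embedding, threshold=0.5):
    """Return a binary mask: 1 where a pixel feature is similar to the text embedding.

    pixel_features is a height x width grid of feature vectors; the threshold
    is a hypothetical value chosen only for this toy example.
    """
    return [
        [1 if cosine(feat, text_embedding) > threshold else 0 for feat in row]
        for row in pixel_features
    ]


# Toy 2x2 feature map with 2-dimensional features (illustrative values only).
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [-1.0, 0.0]],
]
text = [1.0, 0.0]  # stand-in for an encoded referring expression
mask = align_pixels(features, text)
print(mask)
```

In a real model the pixel features and the text embedding would come from learned visual and language encoders projected into a shared space, and the similarity map would usually be trained with a segmentation loss instead of being thresholded by hand.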

Original language: English
Article number: 1683
Journal: Remote Sensing
Volume: 17
Issue number: 10
Publication status: Published - May 2025
Externally published: Yes

Keywords

  • bidirectional fusion
  • prompt learning
  • referring image segmentation
  • remote sensing
