
Structurally Consistent and Grounding-Aware Stagewise Reasoning for Referring Remote Sensing Image Segmentation

  • Shan Dong
  • Jianlin Xie
  • Liang Chen
  • He Chen*
  • Baogui Qi
  • Yunqiu Ge
  • *Corresponding author for this work
  • Beijing Institute of Technology
  • Tsinghua University

Research output: Contribution to journal; Article; peer-review

Abstract

Referring Remote Sensing Image Segmentation (RRSIS) is a representative multimodal understanding task for remote sensing, which segments designated targets from remote sensing images according to free-form natural language descriptions. However, complex remote sensing characteristics, such as cluttered backgrounds, large scale variations, small scattered targets and repetitive textures, lead to unstable visual grounding and, in turn, spatial grounding drift, resulting in inaccurate segmentation results. Existing approaches typically perform implicit visual–linguistic fusion across encoding and decoding stages, entangling spatial grounding with mask refinement. This tightly coupled formulation lacks explicit structural constraints and is prone to cross-modal ambiguity, especially in complex remote sensing layouts. To address these limitations, we propose a Structurally consistent and Grounding-aware Stagewise Reasoning Framework (SGSRF) that follows a grounding-first, segmentation-second paradigm. The framework decomposes inference into three cascaded stages with progressively imposed structural constraints. First, Cross-modal Consistency Refinement (CCR) lays the foundation for stable spatial grounding by enhancing visual–textual structural alignment via CLIP-based features and Structural Consistency Regularization (SCR), producing well-aligned multimodal representations and reliable grounding cues. Second, Grounding-aware Prompt Generation (GPG) bridges grounding and segmentation by converting aligned representations into complementary sparse and dense prompts, which serve as explicit grounding guidance for the segmentation model. Third, Grounding Modulated Segmentation (GMS) leverages the Segment Anything Model (SAM) to generate fine-grained mask predictions under the joint guidance of prompts and grounding cues, improving spatial grounding stability and robustness to background interference and scale variation.
Extensive experiments on three remote sensing benchmarks, namely RefSegRS, RRSIS-D, and RISBench, demonstrate that SGSRF achieves state-of-the-art performance. The proposed stagewise paradigm integrates structural alignment, explicit grounding, and prompt-driven segmentation into a unified framework, providing a practical and robust solution for RRSIS in real-world Earth observation applications.
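
The grounding-first, segmentation-second flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`grounding_map`, `make_prompts`, `segment`), the cosine-similarity grounding, and the thresholding step are all assumptions made for illustration. In particular, the first stage stands in for CCR without any learned refinement or SCR, and the third stage replaces SAM with a simple threshold so that the example stays self-contained.

```python
import numpy as np

def grounding_map(patch_feats, text_feat):
    """Stage 1 stand-in: cosine similarity between every patch feature
    and the text embedding. Hypothetical; the paper's CCR is learned."""
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    return p @ t  # shape (H, W)

def make_prompts(sim, keep=0.5):
    """Stage 2 stand-in: turn the grounding map into a sparse prompt
    (peak point and bounding box) and a dense prompt (normalised map)."""
    dense = (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)
    ys, xs = np.nonzero(dense >= keep)
    peak_y, peak_x = np.unravel_index(np.argmax(dense), dense.shape)
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    return {"point": (int(peak_x), int(peak_y)), "box": box, "dense": dense}

def segment(prompts, threshold=0.5):
    """Stage 3 placeholder: a real system would hand the prompts to a
    promptable segmenter such as SAM. Here the dense prompt is simply
    thresholded inside the box."""
    x0, y0, x1, y1 = prompts["box"]
    dense = prompts["dense"]
    mask = np.zeros(dense.shape, dtype=bool)
    mask[y0:y1 + 1, x0:x1 + 1] = dense[y0:y1 + 1, x0:x1 + 1] >= threshold
    return mask

# Toy input: an 8x8 grid of 4-d patch features with one text-aligned region.
patch_feats = np.zeros((8, 8, 4))
patch_feats[..., 1] = 1.0                      # background direction
patch_feats[2:4, 3:6] = [1.0, 0.0, 0.0, 0.0]   # target direction
text_feat = np.array([1.0, 0.0, 0.0, 0.0])

prompts = make_prompts(grounding_map(patch_feats, text_feat))
mask = segment(prompts)
```

The point of the sketch is the separation of concerns: localisation produces explicit prompts first, and mask prediction consumes them afterwards, rather than both being entangled in one fused decoder.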

Original language: English
Article number: 1015
Journal: Remote Sensing
Volume: 18
Issue number: 7
Publication status: Published - Apr 2026

Keywords

  • multimodal understanding
  • referring segmentation
  • remote sensing
  • segment anything model
  • spatial grounding
