
Structurally Consistent and Grounding-Aware Stagewise Reasoning for Referring Remote Sensing Image Segmentation

  • Shan Dong
  • Jianlin Xie
  • Liang Chen
  • He Chen*
  • Baogui Qi
  • Yunqiu Ge
  • *Corresponding author for this work
  • Beijing Institute of Technology
  • Tsinghua University

Research output: Contribution to journal, Article, peer-reviewed

Abstract

Referring Remote Sensing Image Segmentation (RRSIS) is a representative multimodal understanding task for remote sensing, which segments designated targets from remote sensing images according to free-form natural language descriptions. However, complex remote sensing characteristics, such as cluttered backgrounds, large variations in scale, small scattered targets and repetitive textures, lead to unstable visual grounding and, in turn, spatial grounding drift, resulting in inaccurate segmentation results. Existing approaches typically perform implicit visual–linguistic fusion across encoding and decoding stages, entangling spatial grounding with mask refinement. This tightly coupled formulation lacks explicit structural constraints and is prone to cross-modal ambiguity, especially in complex remote sensing layouts. To address these limitations, we propose a Structurally consistent and Grounding-aware Stagewise Reasoning Framework (SGSRF) that follows a grounding-first, segmentation-second paradigm. The framework decomposes inference into three cascaded stages with progressively imposed structural constraints. First, Cross-modal Consistency Refinement (CCR) lays the foundation for stable spatial grounding by enhancing visual–textual structural alignment via CLIP-based features and Structural Consistency Regularization (SCR), producing well-aligned multimodal representations and reliable grounding cues. Second, Grounding-aware Prompt Generation (GPG) bridges grounding and segmentation by converting aligned representations into complementary sparse and dense prompts, which serve as explicit grounding guidance for the segmentation model. Third, Grounding Modulated Segmentation (GMS) leverages the Segment Anything Model (SAM) to generate fine-grained mask predictions under the joint guidance of prompts and grounding cues, improving spatial grounding stability and robustness to background interference and scale variation.
Extensive experiments on three remote sensing benchmarks, namely RefSegRS, RRSIS-D, and RISBench, demonstrate that SGSRF achieves state-of-the-art performance. The proposed stagewise paradigm integrates structural alignment, explicit grounding, and prompt-driven segmentation into a unified framework, providing a practical and robust solution for RRSIS in real-world Earth observation applications.
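The grounding-first, segmentation-second data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`grounding_map`, `make_prompts`, `segment`), the cosine-similarity heatmap, the top-k point selection, the threshold, and the flood-fill "segmenter" are all hypothetical stand-ins chosen only to show how a grounding cue might be turned into a sparse prompt (points) and a dense prompt (a coarse mask) before a promptable segmenter refines the result. In SGSRF these roles are played by CCR, GPG, and a SAM-based GMS stage, whose internals the abstract does not specify.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def grounding_map(patch_feats, text_feat):
    """Stage 1 stand-in (assumption): per-patch similarity to the text embedding."""
    return [[cosine(p, text_feat) for p in row] for row in patch_feats]

def make_prompts(heat, k=1, thresh=0.5):
    """Stage 2 stand-in (assumption): sparse prompt = top-k (row, col) points,
    dense prompt = binary mask of cells at or above `thresh`."""
    cells = [(v, r, c) for r, row in enumerate(heat) for c, v in enumerate(row)]
    cells.sort(key=lambda t: -t[0])  # stable sort: ties keep row-major order
    points = [(r, c) for _, r, c in cells[:k]]
    dense = [[1 if v >= thresh else 0 for v in row] for row in heat]
    return points, dense

def segment(points, dense):
    """Stage 3 stand-in (assumption): keep only dense-mask cells that are
    4-connected to a sparse point. A real system would call a promptable
    segmentation model here instead of a flood fill."""
    h, w = len(dense), len(dense[0])
    out = [[0] * w for _ in range(h)]
    stack = [p for p in points if dense[p[0]][p[1]]]
    while stack:
        r, c = stack.pop()
        if out[r][c]:
            continue
        out[r][c] = 1
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and dense[nr][nc] and not out[nr][nc]:
                stack.append((nr, nc))
    return out

if __name__ == "__main__":
    # Toy 2x2 grid of 2-D patch features and a 2-D text feature (made-up values).
    patches = [[[1.0, 0.0], [0.0, 1.0]],
               [[1.0, 1.0], [0.0, 1.0]]]
    text = [1.0, 0.0]
    heat = grounding_map(patches, text)
    points, dense = make_prompts(heat, k=1, thresh=0.5)
    print(points, dense, segment(points, dense))
```

The point of separating the stages is that the segmenter only ever sees explicit prompts, so a grounding error can be inspected at the heatmap or prompt level rather than being hidden inside a fused decoder.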

Original language: English
Article number: 1015
Journal: Remote Sensing
Volume: 18
Issue: 7
DOI
Publication status: Published - April 2026
