ReSaP: Reasoning-Enhanced and Scale-Aware Prompting for Referring Remote Sensing Image Segmentation

  • Ning Lv
  • Jisheng Dang*
  • Teng Wang
  • Bimei Wang
  • Yichu Liu
  • Hong Peng
  • Haowen Yan
  • Bin Hu

  *Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

Abstract

Recent research has actively explored diverse mechanisms to unlock pixel-level segmentation capabilities in Multimodal Large Language Models (MLLMs), aiming to bridge the gap between high-level semantic reasoning and fine-grained visual perception. However, directly transferring these general-domain frameworks to Referring Remote Sensing Image Segmentation (RRSIS) faces significant hurdles. These challenges stem primarily from the weak pixel-level discrimination capability of MLLMs in complex geospatial scenes and from the severe granularity mismatch caused by drastic scale variations in remote sensing targets. To overcome these limitations, this paper proposes ReSaP, a Reasoning-enhanced and Scale-aware Prompting framework. ReSaP incorporates two core components to adapt MLLMs for pixel-wise tasks. First, we introduce a Pixel-Aware GRPO training scheme. This scheme uses a reinforcement learning framework with a hybrid reward mechanism that integrates bipartite matching for localization and classification accuracy for verification, explicitly enhancing the MLLM's fine-grained pixel discrimination and localization precision. Second, we propose the Scale-Aware Prompting strategy for inference. This mechanism employs a density-adaptive grid sampling approach to dynamically adjust the prompt configuration based on target dimensions, harmonizing prompt granularity with object scale. Extensive experiments on the RRSIS-D and RIS-LAD benchmarks demonstrate that ReSaP significantly outperforms existing state-of-the-art methods and remains robust across both satellite and unmanned aerial vehicle (UAV) observation perspectives.
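The abstract names the two components but gives no formulas, so the following is only a minimal sketch of the general ideas, not the authors' implementation. Every function name, parameter, and scoring rule here (`grid_prompts`, `match_score`, `hybrid_reward`, the square-root density rule, the exponential distance score, the `weight` mix) is a hypothetical choice made for illustration. The matching is brute force, which is only practical for a handful of points.

```python
import math
from itertools import permutations


def grid_prompts(box, image_size, min_side=1, max_side=4):
    """Sample a point grid inside `box`; larger targets get a denser grid.

    box: (x0, y0, x1, y1) in pixels; image_size: (width, height).
    The density rule below is an assumption, not taken from the paper.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    # Fraction of the image covered by the box, in [0, 1].
    area_ratio = ((x1 - x0) * (y1 - y0)) / (width * height)
    # Assumed rule: points per side grow with the square root of coverage.
    side = min_side + round((max_side - min_side) * math.sqrt(area_ratio))
    side = max(min_side, min(max_side, side))
    # Place one point at the centre of each cell of a side x side grid.
    return [
        (x0 + (i + 0.5) * (x1 - x0) / side, y0 + (j + 0.5) * (y1 - y0) / side)
        for j in range(side)
        for i in range(side)
    ]


def match_score(pred_pts, gt_pts, scale=10.0):
    """Best one-to-one (bipartite) matching score in [0, 1], by brute force."""
    small, large = sorted((pred_pts, gt_pts), key=len)
    if not small:
        return 0.0
    best = 0.0
    for perm in permutations(range(len(large)), len(small)):
        # Assumed per-pair score: 1 at zero distance, decaying with distance.
        total = sum(
            math.exp(-math.dist(p, large[j]) / scale) for p, j in zip(small, perm)
        )
        best = max(best, total)
    # Normalise by the larger set so unmatched points lower the score.
    return best / len(large)


def hybrid_reward(pred_pts, gt_pts, pred_labels, gt_labels, weight=0.5):
    """Mix a localization term (matching) with a verification term (accuracy)."""
    loc = match_score(pred_pts, gt_pts)
    correct = sum(p == g for p, g in zip(pred_labels, gt_labels))
    acc = correct / len(gt_labels) if gt_labels else 0.0
    return weight * loc + (1 - weight) * acc
```

Under these assumptions, a box covering 1% of the image yields a single centre point, while a box covering the whole image yields a 4 x 4 grid. For larger point sets, an assignment solver such as the Hungarian algorithm would replace the brute-force search.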

Original language: English
Journal: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Publication status: Accepted/In press - 2026
Externally published: Yes

Keywords

  • Multimodal large language models
  • reasoning prompting
  • referring remote sensing image segmentation
  • reinforcement learning
  • visual grounding
