Soft-Guided Open-Vocabulary Semantic Segmentation of Remote Sensing Images

Research output: Contribution to journal › Article › peer-review

Abstract

Open-vocabulary remote sensing (RS) semantic segmentation aims to assign both seen and unseen class labels to individual pixels in RS images. Existing models follow the “fine-tune” paradigm based on vision-language models (VLMs). However, because VLMs are predominantly tailored to natural scenes, these directly fine-tuned models often collapse into the seen categories and are insensitive to RS semantic cues. This critical issue of model collapse is closely related to the misalignment between image and text, which leaves the models struggling with the unique challenges of RS images, such as complex and diverse scenes and objects with significant scale differences. To this end, we propose a soft-guided open-vocabulary RS semantic segmentation framework, which is the first to explore how to softly adapt VLMs to the downstream task of semantic segmentation for RS images. Concretely, instead of directly fine-tuning, we introduce a generalization compensation strategy, which employs an additional frozen VLM encoder to provide implicit semantic guidance for the dynamic optimization of the visual representation. By introducing prior knowledge from the frozen encoder, this soft strategy compensates for potential losses incurred during fine-tuning, thus enhancing the model’s pixel-level perceptual alignment while avoiding model collapse. Next, to improve the sensitivity of the VLM’s textual and visual embeddings to RS semantic information, we present a bias-guided image–text collaborative optimization that achieves a bilateral interaction of semantic information under the guidance of the RS scene bias. Finally, an improved upsampling decoder progressively refines and calibrates the cost map by integrating multiscale information and textual embeddings. Extensive experiments demonstrate that our method achieves state-of-the-art performance on widely used challenging benchmarks.
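
The abstract does not say how the frozen encoder's guidance is injected, so the following is only an illustrative sketch under stated assumptions, not the authors' method. It assumes the guidance can be approximated by a fixed convex blend of fine-tuned and frozen pixel embeddings (the weight `alpha` is hypothetical), and that the cost map is the cosine similarity between each pixel embedding and each class text embedding. All function names and toy values are invented for illustration.

```python
import math

def normalize(v, eps=1e-8):
    """Scale a vector to (approximately) unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v)) + eps
    return [x / n for x in v]

def blend(tuned, frozen, alpha=0.5):
    """Hypothetical soft guidance: convex mix of tuned and frozen embeddings."""
    return [alpha * t + (1.0 - alpha) * f for t, f in zip(tuned, frozen)]

def cost_vector(pixel, text_embeddings):
    """Cosine similarity between one pixel embedding and every class embedding."""
    p = normalize(pixel)
    return [sum(a * b for a, b in zip(p, normalize(t))) for t in text_embeddings]

def segment(tuned_map, frozen_map, text_embeddings, alpha=0.5):
    """Label each pixel with the class whose text embedding scores highest."""
    labels = []
    for tuned_row, frozen_row in zip(tuned_map, frozen_map):
        row = []
        for tuned_px, frozen_px in zip(tuned_row, frozen_row):
            costs = cost_vector(blend(tuned_px, frozen_px, alpha), text_embeddings)
            row.append(max(range(len(costs)), key=costs.__getitem__))
        labels.append(row)
    return labels

# Toy 1x2 "image" with 2-D embeddings and two class prompts (assumed values).
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]  # class 0 (seen), class 1 (unseen)
tuned_map = [[[1.0, 0.0], [1.0, 0.2]]]      # fine-tuned encoder, pulled toward class 0
frozen_map = [[[1.0, 0.0], [0.0, 1.0]]]     # frozen encoder retains the class-1 cue

print(segment(tuned_map, frozen_map, text_embeddings, alpha=1.0))  # [[0, 0]]
print(segment(tuned_map, frozen_map, text_embeddings, alpha=0.5))  # [[0, 1]]
```

With `alpha=1.0` only the fine-tuned features are used and both pixels fall into the seen class, a toy analogue of the collapse described above. Mixing in the frozen features recovers the unseen class for the second pixel. The framework described in the abstract optimizes the representation dynamically rather than using a fixed weight.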

Original language: English
Article number: 5652216
Journal: IEEE Transactions on Geoscience and Remote Sensing
Volume: 63
Publication status: Published - 2025

Keywords

  • Fine-tune
  • open-vocabulary segmentation
  • remote sensing (RS)
  • semantic segmentation
