TY - JOUR
T1 - Soft-Guided Open-Vocabulary Semantic Segmentation of Remote Sensing Images
AU - An, Ke
AU - Wang, Yupei
AU - Chen, Liang
N1 - Publisher Copyright: © 1980-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Open-vocabulary remote sensing (RS) semantic segmentation aims to assign both seen and unseen class labels to individual pixels in RS images. Existing models follow the “fine-tune” paradigm based on vision-language models (VLMs). However, as VLMs are predominantly tailored to natural scenes, these directly fine-tuned models often collapse into the seen categories and are insensitive to RS semantic cues. This model collapse is closely related to the misalignment between image and text, which leaves the models struggling with the unique challenges of RS images, such as complex and diverse scenes and objects with significant scale differences. To this end, we propose a soft-guided open-vocabulary RS semantic segmentation framework, which is the first to explore how to softly adapt VLMs to the downstream task of semantic segmentation for RS images. Concretely, instead of fine-tuning directly, we introduce a generalization compensation strategy, which employs an additional frozen VLM encoder to provide implicit semantic guidance for dynamic optimization of the visual representation. By introducing prior knowledge from the frozen encoder, this soft strategy compensates for potential losses incurred during fine-tuning, thus enhancing the model’s pixel-level perceptual alignment while avoiding model collapse. Afterward, to improve the sensitivity of the VLM’s textual and visual embeddings to RS semantic information, bias-guided image–text collaborative optimization is presented to achieve a bilateral interaction of semantic information under the guidance of RS scene bias. Finally, an improved upsampling decoder progressively refines and calibrates the cost map by integrating multiscale information and textual embeddings. Extensive experiments demonstrate that our method achieves state-of-the-art performance on widely used challenging benchmarks.
KW - Fine-tune
KW - open-vocabulary segmentation
KW - remote sensing (RS)
KW - semantic segmentation
UR - https://www.scopus.com/pages/publications/105020944645
U2 - 10.1109/TGRS.2025.3628336
DO - 10.1109/TGRS.2025.3628336
M3 - Article
AN - SCOPUS:105020944645
SN - 0196-2892
VL - 63
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
M1 - 5652216
ER -