Combining Swin Transformer With UNet for Remote Sensing Image Semantic Segmentation

Lili Fan, Yu Zhou*, Hongmei Liu, Yunjie Li, Dongpu Cao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

26 Citations (Scopus)

Abstract

Remote sensing semantic segmentation plays a significant role in applications such as environmental monitoring, land use planning, and disaster response. Convolutional neural networks (CNNs) have long dominated remote sensing semantic segmentation. However, owing to the locality of convolution operations, CNNs cannot effectively model global context. The success of transformers in the natural language processing (NLP) domain offers a new approach to global context modeling. Inspired by the Swin transformer, we propose a novel remote sensing semantic segmentation model called CSTUNet. The model employs a dual-encoder structure consisting of a CNN-based main encoder and a Swin transformer-based auxiliary encoder. We first utilize a detail-structure preservation module (DPM) to mitigate the loss of detail and structure information caused by Swin transformer downsampling. We then introduce a spatial feature enhancement module (SFE) to collect contextual information across different spatial dimensions. Finally, we construct a position-aware attention fusion module (PAFM) to fuse contextual and local information. The proposed model achieves 70.75% mean intersection over union (MIoU) on the ISPRS-Vaihingen dataset and 77.27% MIoU on the ISPRS-Potsdam dataset.
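The reported scores use mean intersection over union (MIoU), the standard semantic-segmentation metric: for each class, IoU = TP / (TP + FP + FN), averaged over the classes present. A minimal pure-Python sketch of this computation (illustrative only, not the authors' evaluation code):

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union for flat per-pixel label sequences.

    Per-class IoU = TP / (TP + FP + FN); classes absent from both the
    prediction and the ground truth are excluded from the mean.
    """
    ious = []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        denom = tp + fp + fn
        if denom == 0:
            continue  # class never appears; skip it
        ious.append(tp / denom)
    return sum(ious) / len(ious)

# Toy example with two classes: each class has 2 TP, 1 FP, 1 FN,
# so IoU is 2/4 = 0.5 for both and the mean is 0.5.
pred   = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
print(mean_iou(pred, target, 2))  # → 0.5
```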

Original language: English
Article number: 5530111
Pages (from-to): 1-11
Number of pages: 11
Journal: IEEE Transactions on Geoscience and Remote Sensing
Volume: 61
DOIs
Publication status: Published - 2023

Keywords

  • Feature fusion
  • Swin transformer
  • remote sensing image
  • semantic segmentation
