Abstract
Remote sensing semantic segmentation plays a significant role in applications such as environmental monitoring, land use planning, and disaster response. Convolutional neural networks (CNNs) have long dominated remote sensing semantic segmentation; however, because of the local nature of convolution operations, they cannot effectively model global context. The success of transformers in the natural language processing (NLP) domain offers a new approach to global context modeling. Inspired by the Swin transformer, we propose a novel remote sensing semantic segmentation model called CSTUNet. The model employs a dual-encoder structure consisting of a CNN-based main encoder and a Swin transformer-based auxiliary encoder. We first utilize a detail-structure preservation module (DPM) to mitigate the loss of detail and structural information caused by downsampling in the Swin transformer. We then introduce a spatial feature enhancement module (SFE) to collect contextual information across different spatial dimensions. Finally, we construct a position-aware attention fusion module (PAFM) to fuse contextual and local information. The proposed model achieves 70.75% mean intersection over union (MIoU) on the ISPRS-Vaihingen dataset and 77.27% MIoU on the ISPRS-Potsdam dataset.
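The MIoU figures reported above follow the standard per-class intersection-over-union definition. A minimal NumPy sketch of the metric is shown below; this is an illustration of the standard formula, not the authors' evaluation code, and the function name `mean_iou` and the toy label maps are our own:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection over union (MIoU) for integer label maps.

    pred, target: arrays of class indices with the same shape.
    Classes absent from both prediction and ground truth are
    excluded from the mean.
    """
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class not present in either map
        inter = np.logical_and(p, t).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2x3 label maps with three classes.
pred = np.array([[0, 1, 1], [2, 2, 0]])
gt   = np.array([[0, 1, 2], [2, 2, 0]])
# class 0: 2/2, class 1: 1/2, class 2: 2/3 -> mean ~= 0.7222
print(round(mean_iou(pred, gt, 3), 4))
```

In practice the metric is usually accumulated over a whole test set via a confusion matrix rather than per image, but the per-class ratio is the same.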
Field | Value
---|---
Original language | English
Article number | 5530111
Pages (from-to) | 1-11
Number of pages | 11
Journal | IEEE Transactions on Geoscience and Remote Sensing
Volume | 61
DOIs | 
Publication status | Published - 2023
Keywords
- feature fusion
- remote sensing image
- semantic segmentation
- Swin transformer