TY - GEN
T1 - A Dual-Branch Network Based on ViT and Mamba for Semantic Segmentation of Remote Sensing Image
AU - An, Ke
AU - Wang, Ying
AU - Chen, Liang
AU - Wang, Yupei
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Semantic segmentation of remote sensing images has significant applications across a wide range of scenarios. The prevailing frameworks are Convolutional Neural Networks (CNNs) and Transformers. However, CNNs are limited by the local receptive field of convolutions, while Transformers are constrained by the quadratic computational complexity of self-attention, which in practice restricts attention computation to local windows and prevents effective modeling of long-range dependencies. The efficient Mamba architecture, characterized by linear complexity, offers a promising solution to these challenges. Inspired by Mamba, we propose a dual-branch network based on the Vision Transformer (ViT) and Mamba. The ViT branch adopts the Swin Transformer to model spatial details while keeping computational cost within acceptable bounds; complementarily, the Mamba branch efficiently captures global context and long-range dependencies. Additionally, to suppress the noise and conflicting information that arise when fusing features from different frameworks, we design the Cross-Model Fusion Module (CMFM) and the Cross-Model Relevance Loss (CMRLoss), which enforce semantic consistency during fusion. Comprehensive experiments on the widely used GaoFen-2 and iSAID datasets demonstrate the advantages of our approach over leading methods in the field.
KW - Mamba
KW - Remote Sensing Image
KW - Semantic Segmentation
KW - Vision Transformer
UR - http://www.scopus.com/inward/record.url?scp=86000032325&partnerID=8YFLogxK
U2 - 10.1109/ICSIDP62679.2024.10869220
DO - 10.1109/ICSIDP62679.2024.10869220
M3 - Conference contribution
AN - SCOPUS:86000032325
T3 - IEEE International Conference on Signal, Information and Data Processing, ICSIDP 2024
BT - IEEE International Conference on Signal, Information and Data Processing, ICSIDP 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2nd IEEE International Conference on Signal, Information and Data Processing, ICSIDP 2024
Y2 - 22 November 2024 through 24 November 2024
ER -