TY - JOUR
T1 - A Transformer-based multi-modal fusion network for semantic segmentation of high-resolution remote sensing imagery
AU - Liu, Yutong
AU - Gao, Kun
AU - Wang, Hong
AU - Yang, Zhijia
AU - Wang, Pengyu
AU - Ji, Shijing
AU - Huang, Yanjun
AU - Zhu, Zhenyu
AU - Zhao, Xiaobin
N1 - Publisher Copyright:
© 2024 The Authors
PY - 2024/9
Y1 - 2024/9
N2 - Semantic segmentation of high-resolution multispectral remote sensing images has been intensively studied. However, shadow occlusions and similar colors and textures between categories degrade segmentation accuracy. Moreover, targets in remote sensing images vary widely in size, making it difficult for a network to balance their segmentation. This paper introduces the Transformer-based Multi-modal Fusion Network (TMFNet), which fuses multi-modal features and incorporates height features from the digital surface model (DSM) to supply additional discriminative cues between categories. Specifically, we introduce two parallel encoders to extract features from the different modalities, a Transformer-based multi-modal fusion model (MMformer) to perform the multi-modal fusion, and a Border Region Attention based multi-level Fusion Module (BRAFM) to integrate cross-level features and enhance small-target segmentation by exploiting details around borders. Experimental results on the ISPRS Vaihingen and Potsdam benchmark datasets show that the proposed TMFNet outperforms state-of-the-art methods in segmentation performance.
AB - Semantic segmentation of high-resolution multispectral remote sensing images has been intensively studied. However, shadow occlusions and similar colors and textures between categories degrade segmentation accuracy. Moreover, targets in remote sensing images vary widely in size, making it difficult for a network to balance their segmentation. This paper introduces the Transformer-based Multi-modal Fusion Network (TMFNet), which fuses multi-modal features and incorporates height features from the digital surface model (DSM) to supply additional discriminative cues between categories. Specifically, we introduce two parallel encoders to extract features from the different modalities, a Transformer-based multi-modal fusion model (MMformer) to perform the multi-modal fusion, and a Border Region Attention based multi-level Fusion Module (BRAFM) to integrate cross-level features and enhance small-target segmentation by exploiting details around borders. Experimental results on the ISPRS Vaihingen and Potsdam benchmark datasets show that the proposed TMFNet outperforms state-of-the-art methods in segmentation performance.
KW - High-resolution remote sensing
KW - Multi-modal fusion
KW - Semantic segmentation
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85201016247&partnerID=8YFLogxK
U2 - 10.1016/j.jag.2024.104083
DO - 10.1016/j.jag.2024.104083
M3 - Article
AN - SCOPUS:85201016247
SN - 1569-8432
VL - 133
JO - International Journal of Applied Earth Observation and Geoinformation
JF - International Journal of Applied Earth Observation and Geoinformation
M1 - 104083
ER -