A Transformer-based multi-modal fusion network for semantic segmentation of high-resolution remote sensing imagery

Yutong Liu, Kun Gao*, Hong Wang, Zhijia Yang, Pengyu Wang, Shijing Ji, Yanjun Huang, Zhenyu Zhu, Xiaobin Zhao

*Corresponding author of this work

Research output: Contribution to journal › Article › peer-review

Abstract

Semantic segmentation of high-resolution multispectral remote sensing images has been intensely studied. However, shadow occlusions and similar colors and textures across categories degrade segmentation accuracy. Moreover, targets in remote sensing images vary widely in size, and networks struggle to segment all scales equally well. This paper introduces the Transformer-based Multi-modal Fusion Network (TMFNet), which fuses multi-modal features and incorporates height information from the digital surface model (DSM) to supply additional discriminative cues between categories. Specifically, we introduce two parallel encoders to extract features from the different modalities, a Multi-Modal fusion module based on the Transformer (MMformer) to perform the multi-modal fusion, and a Border Region Attention based multi-level Fusion Module (BRAFM) to integrate cross-level features and improve small-target segmentation by exploiting details around object borders. Experimental results on the ISPRS Vaihingen and Potsdam benchmark datasets indicate that the proposed TMFNet outperforms state-of-the-art methods in segmentation performance.
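To make the fusion idea concrete, the sketch below shows one common way to combine RGB and DSM feature maps with bidirectional cross-attention, as the abstract's description of MMformer suggests. This is a minimal illustration under assumed details, not the paper's actual implementation: the class name CrossModalFusion, the dimensions, and the residual-plus-projection layout are all our assumptions.

```python
# Minimal sketch of Transformer-based cross-attention fusion between an RGB
# feature map and a DSM (height) feature map. All names and hyperparameters
# here are illustrative assumptions, not the paper's MMformer code.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuse RGB and DSM feature maps with bidirectional cross-attention."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.rgb_from_dsm = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dsm_from_rgb = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(d_model)
        self.norm_dsm = nn.LayerNorm(d_model)
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, rgb: torch.Tensor, dsm: torch.Tensor) -> torch.Tensor:
        # rgb, dsm: (B, C, H, W) feature maps from the two parallel encoders.
        b, c, h, w = rgb.shape
        rgb_seq = rgb.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        dsm_seq = dsm.flatten(2).transpose(1, 2)

        # Each modality queries the other, followed by a residual + norm step.
        rgb_att, _ = self.rgb_from_dsm(rgb_seq, dsm_seq, dsm_seq)
        dsm_att, _ = self.dsm_from_rgb(dsm_seq, rgb_seq, rgb_seq)
        rgb_seq = self.norm_rgb(rgb_seq + rgb_att)
        dsm_seq = self.norm_dsm(dsm_seq + dsm_att)

        # Concatenate along channels, project back to d_model, restore (B, C, H, W).
        fused = self.proj(torch.cat([rgb_seq, dsm_seq], dim=-1))
        return fused.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    fusion = CrossModalFusion(d_model=256, n_heads=8)
    rgb_feat = torch.randn(2, 256, 32, 32)   # e.g. stride-16 RGB features
    dsm_feat = torch.randn(2, 256, 32, 32)   # matching DSM features
    print(fusion(rgb_feat, dsm_feat).shape)  # torch.Size([2, 256, 32, 32])
```

The key design point this illustrates is that cross-attention lets height cues from the DSM disambiguate regions where color and texture alone are misleading (e.g. shadows), which is the motivation the abstract gives for fusing the two modalities.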

Original language: English
Article number: 104083
Journal: International Journal of Applied Earth Observation and Geoinformation
Volume: 133
DOI
Publication status: Published - September 2024
