A Transformer-based multi-modal fusion network for semantic segmentation of high-resolution remote sensing imagery

Yutong Liu, Kun Gao*, Hong Wang, Zhijia Yang, Pengyu Wang, Shijing Ji, Yanjun Huang, Zhenyu Zhu, Xiaobin Zhao

*Corresponding author of this work

Research output: Contribution to journal › Article › peer-review

Abstract

Semantic segmentation of high-resolution multispectral remote sensing images has been intensely studied. However, shadow occlusions and similar colors and textures across categories degrade segmentation accuracy. Moreover, targets in remote sensing images vary widely in size, and networks struggle to segment all scales equally well. This paper introduces the Transformer-based Multi-modal Fusion Network (TMFNet), which fuses multi-modal features and incorporates height information from the digital surface model (DSM) to supply additional discriminative cues between categories. Specifically, we introduce two parallel encoders to extract features from the different modalities, a Multi-Modal fusion module based on the Transformer (MMformer) to perform the multi-modal fusion, and a Border Region Attention based multi-level Fusion Module (BRAFM) to integrate cross-level features and improve small-target segmentation by exploiting details around object borders. Experimental results on the ISPRS Vaihingen and Potsdam benchmark datasets indicate that the proposed TMFNet outperforms state-of-the-art methods in segmentation performance.
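To make the fusion idea concrete, the sketch below shows one common way to combine RGB and DSM feature maps with bidirectional cross-attention, as the abstract's description of MMformer suggests. This is a minimal illustration under assumed details, not the paper's actual implementation: the class name CrossModalFusion, the dimensions, and the residual-plus-projection layout are all our assumptions.

```python
# Minimal sketch of Transformer-based cross-attention fusion between an RGB
# feature map and a DSM (height) feature map. All names and hyperparameters
# here are illustrative assumptions, not the paper's MMformer code.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuse RGB and DSM feature maps with bidirectional cross-attention."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.rgb_from_dsm = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dsm_from_rgb = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(d_model)
        self.norm_dsm = nn.LayerNorm(d_model)
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, rgb: torch.Tensor, dsm: torch.Tensor) -> torch.Tensor:
        # rgb, dsm: (B, C, H, W) feature maps from the two parallel encoders.
        b, c, h, w = rgb.shape
        rgb_seq = rgb.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        dsm_seq = dsm.flatten(2).transpose(1, 2)

        # Each modality queries the other, followed by a residual + norm step.
        rgb_att, _ = self.rgb_from_dsm(rgb_seq, dsm_seq, dsm_seq)
        dsm_att, _ = self.dsm_from_rgb(dsm_seq, rgb_seq, rgb_seq)
        rgb_seq = self.norm_rgb(rgb_seq + rgb_att)
        dsm_seq = self.norm_dsm(dsm_seq + dsm_att)

        # Concatenate along channels, project back to d_model, restore (B, C, H, W).
        fused = self.proj(torch.cat([rgb_seq, dsm_seq], dim=-1))
        return fused.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    fusion = CrossModalFusion(d_model=256, n_heads=8)
    rgb_feat = torch.randn(2, 256, 32, 32)   # e.g. stride-16 RGB features
    dsm_feat = torch.randn(2, 256, 32, 32)   # matching DSM features
    print(fusion(rgb_feat, dsm_feat).shape)  # torch.Size([2, 256, 32, 32])
```

The key design point this illustrates is that cross-attention lets height cues from the DSM disambiguate regions where color and texture alone are misleading (e.g. shadows), which is the motivation the abstract gives for fusing the two modalities.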

Original language: English
Article number: 104083
Journal: International Journal of Applied Earth Observation and Geoinformation
Volume: 133
DOI
Publication status: Published - September 2024
