U-Shaped CNN-ViT Siamese Network with Learnable Mask Guidance for Remote Sensing Building Change Detection

Yongjing Cui; He Chen; Shan Dong; Guanqun Wang; Yin Zhuang

doi:10.1109/JSTARS.2024.3408604


Yongjing Cui, He Chen^*, Shan Dong, Guanqun Wang, Yin Zhuang^*

^*Corresponding author for this work

School of Information and Electronics

Beijing Institute of Technology

Research output: Contribution to journal › Article › Peer-review

3 Citations (Scopus)

Abstract

Building change detection (BCD) aims to identify new or disappeared buildings from bitemporal images. However, the varied scales and appearances of buildings, along with the challenge of pseudochange interference from complex backgrounds, make it difficult to accurately extract complete changes. To address these challenges in BCD, a U-shaped hybrid Siamese network combining a convolutional neural network and a vision transformer (CNN-ViT) with learnable mask guidance, called U-Conformer, is designed. First, a new hybrid architecture of U-Conformer is proposed. The architecture integrates the strengths of CNNs and ViTs to establish a robust, multiscale heterogeneous representation that aids in detecting buildings of various sizes. Second, a learnable mask guidance module is specifically designed for U-Conformer, focusing the multiscale heterogeneous representation on extracting relevant scale changes while progressively suppressing pseudochanges. Furthermore, for the U-Conformer architecture, a mask information joint class-balanced loss function that combines the binary cross-entropy loss function and the dice loss function is devised, significantly mitigating the issue of class imbalance. Experimental results on three publicly available change detection datasets, LEVIR-CD, WHU-CD, and GZ-CD, demonstrate that U-Conformer surpasses previous methods, achieving F1 scores of 91.5%, 94.6%, and 86.7%, as well as IoU scores of 84.3%, 89.7%, and 76.5% on the LEVIR-CD, WHU-CD, and GZ-CD datasets, respectively.
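The abstract names a loss that combines binary cross-entropy (BCE) and Dice loss, and reports results as F1 and IoU. As an illustration only, here is a minimal sketch in plain Python of a generic weighted BCE + soft Dice loss on flattened per-pixel change probabilities. It is not the authors' mask information joint class-balanced loss: its exact form, term weights, and mask term are not given in this record, and the function names and the `pos_weight`, `w_bce`, `w_dice`, and `smooth` parameters are hypothetical. The sketch also includes a helper for the standard identity IoU = F1 / (2 - F1), which holds only if both metrics are computed from the same binary confusion matrix for the change class (an assumption about the paper's evaluation).

```python
import math
from typing import Sequence


def weighted_bce(probs: Sequence[float], targets: Sequence[float],
                 pos_weight: float = 1.0, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy; pos_weight (hypothetical) up-weights changed pixels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp so log() never sees 0
        total += -(pos_weight * t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)


def soft_dice_loss(probs: Sequence[float], targets: Sequence[float],
                   smooth: float = 1.0) -> float:
    """Soft Dice loss: 1 - (2*sum(p*t) + smooth) / (sum(p) + sum(t) + smooth)."""
    inter = sum(p * t for p, t in zip(probs, targets))
    return 1.0 - (2.0 * inter + smooth) / (sum(probs) + sum(targets) + smooth)


def bce_dice_loss(probs, targets, w_bce=1.0, w_dice=1.0, pos_weight=1.0):
    # Hypothetical weighted sum of the two terms; the paper's actual weighting
    # and its mask-information term are not described in this record.
    return (w_bce * weighted_bce(probs, targets, pos_weight)
            + w_dice * soft_dice_loss(probs, targets))


def iou_from_f1(f1: float) -> float:
    # For one binary confusion matrix, F1 = 2TP / (2TP + FP + FN) and
    # IoU = TP / (TP + FP + FN), hence IoU = F1 / (2 - F1).
    return f1 / (2.0 - f1)


if __name__ == "__main__":
    probs = [0.9, 0.2, 0.7, 0.1]   # toy predicted change probabilities
    targets = [1, 0, 1, 0]          # toy ground-truth change mask
    print(f"BCE + Dice loss: {bce_dice_loss(probs, targets):.4f}")
    for f1 in (0.915, 0.946, 0.867):  # F1 scores reported in the abstract
        print(f"F1 = {f1:.3f} -> IoU = {iou_from_f1(f1):.3f}")
```

Applied to the reported F1 scores, the identity gives values within rounding of the reported IoU scores (84.3%, 89.7%, 76.5%), since the published F1 values are themselves rounded to one decimal place. The Dice term measures overlap relative to the size of the change region, which is one common way to reduce the dominance of the unchanged class in pixel-wise losses.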

Original language	English
Pages (from-to)	11402-11418
Number of pages	17
Journal	IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Volume	17
DOI	https://doi.org/10.1109/JSTARS.2024.3408604
Publication status	Published - 2024


Cite this

Cui, Y., Chen, H., Dong, S., Wang, G., & Zhuang, Y. (2024). U-Shaped CNN-ViT Siamese Network with Learnable Mask Guidance for Remote Sensing Building Change Detection. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17, 11402-11418. https://doi.org/10.1109/JSTARS.2024.3408604

@article{f58d9492dd234f128f71396d6beaa76a,

title = "U-Shaped {CNN-ViT} {Siamese} Network with Learnable Mask Guidance for Remote Sensing Building Change Detection",

abstract = "Building change detection (BCD) aims to identify new or disappeared buildings from bitemporal images. However, the varied scales and appearances of buildings, along with the challenge of pseudochange interference from complex backgrounds, make it difficult to accurately extract complete changes. To address these challenges in BCD, a U-shaped hybrid Siamese network combining a convolutional neural network and a vision transformer (CNN-ViT) with learnable mask guidance, called U-Conformer, is designed. First, a new hybrid architecture of U-Conformer is proposed. The architecture integrates the strengths of CNNs and ViTs to establish a robust, multiscale heterogeneous representation that aids in detecting buildings of various sizes. Second, a learnable mask guidance module is specifically designed for U-Conformer, focusing the multiscale heterogeneous representation on extracting relevant scale changes while progressively suppressing pseudochanges. Furthermore, for the U-Conformer architecture, a mask information joint class-balanced loss function that combines the binary cross-entropy loss function and the dice loss function is devised, significantly mitigating the issue of class imbalance. Experimental results on three publicly available change detection datasets, LEVIR-CD, WHU-CD, and GZ-CD, demonstrate that U-Conformer surpasses previous methods, achieving F1 scores of 91.5%, 94.6%, and 86.7%, as well as IoU scores of 84.3%, 89.7%, and 76.5% on the LEVIR-CD, WHU-CD, and GZ-CD datasets, respectively.",

keywords = "Building change detection (BCD), U-shaped convolutional-neural-network-vision-transformer (CNN-ViT), learnable mask guidance, multiscale representation, remote sensing",

author = "Yongjing Cui and He Chen and Shan Dong and Guanqun Wang and Yin Zhuang",

note = "Publisher Copyright: {\textcopyright} 2008-2012 IEEE.",

year = "2024",

doi = "10.1109/JSTARS.2024.3408604",

language = "English",

volume = "17",

pages = "11402--11418",

journal = "IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing",

issn = "1939-1404",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY  - JOUR

T1  - U-Shaped CNN-ViT Siamese Network with Learnable Mask Guidance for Remote Sensing Building Change Detection

AU  - Cui, Yongjing

AU  - Chen, He

AU  - Dong, Shan

AU  - Wang, Guanqun

AU  - Zhuang, Yin

PY  - 2024

Y1  - 2024

N2  - Building change detection (BCD) aims to identify new or disappeared buildings from bitemporal images. However, the varied scales and appearances of buildings, along with the challenge of pseudochange interference from complex backgrounds, make it difficult to accurately extract complete changes. To address these challenges in BCD, a U-shaped hybrid Siamese network combining a convolutional neural network and a vision transformer (CNN-ViT) with learnable mask guidance, called U-Conformer, is designed. First, a new hybrid architecture of U-Conformer is proposed. The architecture integrates the strengths of CNNs and ViTs to establish a robust, multiscale heterogeneous representation that aids in detecting buildings of various sizes. Second, a learnable mask guidance module is specifically designed for U-Conformer, focusing the multiscale heterogeneous representation on extracting relevant scale changes while progressively suppressing pseudochanges. Furthermore, for the U-Conformer architecture, a mask information joint class-balanced loss function that combines the binary cross-entropy loss function and the dice loss function is devised, significantly mitigating the issue of class imbalance. Experimental results on three publicly available change detection datasets, LEVIR-CD, WHU-CD, and GZ-CD, demonstrate that U-Conformer surpasses previous methods, achieving F1 scores of 91.5%, 94.6%, and 86.7%, as well as IoU scores of 84.3%, 89.7%, and 76.5% on the LEVIR-CD, WHU-CD, and GZ-CD datasets, respectively.

AB  - Building change detection (BCD) aims to identify new or disappeared buildings from bitemporal images. However, the varied scales and appearances of buildings, along with the challenge of pseudochange interference from complex backgrounds, make it difficult to accurately extract complete changes. To address these challenges in BCD, a U-shaped hybrid Siamese network combining a convolutional neural network and a vision transformer (CNN-ViT) with learnable mask guidance, called U-Conformer, is designed. First, a new hybrid architecture of U-Conformer is proposed. The architecture integrates the strengths of CNNs and ViTs to establish a robust, multiscale heterogeneous representation that aids in detecting buildings of various sizes. Second, a learnable mask guidance module is specifically designed for U-Conformer, focusing the multiscale heterogeneous representation on extracting relevant scale changes while progressively suppressing pseudochanges. Furthermore, for the U-Conformer architecture, a mask information joint class-balanced loss function that combines the binary cross-entropy loss function and the dice loss function is devised, significantly mitigating the issue of class imbalance. Experimental results on three publicly available change detection datasets, LEVIR-CD, WHU-CD, and GZ-CD, demonstrate that U-Conformer surpasses previous methods, achieving F1 scores of 91.5%, 94.6%, and 86.7%, as well as IoU scores of 84.3%, 89.7%, and 76.5% on the LEVIR-CD, WHU-CD, and GZ-CD datasets, respectively.

KW  - Building change detection (BCD)

KW  - U-shaped convolutional-neural-network-vision-transformer (CNN-ViT)

KW  - learnable mask guidance

KW  - multiscale representation

KW  - remote sensing

UR  - http://www.scopus.com/inward/record.url?scp=85195365289&partnerID=8YFLogxK

U2  - 10.1109/JSTARS.2024.3408604

DO  - 10.1109/JSTARS.2024.3408604

M3  - Article

AN  - SCOPUS:85195365289

SN  - 1939-1404

VL  - 17

SP  - 11402

EP  - 11418

JO  - IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

JF  - IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

ER  -
