TY - GEN
T1 - Multi-Modal Domain Generalization for Cross-Scene Hyperspectral Image Classification
AU - Zhang, Yuxiang
AU - Zhang, Mengmeng
AU - Li, Wei
AU - Tao, Ran
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Large-scale pre-trained image-text foundation models have excelled in a number of downstream applications. However, most domain generalization techniques have not exploited linguistic modal knowledge to enhance model generalization performance. Additionally, text information has been ignored in hyperspectral image (HSI) classification tasks. To address these shortcomings, a Multi-modal Domain Generalization Network (MDG) is proposed to learn cross-domain invariant representations from a cross-domain shared semantic space. Only the source domain (SD) is used for training, after which the model is transferred directly to the target domain (TD). Visual and linguistic features are extracted by a dual-stream architecture consisting of an image encoder and a text encoder. A generator is designed to obtain extended-domain (ED) samples that differ from the SD. Furthermore, linguistic features are used to construct the cross-domain shared semantic space, where visual-linguistic alignment is accomplished by supervised contrastive learning. Extensive experiments on two datasets show that the proposed method outperforms state-of-the-art approaches.
AB - Large-scale pre-trained image-text foundation models have excelled in a number of downstream applications. However, most domain generalization techniques have not exploited linguistic modal knowledge to enhance model generalization performance. Additionally, text information has been ignored in hyperspectral image (HSI) classification tasks. To address these shortcomings, a Multi-modal Domain Generalization Network (MDG) is proposed to learn cross-domain invariant representations from a cross-domain shared semantic space. Only the source domain (SD) is used for training, after which the model is transferred directly to the target domain (TD). Visual and linguistic features are extracted by a dual-stream architecture consisting of an image encoder and a text encoder. A generator is designed to obtain extended-domain (ED) samples that differ from the SD. Furthermore, linguistic features are used to construct the cross-domain shared semantic space, where visual-linguistic alignment is accomplished by supervised contrastive learning. Extensive experiments on two datasets show that the proposed method outperforms state-of-the-art approaches.
KW - Contrastive Learning
KW - Cross-Scene
KW - Domain Generalization
KW - Hyperspectral Image Classification
KW - Multiple-modality
KW - Natural Language Supervision
UR - http://www.scopus.com/inward/record.url?scp=85177596915&partnerID=8YFLogxK
U2 - 10.1109/ICASSP49357.2023.10095723
DO - 10.1109/ICASSP49357.2023.10095723
M3 - Conference contribution
AN - SCOPUS:85177596915
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
BT - ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023
Y2 - 4 June 2023 through 10 June 2023
ER -