用于多光谱和高光谱图像融合的联合自注意力 Transformer

Miaoyu Li; Ying Fu

doi:10.11834/jig.220954

Translated title of the contribution: Joint self-attention Transformer for multispectral and hyperspectral image fusion

Miaoyu Li, Ying Fu^*

^*Corresponding author for this work

School of Computer Science and Technology

Beijing Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Objective Hyperspectral images (HSIs) contain rich spectral information and have an advantage over multispectral images (MSIs) in accurately distinguishing different types of materials. HSIs have therefore been widely used in many computer vision tasks, including vegetation detection, face recognition, and feature segmentation. However, limitations in hardware and in the acquisition environment impose an inevitable trade-off between spatial resolution and spectral resolution. As a result, HSIs captured in real scenes often have low spatial resolution, which degrades the performance of subsequent vision tasks. Fusing a low-resolution HSI (LR-HSI) with a high-resolution MSI (HR-MSI) of the same scene through an HSI super-resolution algorithm can effectively improve the spatial resolution of the HSI. Existing HSI fusion algorithms can be roughly classified into traditional-model-based and deep-learning-based methods. Traditional-model-based fusion methods employ various handcrafted shallow priors (e.g., matrix/tensor factorization, total variation, and low rank) to exploit the intrinsic statistics of the observed spectral images. However, these methods generalize poorly to complex real scenarios and spend considerable time iteratively optimizing the designed prior. Deep-learning-based fusion methods, in contrast, can automatically learn prior knowledge from large-scale datasets. Although these methods often achieve better fusion results than traditional-model-based methods, they do not jointly explore the inner self-similarity of multi-source spectral images, where the LR-HSI shows high correlation in the spectral dimension and the HR-MSI shows spatial similarities in texture and edges. In addition, the weights of these convolution-based networks are learned during training but fixed during testing, which limits the potential adaptability of the networks. To effectively exploit the inner spatial and spectral similarity of spectral images, we propose an MSI and HSI fusion network with a joint self-attention fusion module and a Transformer.

Method Given that the LR-HSI carries reliable information in the spectral dimension, the critical task of an HSI fusion method is to fill in the missing texture details in the spatial dimension without losing discriminative spectral information. Given the LR-HSI and its matching HR-MSI, the proposed method fuses these two spectral images to obtain the desired HR-HSI in three steps. First, the similarity information of the LR-HSI and HR-MSI is extracted by the joint self-attention module. Specifically, the spectral similarity features of the LR-HSI are extracted by the channel attention module, and the spatial similarity features of the HR-MSI are extracted by the spatial attention module. The obtained similarity features are then used to guide the fusion process. Second, to achieve a deep representation and explore the long-range dependencies of the fusion features, the preliminary fusion features are fed into a deep Transformer network, which comprises a shifted-window attention module, LayerNorm, and a multilayer perceptron. A convolution layer and a skip connection are also included in the proposed Transformer fusion network to further enhance the flexibility of the model. Third, the fusion features from the Transformer are mapped to the desired high-resolution HSI. The overall network is implemented in the PyTorch framework and trained in an end-to-end manner. To generate training data, the training images are cropped to a size of 96 × 96 × 31, resulting in approximately 8,000 training patches, which are smoothed by a Gaussian blur kernel and spatially down-sampled to obtain the LR-HSI. The MSI images are generated with the spectral response function of a Nikon D700 camera.

Result We compare our method with seven state-of-the-art fusion methods, including one traditional-model-based method and six deep-learning-based methods. The peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), erreur relative globale adimensionnelle de synthèse (ERGAS), and spectral angle mapper (SAM) are used as quantitative metrics to evaluate the performance of these fusion methods. To verify the effectiveness of the proposed model, we perform experiments on two widely used HSI datasets, namely, the CAVE and Harvard datasets. For the CAVE dataset, the first 20 images are selected for training, and the last 12 images are used for testing. Similarly, for the Harvard dataset, the first 30 images are selected for training, and the last 20 images are used for testing. Experimental results under different scale factors show that the proposed method achieves better fusion results than the other state-of-the-art methods in terms of both quantitative metrics and visual quality. Under a scale factor of 8, the PSNR, SAM, and ERGAS of the proposed method are improved by 0.5 dB, 0.13, and 0.2, respectively, compared to EDBIN, which is the second best-performing method on the CAVE dataset. Under a scale factor of 16, the PSNR of the proposed method is improved by at least 0.4 dB compared to the other methods on the Harvard dataset. The visual results show that the proposed method outperforms the other methods in recovering both fine-grained spatial textures and spectral details. The ablation study also shows that the employed Transformer fusion network significantly improves the fusion process.

Conclusion In this paper, we propose a Transformer-based MSI and HSI fusion network with a joint self-attention fusion module, which effectively uses the spectral similarity of the LR-HSI and the spatial similarity of the HR-MSI to guide the fusion process through a 2D attention mechanism. The preliminary fusion results pass through the residual Transformer network to obtain a deep feature representation and to reconstruct the desired HR-HSI. Qualitative and quantitative experiments show that the proposed method has better spectral fidelity and spatial resolution than state-of-the-art HSI fusion methods.
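
The quantitative metrics named in the abstract have widely used textbook definitions. As a minimal sketch, assuming images are stored as H × W × C NumPy arrays scaled to [0, 1], three of them (PSNR, SAM, and ERGAS) could be computed as below. The function names, the per-band averaging in PSNR, and reporting SAM in degrees are assumptions made for illustration; the abstract does not state which conventions the paper uses. SSIM is left out for brevity.

```python
import numpy as np


def psnr(ref, est, data_range=1.0):
    """Peak signal-to-noise ratio in dB, averaged over spectral bands."""
    mse = np.mean((ref - est) ** 2, axis=(0, 1))  # per-band mean squared error
    mse = np.maximum(mse, 1e-12)  # guard against log of zero for identical bands
    return float(np.mean(10.0 * np.log10(data_range ** 2 / mse)))


def sam(ref, est, eps=1e-12):
    """Mean spectral angle, in degrees, between per-pixel spectra."""
    dot = np.sum(ref * est, axis=2)
    norms = np.linalg.norm(ref, axis=2) * np.linalg.norm(est, axis=2)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.degrees(np.mean(np.arccos(cos))))


def ergas(ref, est, scale):
    """ERGAS = (100 / scale) * sqrt(mean over bands of MSE_b / mean_b^2)."""
    mse = np.mean((ref - est) ** 2, axis=(0, 1))
    mean_sq = np.mean(ref, axis=(0, 1)) ** 2
    return float(100.0 / scale * np.sqrt(np.mean(mse / mean_sq)))


# Toy check on a constant 4 x 4 x 3 cube (hypothetical data, not from the paper).
ref = np.full((4, 4, 3), 0.5)
est = ref + 0.1
print(psnr(ref, est), sam(ref, est), ergas(ref, est, scale=8))
```

Higher PSNR is better, whereas lower SAM and ERGAS are better, which is why an "improvement" in the last two corresponds to a decrease in value.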

Translated title of the contribution	Joint self-attention Transformer for multispectral and hyperspectral image fusion
Original language	Chinese (Traditional)
Pages (from-to)	3922-3934
Number of pages	13
Journal	Journal of Image and Graphics
Volume	28
Issue number	12
DOIs	https://doi.org/10.11834/jig.220954
Publication status	Published - Dec 2023

Cite this

@article{347fc8344d464829b1b4cf9b9cf98604,

title = "用于多光谱和高光谱图像融合的联合自注意力 Transformer",

keywords = "Transformer, fusion method, hyperspectral images, joint self-attention, multispectral images, super-resolution",

author = "Miaoyu Li and Ying Fu",

note = "Publisher Copyright: {\textcopyright} The Author(s), 2023.",

year = "2023",

month = dec,

doi = "10.11834/jig.220954",

language = "Chinese (Traditional)",

volume = "28",

pages = "3922--3934",

journal = "Journal of Image and Graphics",

issn = "1006-8961",

publisher = "Editorial and Publishing Board of JIG",

number = "12",

}

TY - JOUR

T1 - 用于多光谱和高光谱图像融合的联合自注意力 Transformer

AU - Li, Miaoyu

AU - Fu, Ying

N1 - Publisher Copyright: © The Author(s), 2023.

PY - 2023/12

Y1 - 2023/12

KW - Transformer

KW - fusion method

KW - hyperspectral images

KW - joint self-attention

KW - multispectral images

KW - super-resolution

UR - http://www.scopus.com/inward/record.url?scp=85182878324&partnerID=8YFLogxK

U2 - 10.11834/jig.220954

DO - 10.11834/jig.220954

M3 - Article

AN - SCOPUS:85182878324

SN - 1006-8961

VL - 28

SP - 3922

EP - 3934

JO - Journal of Image and Graphics

JF - Journal of Image and Graphics

IS - 12

ER -
