TY - JOUR
T1 - ISCDFuse: Interval sampling correlation driven visual state space models for multimodal image fusion
AU - Zhang, Lian
AU - Wang, Lingxue
AU - Wu, Yuzhen
AU - Chen, Mingkun
AU - Zheng, Dezhi
AU - Cai, Yi
N1 - Publisher Copyright:
© 2025
PY - 2025/8/1
Y1 - 2025/8/1
N2 - Multimodal image fusion aims to retain functional highlights and detailed textures from different modalities. To address the shortcomings of existing methods in time complexity and in the efficiency of cross-modal global information extraction, we propose a cross-domain distance learning image fusion framework based on Visual State Space Models (VSSMs), termed the Interval Sampling Correlation-Driven Fusion Network (ISCDFuse). ISCDFuse employs a dual-branch feature extractor comprising a Cross-domain Feature Association Encoder (CFAE), a High- and Low-frequency feature Extraction (HLoE) module, and a Vmamba-based Decoder (VD) for feature fusion and image generation. The CFAE traverses the two modal spatial domains with an interval sampling cross-scan module, converting non-causal visual images into ordered patch sequences and thereby strengthening the correlation of global features across modalities. The low-frequency feature extractors in the HLoE and VD modules both adopt a residual visual Mamba structure with a multi-directional skip-scanning approach that performs bi-stride sampling of the image, enhancing deep semantic feature extraction and effectively modeling long-range spatial dependencies. The high-frequency feature extractor employs an Invertible Neural Network (INN) block to extract nuanced texture details. Extensive experiments demonstrate that ISCDFuse delivers excellent fusion performance at high speed on both visible-infrared and medical image fusion. Notably, in unified benchmark tests, ISCDFuse shows clear practical value for downstream multimodal image processing tasks such as visible-infrared object detection.
AB - Multimodal image fusion aims to retain functional highlights and detailed textures from different modalities. To address the shortcomings of existing methods in time complexity and in the efficiency of cross-modal global information extraction, we propose a cross-domain distance learning image fusion framework based on Visual State Space Models (VSSMs), termed the Interval Sampling Correlation-Driven Fusion Network (ISCDFuse). ISCDFuse employs a dual-branch feature extractor comprising a Cross-domain Feature Association Encoder (CFAE), a High- and Low-frequency feature Extraction (HLoE) module, and a Vmamba-based Decoder (VD) for feature fusion and image generation. The CFAE traverses the two modal spatial domains with an interval sampling cross-scan module, converting non-causal visual images into ordered patch sequences and thereby strengthening the correlation of global features across modalities. The low-frequency feature extractors in the HLoE and VD modules both adopt a residual visual Mamba structure with a multi-directional skip-scanning approach that performs bi-stride sampling of the image, enhancing deep semantic feature extraction and effectively modeling long-range spatial dependencies. The high-frequency feature extractor employs an Invertible Neural Network (INN) block to extract nuanced texture details. Extensive experiments demonstrate that ISCDFuse delivers excellent fusion performance at high speed on both visible-infrared and medical image fusion. Notably, in unified benchmark tests, ISCDFuse shows clear practical value for downstream multimodal image processing tasks such as visible-infrared object detection.
KW - Interval sampling
KW - Multimodal image fusion
KW - VIF and MIF
KW - Visual State Space Models
UR - http://www.scopus.com/inward/record.url?scp=105004260336&partnerID=8YFLogxK
U2 - 10.1016/j.neucom.2025.130329
DO - 10.1016/j.neucom.2025.130329
M3 - Article
AN - SCOPUS:105004260336
SN - 0925-2312
VL - 640
JO - Neurocomputing
JF - Neurocomputing
M1 - 130329
ER -