TY - JOUR
T1 - PromptFusion
T2 - Harmonized Semantic Prompt Learning for Infrared and Visible Image Fusion
AU - Liu, Jinyuan
AU - Li, Xingyuan
AU - Wang, Zirui
AU - Jiang, Zhiying
AU - Zhong, Wei
AU - Fan, Wei
AU - Xu, Bin
N1 - Publisher Copyright:
© 2025 Institute of Electrical and Electronics Engineers Inc. All rights reserved.
PY - 2025
Y1 - 2025
N2 - The goal of infrared and visible image fusion (IVIF) is to integrate the unique advantages of both modalities to achieve a more comprehensive understanding of a scene. However, existing methods struggle to handle modal disparities effectively, resulting in visual degradation of the details and prominent targets in the fused images. To address these challenges, we introduce PromptFusion, a prompt-based approach that harmoniously combines multi-modality images under the guidance of semantic prompts. First, to better characterize the features of different modalities, a contourlet autoencoder is designed to separate and extract their high- and low-frequency components, thereby improving the extraction of fine details and textures. We also introduce a prompt learning mechanism with positive and negative prompts, leveraging Vision-Language Models to strengthen the fusion model's understanding and identification of targets in multi-modality images, leading to improved performance in downstream tasks. Furthermore, we employ bi-level asymptotic convergence optimization, which simplifies the intricate non-singleton non-convex bi-level problem into a series of convergent and differentiable single-level optimization problems that can be solved effectively through gradient descent. Our approach advances the state of the art, delivering superior fusion quality and boosting the performance of related downstream tasks.
AB - The goal of infrared and visible image fusion (IVIF) is to integrate the unique advantages of both modalities to achieve a more comprehensive understanding of a scene. However, existing methods struggle to handle modal disparities effectively, resulting in visual degradation of the details and prominent targets in the fused images. To address these challenges, we introduce PromptFusion, a prompt-based approach that harmoniously combines multi-modality images under the guidance of semantic prompts. First, to better characterize the features of different modalities, a contourlet autoencoder is designed to separate and extract their high- and low-frequency components, thereby improving the extraction of fine details and textures. We also introduce a prompt learning mechanism with positive and negative prompts, leveraging Vision-Language Models to strengthen the fusion model's understanding and identification of targets in multi-modality images, leading to improved performance in downstream tasks. Furthermore, we employ bi-level asymptotic convergence optimization, which simplifies the intricate non-singleton non-convex bi-level problem into a series of convergent and differentiable single-level optimization problems that can be solved effectively through gradient descent. Our approach advances the state of the art, delivering superior fusion quality and boosting the performance of related downstream tasks.
KW - Bi-level optimization
KW - image fusion
KW - infrared and visible image
KW - prompt learning
UR - http://www.scopus.com/inward/record.url?scp=105001208761&partnerID=8YFLogxK
U2 - 10.1109/JAS.2024.124878
DO - 10.1109/JAS.2024.124878
M3 - Article
AN - SCOPUS:105001208761
SN - 2329-9266
VL - 12
SP - 502
EP - 515
JO - IEEE/CAA Journal of Automatica Sinica
JF - IEEE/CAA Journal of Automatica Sinica
IS - 3
ER -