CMIFDF: A lightweight cross-modal image fusion and weight-sharing object detection network framework

Chunbo Zhao; Bo Mo; Jie Zhao; Yimeng Tao; Donghui Zhao

doi:10.1016/j.infrared.2024.105631

Chunbo Zhao, Bo Mo^*, Jie Zhao, Yimeng Tao, Donghui Zhao

^*Corresponding author for this work

School of Aerospace Engineering

Research output: Contribution to journal › Article › peer-review

Abstract

Unimodal target detection can no longer meet the demands of detection in complex backgrounds and harsh environments. Existing cross-modal image fusion and cross-modal target detection algorithms suffer from heavy parameter counts and redundant network designs. To address these problems, a selectable cross-modal image fusion and target detection framework (CMIFDF) is proposed. The framework consists of a lightweight dual-branch cross-modal image fusion network (LDFnet) and a cross-modal object detection algorithm with shareable weights (CM-YOLO), which together make rational use of cross-modal image information and improve target detection performance under complex backgrounds. LDFnet is a two-branch fusion module based on depthwise separable convolution and attention mechanisms that quickly and fully extracts feature information from visible and infrared images. In CM-YOLO, either fused images or raw images (visible and infrared) are fed into a target detection network with shareable weights for training and detection. A simplified asymptotic feature pyramid network (SAFPN) is proposed, and a lightweight multilayer perceptual attention module (LMA) is designed to enhance the fusion efficiency of the fusion network, so that features are fused efficiently with fewer model parameters and low power consumption, improving detection performance. Experiments on publicly available datasets show that the framework makes full use of the feature information in cross-modal input images and effectively improves detection performance in complex environments.
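The abstract attributes part of the framework's small parameter count to depthwise separable convolution. As a rough illustration of why that choice reduces parameters, the sketch below compares the weight count of a standard convolution with that of a depthwise convolution followed by a 1 x 1 pointwise convolution. The layer sizes and function names are hypothetical, chosen only for illustration; they are not taken from the paper and do not describe LDFnet's actual layers.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise convolution plus a 1 x 1 pointwise convolution (bias omitted)."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes (assumption, not from the paper).
c_in, c_out, k = 64, 128, 3

standard = conv_params(c_in, c_out, k)
separable = depthwise_separable_params(c_in, c_out, k)
ratio = separable / standard

print(standard, separable, round(ratio, 3))  # 73728 8768 0.119
```

For these assumed sizes the separable layer needs about 12% of the weights of the standard layer; in general the ratio is 1/c_out + 1/k^2.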

Original language	English
Article number	105631
Journal	Infrared Physics and Technology
Volume	145
DOIs	https://doi.org/10.1016/j.infrared.2024.105631
Publication status	Published - Mar 2025

Keywords

CM-YOLO
Complex backgrounds
Cross-modal images fusion
Shareable weights

Cite this

@article{41f33b3553a54ad0bba204c7bdb618d6,

title = "CMIFDF: A lightweight cross-modal image fusion and weight-sharing object detection network framework",

abstract = "In today's research, unimodal target detection can no longer meet the needs of target detection in complex backgrounds as well as harsh environments. To solve the problems of the existing cross-modal image fusion and cross-modal image target detection algorithms with network-heavy parameters and redundant network design, a selectable cross-modal image fusion and target detection algorithm framework (CMIFDF) is proposed. The framework consists of a lightweight dual-branch cross-modal image fusion network (LDFnet) and a cross-modal object detection algorithm with shareable weights (CM-YOLO) to rationally utilize the cross-modal image information and improve the performance of target detection under complex backgrounds. LDFnet is a two-branch fusion module based on depth-separable convolutional and attentional mechanisms. It can quickly and fully extract feature information from visible and infrared images. In CM-YOLO, fused images or raw images (visible and infrared) are fed into a target detection network with shareable weights for training and detection. A simplified asymptotic feature pyramid network (SAFPN) is proposed, and a lightweight multilayer perceptual attention module (LMA) is designed to enhance the fusion efficiency of the fusion network, so that efficient fusion of features can be achieved with fewer model parameters and low dissipation power to improve the network detection performance. Experiments on publicly available datasets show that the algorithmic framework can make full use of the feature information of cross-modal images as inputs and can effectively improve detection performance in complex environments.",

keywords = "CM-YOLO, Complex backgrounds, Cross-modal images fusion, Shareable weights",

author = "Chunbo Zhao and Bo Mo and Jie Zhao and Yimeng Tao and Donghui Zhao",

note = "Publisher Copyright: {\textcopyright} 2024 Elsevier B.V.",

year = "2025",

month = mar,

doi = "10.1016/j.infrared.2024.105631",

language = "English",

volume = "145",

journal = "Infrared Physics and Technology",

issn = "1350-4495",

publisher = "Elsevier B.V.",

}

TY - JOUR

T1 - CMIFDF

T2 - A lightweight cross-modal image fusion and weight-sharing object detection network framework

AU - Zhao, Chunbo

AU - Mo, Bo

AU - Zhao, Jie

AU - Tao, Yimeng

AU - Zhao, Donghui

PY - 2025/3

Y1 - 2025/3

N2 - In today's research, unimodal target detection can no longer meet the needs of target detection in complex backgrounds as well as harsh environments. To solve the problems of the existing cross-modal image fusion and cross-modal image target detection algorithms with network-heavy parameters and redundant network design, a selectable cross-modal image fusion and target detection algorithm framework (CMIFDF) is proposed. The framework consists of a lightweight dual-branch cross-modal image fusion network (LDFnet) and a cross-modal object detection algorithm with shareable weights (CM-YOLO) to rationally utilize the cross-modal image information and improve the performance of target detection under complex backgrounds. LDFnet is a two-branch fusion module based on depth-separable convolutional and attentional mechanisms. It can quickly and fully extract feature information from visible and infrared images. In CM-YOLO, fused images or raw images (visible and infrared) are fed into a target detection network with shareable weights for training and detection. A simplified asymptotic feature pyramid network (SAFPN) is proposed, and a lightweight multilayer perceptual attention module (LMA) is designed to enhance the fusion efficiency of the fusion network, so that efficient fusion of features can be achieved with fewer model parameters and low dissipation power to improve the network detection performance. Experiments on publicly available datasets show that the algorithmic framework can make full use of the feature information of cross-modal images as inputs and can effectively improve detection performance in complex environments.

KW - CM-YOLO

KW - Complex backgrounds

KW - Cross-modal images fusion

KW - Shareable weights

UR - http://www.scopus.com/inward/record.url?scp=85211030786&partnerID=8YFLogxK

U2 - 10.1016/j.infrared.2024.105631

DO - 10.1016/j.infrared.2024.105631

M3 - Article

AN - SCOPUS:85211030786

SN - 1350-4495

VL - 145

JO - Infrared Physics and Technology

JF - Infrared Physics and Technology

M1 - 105631

ER -
