TY - JOUR
T1 - Mask-Guided Cross-Modality Fusion Network for Visible-Infrared Vehicle Detection
AU - Tian, Lingyun
AU - Shen, Qiang
AU - Deng, Zilong
AU - Gao, Yang
AU - Wang, Simiao
N1 - Publisher Copyright:
© 1994-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Drone-based vehicle detection is crucial for intelligent traffic management. However, current methods relying solely on a single visible or infrared modality struggle with precision and robustness, especially in adverse weather conditions. The effective integration of cross-modal information to enhance vehicle detection still poses significant challenges. In this letter, we propose a mask-guided cross-modality fusion method, called MCMF, for robust and accurate visible-infrared vehicle detection. Firstly, we construct a framework consisting of three branches: two dedicated to the visible and infrared modalities, respectively, and a third tailored for the fused multi-modal features. Secondly, we introduce a Location-Sensitive Masked AutoEncoder (LMAE) for intermediate-level feature fusion. Specifically, our LMAE uses masks to cover intermediate-level features of one modality based on the prediction hierarchy of the other modality, and then distills cross-modality guidance information through regularization constraints. Through a self-learning paradigm, this strategy effectively preserves the useful information from both modalities while eliminating the redundant information in each. Finally, the fused features are fed into an uncertainty-based detection head to generate vehicle bounding-box predictions. When evaluated on the DroneVehicle dataset, our MCMF reaches 71.42% mAP, outperforming an established baseline method by 7.42%. Ablation studies further demonstrate the effectiveness of our LMAE for visible-infrared fusion.
AB - Drone-based vehicle detection is crucial for intelligent traffic management. However, current methods relying solely on a single visible or infrared modality struggle with precision and robustness, especially in adverse weather conditions. The effective integration of cross-modal information to enhance vehicle detection still poses significant challenges. In this letter, we propose a mask-guided cross-modality fusion method, called MCMF, for robust and accurate visible-infrared vehicle detection. Firstly, we construct a framework consisting of three branches: two dedicated to the visible and infrared modalities, respectively, and a third tailored for the fused multi-modal features. Secondly, we introduce a Location-Sensitive Masked AutoEncoder (LMAE) for intermediate-level feature fusion. Specifically, our LMAE uses masks to cover intermediate-level features of one modality based on the prediction hierarchy of the other modality, and then distills cross-modality guidance information through regularization constraints. Through a self-learning paradigm, this strategy effectively preserves the useful information from both modalities while eliminating the redundant information in each. Finally, the fused features are fed into an uncertainty-based detection head to generate vehicle bounding-box predictions. When evaluated on the DroneVehicle dataset, our MCMF reaches 71.42% mAP, outperforming an established baseline method by 7.42%. Ablation studies further demonstrate the effectiveness of our LMAE for visible-infrared fusion.
KW - Drone-based vehicle detection
KW - location-sensitive masked autoencoder
KW - mask-guided cross-modality fusion
KW - regularization constraint
UR - http://www.scopus.com/inward/record.url?scp=105003460921&partnerID=8YFLogxK
U2 - 10.1109/LSP.2025.3562816
DO - 10.1109/LSP.2025.3562816
M3 - Article
AN - SCOPUS:105003460921
SN - 1070-9908
JO - IEEE Signal Processing Letters
JF - IEEE Signal Processing Letters
ER -