Efficient Multispectral Object Detection with attentive feature aggregation leveraging zero-shot implicit illumination guidance

Zhongxia Xiong; Ziying Yao; Xuan Liu; Wenyao Zhao; Jie Cao; Xinkai Wu

doi:10.1016/j.inffus.2025.102939

Corresponding author: Xinkai Wu

School of Optoelectronics

Research output: Contribution to journal › Article › Peer-reviewed

Abstract

With visible imagery and thermal sensing data, multispectral object detection facilitates around-the-clock perception for applications such as autonomous driving. Infrared input serves as auxiliary data for cross-modality feature aggregation, a common approach demonstrated to be successful by numerous previous studies. Nevertheless, despite the inclusion of complex and time-consuming modules in many existing methods, effective information fusion remains a formidable challenge due to severe spatiotemporal misalignment and modality imbalance between visible and thermal images. Thus, this paper intends to lift both the accuracy and speed for RGB-infrared perception. To this end, an illumination-guided attentive feature aggregation model (EMOD) is introduced to achieve Efficient Multispectral Object Detection. Firstly, EMOD employs feature fusion with a local-to-nonlocal cross-modality attention mechanism, which not only mitigates pixel-wise positional variation but also captures context-level complementary information. Furthermore, to address the modality imbalance issue, a signal indicating illumination conditions is implicitly embedded into the aggregation module to guide attentive computation. Unlike previous works, this signal is more potent and practical as it functions by denoting regional lighting conditions and without requiring additional training labels. Comprehensive experiments are conducted on three widely used datasets, including KAIST, CVC-14 and FLIR. Without bells and whistles, EMOD surpasses state-of-the-art approaches in terms of both effectiveness and efficiency. For example, it achieves a 5.96 MR score on KAIST while maintaining a speed of 28 FPS on a low-cost GPU.

Original language	English
Article number	102939
Journal	Information Fusion
Volume	118
DOI	https://doi.org/10.1016/j.inffus.2025.102939
Publication status	Published - Jun 2025


Cite this

Xiong, Z., Yao, Z., Liu, X., Zhao, W., Cao, J., & Wu, X. (2025). Efficient Multispectral Object Detection with attentive feature aggregation leveraging zero-shot implicit illumination guidance. Information Fusion, 118, Article 102939. https://doi.org/10.1016/j.inffus.2025.102939

@article{a0d78cca17ee4385be7a3613845bf177,

title = "Efficient Multispectral Object Detection with attentive feature aggregation leveraging zero-shot implicit illumination guidance",

abstract = "With visible imagery and thermal sensing data, multispectral object detection facilitates around-the-clock perception for applications such as autonomous driving. Infrared input serves as auxiliary data for cross-modality feature aggregation, a common approach demonstrated to be successful by numerous previous studies. Nevertheless, despite the inclusion of complex and time-consuming modules in many existing methods, effective information fusion remains a formidable challenge due to severe spatiotemporal misalignment and modality imbalance between visible and thermal images. Thus, this paper intends to lift both the accuracy and speed for RGB-infrared perception. To this end, an illumination-guided attentive feature aggregation model (EMOD) is introduced to achieve Efficient Multispectral Object Detection. Firstly, EMOD employs feature fusion with a local-to-nonlocal cross-modality attention mechanism, which not only mitigates pixel-wise positional variation but also captures context-level complementary information. Furthermore, to address the modality imbalance issue, a signal indicating illumination conditions is implicitly embedded into the aggregation module to guide attentive computation. Unlike previous works, this signal is more potent and practical as it functions by denoting regional lighting conditions and without requiring additional training labels. Comprehensive experiments are conducted on three widely used datasets, including KAIST, CVC-14 and FLIR. Without bells and whistles, EMOD surpasses state-of-the-art approaches in terms of both effectiveness and efficiency. For example, it achieves a 5.96 MR score on KAIST while maintaining a speed of 28 FPS on a low-cost GPU.",

keywords = "Feature fusion, Light estimation, Multispectral, Object detection, Real time",

author = "Zhongxia Xiong and Ziying Yao and Xuan Liu and Wenyao Zhao and Jie Cao and Xinkai Wu",

note = "Publisher Copyright: {\textcopyright} 2025",

year = "2025",

month = jun,

doi = "10.1016/j.inffus.2025.102939",

language = "English",

volume = "118",

journal = "Information Fusion",

issn = "1566-2535",

publisher = "Elsevier B.V.",

}

TY - JOUR

T1 - Efficient Multispectral Object Detection with attentive feature aggregation leveraging zero-shot implicit illumination guidance

AU - Xiong, Zhongxia

AU - Yao, Ziying

AU - Liu, Xuan

AU - Zhao, Wenyao

AU - Cao, Jie

AU - Wu, Xinkai

PY - 2025/6

Y1 - 2025/6

N2 - With visible imagery and thermal sensing data, multispectral object detection facilitates around-the-clock perception for applications such as autonomous driving. Infrared input serves as auxiliary data for cross-modality feature aggregation, a common approach demonstrated to be successful by numerous previous studies. Nevertheless, despite the inclusion of complex and time-consuming modules in many existing methods, effective information fusion remains a formidable challenge due to severe spatiotemporal misalignment and modality imbalance between visible and thermal images. Thus, this paper intends to lift both the accuracy and speed for RGB-infrared perception. To this end, an illumination-guided attentive feature aggregation model (EMOD) is introduced to achieve Efficient Multispectral Object Detection. Firstly, EMOD employs feature fusion with a local-to-nonlocal cross-modality attention mechanism, which not only mitigates pixel-wise positional variation but also captures context-level complementary information. Furthermore, to address the modality imbalance issue, a signal indicating illumination conditions is implicitly embedded into the aggregation module to guide attentive computation. Unlike previous works, this signal is more potent and practical as it functions by denoting regional lighting conditions and without requiring additional training labels. Comprehensive experiments are conducted on three widely used datasets, including KAIST, CVC-14 and FLIR. Without bells and whistles, EMOD surpasses state-of-the-art approaches in terms of both effectiveness and efficiency. For example, it achieves a 5.96 MR score on KAIST while maintaining a speed of 28 FPS on a low-cost GPU.

AB - With visible imagery and thermal sensing data, multispectral object detection facilitates around-the-clock perception for applications such as autonomous driving. Infrared input serves as auxiliary data for cross-modality feature aggregation, a common approach demonstrated to be successful by numerous previous studies. Nevertheless, despite the inclusion of complex and time-consuming modules in many existing methods, effective information fusion remains a formidable challenge due to severe spatiotemporal misalignment and modality imbalance between visible and thermal images. Thus, this paper intends to lift both the accuracy and speed for RGB-infrared perception. To this end, an illumination-guided attentive feature aggregation model (EMOD) is introduced to achieve Efficient Multispectral Object Detection. Firstly, EMOD employs feature fusion with a local-to-nonlocal cross-modality attention mechanism, which not only mitigates pixel-wise positional variation but also captures context-level complementary information. Furthermore, to address the modality imbalance issue, a signal indicating illumination conditions is implicitly embedded into the aggregation module to guide attentive computation. Unlike previous works, this signal is more potent and practical as it functions by denoting regional lighting conditions and without requiring additional training labels. Comprehensive experiments are conducted on three widely used datasets, including KAIST, CVC-14 and FLIR. Without bells and whistles, EMOD surpasses state-of-the-art approaches in terms of both effectiveness and efficiency. For example, it achieves a 5.96 MR score on KAIST while maintaining a speed of 28 FPS on a low-cost GPU.

KW - Feature fusion

KW - Light estimation

KW - Multispectral

KW - Object detection

KW - Real time

UR - http://www.scopus.com/inward/record.url?scp=85214786921&partnerID=8YFLogxK

U2 - 10.1016/j.inffus.2025.102939

DO - 10.1016/j.inffus.2025.102939

M3 - Article

AN - SCOPUS:85214786921

SN - 1566-2535

VL - 118

JO - Information Fusion

JF - Information Fusion

M1 - 102939

ER -
