Abstract
By combining visible imagery with thermal sensing data, multispectral object detection enables around-the-clock perception for applications such as autonomous driving. A common approach, shown to be successful by numerous previous studies, treats the infrared input as auxiliary data for cross-modality feature aggregation. Nevertheless, despite the complex and time-consuming modules included in many existing methods, effective information fusion remains a formidable challenge due to severe spatiotemporal misalignment and modality imbalance between visible and thermal images. This paper therefore aims to improve both the accuracy and the speed of RGB-infrared perception. To this end, an illumination-guided attentive feature aggregation model, EMOD, is introduced for Efficient Multispectral Object Detection. First, EMOD fuses features with a local-to-nonlocal cross-modality attention mechanism, which not only mitigates pixel-wise positional variation but also captures context-level complementary information. Furthermore, to address the modality imbalance issue, a signal indicating illumination conditions is implicitly embedded into the aggregation module to guide the attentive computation. Unlike signals used in previous works, it is more potent and practical: it denotes regional lighting conditions and requires no additional training labels. Comprehensive experiments are conducted on three widely used datasets: KAIST, CVC-14, and FLIR. Without bells and whistles, EMOD surpasses state-of-the-art approaches in terms of both effectiveness and efficiency; for example, it achieves a miss rate (MR) of 5.96 on KAIST while running at 28 FPS on a low-cost GPU.
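As a rough illustration of the mechanism the abstract describes, the sketch below shows one plausible way to combine a local and a non-local (context-level) cross-modality attention gate with an illumination-derived weight that balances the two modalities. The module structure, layer choices, and all names (`IlluminationGuidedFusion`, `illum_head`, etc.) are assumptions made for exposition here and are not taken from the EMOD paper.

```python
import torch
import torch.nn as nn


class IlluminationGuidedFusion(nn.Module):
    """Conceptual sketch of illumination-guided cross-modality fusion.

    All design choices below are illustrative assumptions, not the
    authors' actual EMOD implementation.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Local attention: per-pixel gate computed from the concatenated
        # RGB/thermal features (hypothetical design).
        self.local_gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Non-local attention: a global, context-level gate obtained by
        # pooling over the whole feature map (assumption).
        self.context_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Illumination head: predicts a per-region lighting weight in [0, 1]
        # from the RGB feature, with no extra training labels assumed.
        self.illum_head = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        fused_in = torch.cat([rgb, thermal], dim=1)
        # Combine the local (per-pixel) and non-local (global) gates.
        attn = self.local_gate(fused_in) * self.context_gate(fused_in)
        # High illumination weight -> rely more on RGB; low -> rely on thermal.
        w = self.illum_head(rgb)
        return attn * (w * rgb + (1.0 - w) * thermal)


# Example usage on dummy feature maps (batch 2, 64 channels, 32x32):
if __name__ == "__main__":
    fusion = IlluminationGuidedFusion(channels=64)
    rgb_feat = torch.randn(2, 64, 32, 32)
    thermal_feat = torch.randn(2, 64, 32, 32)
    out = fusion(rgb_feat, thermal_feat)
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```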
| Original language | English |
| --- | --- |
| Article number | 102939 |
| Journal | Information Fusion |
| Volume | 118 |
| DOI | |
| Publication status | Published - June 2025 |