TY - JOUR
T1 - MCSSAFNet
T2 - A multi-scale state-space attention fusion network for RGBT tracking
AU - Zhao, Chunbo
AU - Mo, Bo
AU - Li, Dawei
AU - Wang, Xinchun
AU - Zhao, Jie
AU - Xu, Junwei
N1 - Publisher Copyright:
© 2024 Elsevier B.V.
PY - 2025/3
Y1 - 2025/3
N2 - Most cross-modal feature fusion methods use only the deep features from the last layer of the backbone network as fusion inputs, discarding the detailed information carried by the backbone's shallow features; this limits the model's ability to cope with rapid target changes in cross-modal images. To address this problem, this paper proposes a novel tracker built on a Multi-scale State-Space Attention Fusion Network (MCSSAFNet), which introduces Mamba to learn and fuse the feature information of the two modalities at multiple scales. On this basis, an adaptive-aware loss function is proposed. The classification loss is first weighted adaptively to reduce the imbalance between classification and localization scores, directing more learning attention to hard samples and improving the discrimination of difficult targets. The IoU loss is then weighted adaptively to strengthen the learning of high-quality samples while still improving on low-quality ones, which in turn raises the model's IoU accuracy. Comprehensive experiments on four mainstream public RGBT tracking datasets (RGBT210, RGBT234, LasHeR, and VTUAV) show that the proposed tracker outperforms existing algorithms while running at 37 fps on an RTX 3090 GPU.
AB - Most cross-modal feature fusion methods use only the deep features from the last layer of the backbone network as fusion inputs, discarding the detailed information carried by the backbone's shallow features; this limits the model's ability to cope with rapid target changes in cross-modal images. To address this problem, this paper proposes a novel tracker built on a Multi-scale State-Space Attention Fusion Network (MCSSAFNet), which introduces Mamba to learn and fuse the feature information of the two modalities at multiple scales. On this basis, an adaptive-aware loss function is proposed. The classification loss is first weighted adaptively to reduce the imbalance between classification and localization scores, directing more learning attention to hard samples and improving the discrimination of difficult targets. The IoU loss is then weighted adaptively to strengthen the learning of high-quality samples while still improving on low-quality ones, which in turn raises the model's IoU accuracy. Comprehensive experiments on four mainstream public RGBT tracking datasets (RGBT210, RGBT234, LasHeR, and VTUAV) show that the proposed tracker outperforms existing algorithms while running at 37 fps on an RTX 3090 GPU.
KW - Adaptive-aware loss
KW - Mamba
KW - Multiscale fusion
KW - RGBT tracking
KW - State space modeling
UR - http://www.scopus.com/inward/record.url?scp=85211752130&partnerID=8YFLogxK
U2 - 10.1016/j.optcom.2024.131394
DO - 10.1016/j.optcom.2024.131394
M3 - Article
AN - SCOPUS:85211752130
SN - 0030-4018
VL - 577
JO - Optics Communications
JF - Optics Communications
M1 - 131394
ER -