TY - GEN
T1 - Unified Single-Stage Transformer Network for Efficient RGB-T Tracking
AU - Xia, Jianqiang
AU - Shi, Dianxi
AU - Song, Ke
AU - Song, Linna
AU - Wang, Xiaolei
AU - Jin, Songchang
AU - Zhao, Chenran
AU - Cheng, Yu
AU - Jin, Lei
AU - Zhu, Zheng
AU - Li, Jianan
AU - Wang, Gang
AU - Xing, Junliang
AU - Zhao, Jian
N1 - Publisher Copyright:
© 2024 International Joint Conferences on Artificial Intelligence. All rights reserved.
PY - 2024
Y1 - 2024
AB - Most existing RGB-T tracking networks extract modality features separately, without interaction or mutual guidance between modalities. This limits the network's ability to adapt to the diverse dual-modality appearances of targets and to the dynamic relationships between the modalities. Additionally, the three-stage fusion tracking paradigm followed by these networks significantly restricts the tracking speed. To overcome these problems, we propose a unified single-stage Transformer RGB-T tracking network, namely USTrack, which unifies the above three stages into a single ViT (Vision Transformer) backbone through joint feature extraction, fusion, and relation modeling. With this structure, the network can not only extract fused features of templates and search regions under cross-modality interaction, but also significantly improve tracking speed through the single-stage fusion tracking paradigm. Furthermore, we introduce a novel feature selection mechanism based on modality reliability to mitigate the influence of invalid modalities on the final prediction. Extensive experiments on three mainstream RGB-T tracking benchmarks show that our method achieves new state-of-the-art performance while running at the fastest tracking speed of 84.2 FPS. Code is available at https://github.com/xiajianqiang/USTrack.
UR - http://www.scopus.com/inward/record.url?scp=85204308809&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85204308809
T3 - IJCAI International Joint Conference on Artificial Intelligence
SP - 1471
EP - 1479
BT - Proceedings of the 33rd International Joint Conference on Artificial Intelligence, IJCAI 2024
A2 - Larson, Kate
PB - International Joint Conferences on Artificial Intelligence
T2 - 33rd International Joint Conference on Artificial Intelligence, IJCAI 2024
Y2 - 3 August 2024 through 9 August 2024
ER -