TY - JOUR
T1 - Siamese Visual Tracking with Multi-Parallel Interactive Transformers
AU - Wang, Wuwei
AU - Lv, Meibo
AU - Zhu, Lin
AU - Han, Tuo
AU - Zhang, Yi
AU - Li, Yuanqing
N1 - Publisher Copyright: © 1991-2012 IEEE.
PY - 2025
Y1 - 2025
AB - In recent years, Siamese network-based visual tracking methods have gained popularity and success in terms of efficiency and accuracy. However, typical Siamese trackers utilize two independent weight-sharing streams to describe the exemplar and search region without any interaction between the two streams. As a result, such trackers employ only shallow cross-correlation or correlation filters to obtain the final information association, which neglects the deep interaction between the exemplar and search region and may reduce the discriminative power of the trackers. To address this issue, we propose a novel multi-parallel interactive transformer-based (MPIT) tracking framework to introduce sufficient interaction so that the two streams can guide the prediction heads to focus on the target more easily. Unlike recent one-stream transformer-based trackers that directly concatenate template and search tokens to perform joint feature learning, our multi-parallel interactive framework introduces a transmission band module to deliver global information for both the exemplar and the search region with low computational cost. Moreover, to integrate dynamic information, we incorporate temporal level extraction into the tracking framework to increase the variety of the templates. The experimental results show that the proposed MPIT method achieves a remarkable tracking speed of 136 frames per second (FPS) while attaining performance better than or comparable to that of state-of-the-art trackers.
KW - interaction information
KW - multi-parallel transformers
KW - Siamese networks
KW - Visual tracking
UR - http://www.scopus.com/inward/record.url?scp=105005353843&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2025.3569633
DO - 10.1109/TCSVT.2025.3569633
M3 - Article
AN - SCOPUS:105005353843
SN - 1051-8215
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
M1 - 0b00006493f017d9
ER -