TY - GEN
T1 - Temporal-Aware Visual Object Tracking with Pyramidal Transformer and Adaptive Decoupling
AU - Liang, Yiding
AU - Ma, Bo
AU - Xu, Hao
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - We propose PVTrack, a one-stream visual object tracking algorithm. First, we design a one-stream pyramidal backbone based on the attention mechanism: it processes the template and the search region in parallel to improve the tracker's computational efficiency, and uses attention to establish global contextual information that improves tracking performance. Second, we propose an adaptive decoupled prediction head that applies targeted computation to the different feature levels output by the backbone: the low-level features, which are rich in target shape information, are fused to improve the model's regression accuracy; for the high-level semantic features, which are well suited to classification, classification and regression are decoupled and the high-level features are used alone to improve target localization accuracy. Finally, we introduce a discriminative template updating method and design a template-update threshold function to improve the algorithm's ability to model temporal information. Tests and ablation experiments on multiple datasets verify that the proposed one-stream tracking algorithm with discriminative template updating effectively improves the computational efficiency and robustness of tracking.
KW - decoupling
KW - pyramid
KW - transformer
KW - visual object tracking
UR - http://www.scopus.com/inward/record.url?scp=85206899508&partnerID=8YFLogxK
U2 - 10.1109/AIEA62095.2024.10692929
DO - 10.1109/AIEA62095.2024.10692929
M3 - Conference contribution
AN - SCOPUS:85206899508
T3 - 2024 5th International Conference on Artificial Intelligence and Electromechanical Automation, AIEA 2024
SP - 270
EP - 277
BT - 2024 5th International Conference on Artificial Intelligence and Electromechanical Automation, AIEA 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 5th International Conference on Artificial Intelligence and Electromechanical Automation, AIEA 2024
Y2 - 14 June 2024 through 16 June 2024
ER -