Visual tracking using transformer with a combination of convolution and attention

Yuxuan Wang, Liping Yan*, Zihang Feng, Yuanqing Xia, Bo Xiao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

For Siamese-based trackers in the field of single object tracking, the cross-correlation operation plays an important role. However, cross-correlation essentially uses the target feature to locally and linearly match the search region, which leads to insufficient utilization or even loss of feature information. To effectively exploit global context and fully explore the relevance between the template and the search region, a novel matching operator inspired by the Transformer is designed, which uses multi-head attention and embeds a designed modulation module across the inputs of the operator. Meanwhile, we equip our tracker with a multi-scale encoder/decoder strategy to gradually achieve more precise tracking. Finally, a complete tracking framework named VTTR is presented. The tracker consists of a feature extractor, a multi-scale encoder based on depth-wise convolution, a modified decoder serving as the matching operator, and a prediction head. The proposed tracker is tested on multiple benchmarks and achieves excellent performance while running at a fast speed.
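To illustrate the contrast the abstract draws between local cross-correlation matching and a global attention-based matching operator, below is a minimal PyTorch sketch. It is not the authors' VTTR code: the names (depthwise_xcorr, AttentionMatcher), the feature dimensions, and the residual/normalization details are all illustrative assumptions, and the modulation module and multi-scale encoder/decoder of the paper are omitted.

```python
# Hypothetical sketch (not the authors' implementation): contrasts
# depth-wise cross-correlation with an attention-based matching operator
# in which every search location attends to the whole template.
import torch
import torch.nn as nn
import torch.nn.functional as F


def depthwise_xcorr(search, template):
    """Classic Siamese matching: the template acts as a per-channel kernel.

    search:   (B, C, Hs, Ws) search-region features
    template: (B, C, Ht, Wt) template features
    returns:  (B, C, Hs-Ht+1, Ws-Wt+1) response map
    """
    b, c, h, w = search.shape
    # Fold batch into groups so each template correlates with its own search map.
    out = F.conv2d(search.reshape(1, b * c, h, w),
                   template.reshape(b * c, 1, *template.shape[2:]),
                   groups=b * c)
    return out.reshape(b, c, out.shape[2], out.shape[3])


class AttentionMatcher(nn.Module):
    """Cross-attention matching: search tokens query template tokens, so the
    matching is global rather than a local linear correlation (an illustrative
    stand-in for a decoder-style matching operator)."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, search, template):
        # Flatten spatial maps to token sequences of shape (B, H*W, C).
        b, c, hs, ws = search.shape
        q = search.flatten(2).transpose(1, 2)
        kv = template.flatten(2).transpose(1, 2)
        fused, _ = self.attn(q, kv, kv)   # search queries, template keys/values
        fused = self.norm(q + fused)      # residual connection + layer norm
        return fused.transpose(1, 2).reshape(b, c, hs, ws)


if __name__ == "__main__":
    search = torch.randn(2, 256, 32, 32)     # search-region features
    template = torch.randn(2, 256, 8, 8)     # template features
    print(depthwise_xcorr(search, template).shape)        # (2, 256, 25, 25)
    print(AttentionMatcher(256)(search, template).shape)  # (2, 256, 32, 32)
```

Note how the correlation output shrinks to a small response map, whereas the attention-based operator returns a fused feature map at the full search resolution that a prediction head can consume directly.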

Original language: English
Article number: 104760
Journal: Image and Vision Computing
Volume: 137
DOI
Publication status: Published - Sep 2023
