Visual tracking using transformer with a combination of convolution and attention

Yuxuan Wang, Liping Yan*, Zihang Feng, Yuanqing Xia, Bo Xiao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

For Siamese-based trackers in the field of single object tracking, the cross-correlation operation plays an important role. However, cross-correlation essentially uses the target feature to locally and linearly match the search region, which leads to insufficient utilization or even loss of feature information. To effectively exploit global context and fully explore the relevance between the template and the search region, a novel matching operator inspired by the Transformer is designed, which uses multi-head attention and embeds a designed modulation module across the inputs of the operator. Meanwhile, we equip our tracker with a multi-scale encoder/decoder strategy to gradually achieve more precise tracking. Finally, a complete tracking framework named VTTR is presented. The tracker consists of a feature extractor, a multi-scale encoder based on depth-wise convolution, a modified decoder serving as the matching operator, and a prediction head. The proposed tracker is tested on multiple benchmarks and achieves excellent performance while running at a fast speed.
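To illustrate the contrast the abstract draws between local cross-correlation matching and a global attention-based matching operator, below is a minimal PyTorch sketch. It is not the authors' VTTR code: the names (depthwise_xcorr, AttentionMatcher), the feature dimensions, and the residual/normalization details are all illustrative assumptions, and the modulation module and multi-scale encoder/decoder of the paper are omitted.

```python
# Hypothetical sketch (not the authors' implementation): contrasts
# depth-wise cross-correlation with an attention-based matching operator
# in which every search location attends to the whole template.
import torch
import torch.nn as nn
import torch.nn.functional as F


def depthwise_xcorr(search, template):
    """Classic Siamese matching: the template acts as a per-channel kernel.

    search:   (B, C, Hs, Ws) search-region features
    template: (B, C, Ht, Wt) template features
    returns:  (B, C, Hs-Ht+1, Ws-Wt+1) response map
    """
    b, c, h, w = search.shape
    # Fold batch into groups so each template correlates with its own search map.
    out = F.conv2d(search.reshape(1, b * c, h, w),
                   template.reshape(b * c, 1, *template.shape[2:]),
                   groups=b * c)
    return out.reshape(b, c, out.shape[2], out.shape[3])


class AttentionMatcher(nn.Module):
    """Cross-attention matching: search tokens query template tokens, so the
    matching is global rather than a local linear correlation (an illustrative
    stand-in for a decoder-style matching operator)."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, search, template):
        # Flatten spatial maps to token sequences of shape (B, H*W, C).
        b, c, hs, ws = search.shape
        q = search.flatten(2).transpose(1, 2)
        kv = template.flatten(2).transpose(1, 2)
        fused, _ = self.attn(q, kv, kv)   # search queries, template keys/values
        fused = self.norm(q + fused)      # residual connection + layer norm
        return fused.transpose(1, 2).reshape(b, c, hs, ws)


if __name__ == "__main__":
    search = torch.randn(2, 256, 32, 32)     # search-region features
    template = torch.randn(2, 256, 8, 8)     # template features
    print(depthwise_xcorr(search, template).shape)        # (2, 256, 25, 25)
    print(AttentionMatcher(256)(search, template).shape)  # (2, 256, 32, 32)
```

Note how the correlation output shrinks to a small response map, whereas the attention-based operator returns a fused feature map at the full search resolution that a prediction head can consume directly.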

Original language: English
Article number: 104760
Journal: Image and Vision Computing
Volume: 137
DOI
Publication status: Published - Sep 2023
