Abstract
Most existing Siamese tracking methods follow the overall framework of SiamRPN, adopting its general network architecture and its local, linear cross-correlation operation to integrate search and template features, which restricts the introduction of more sophisticated structures for expressive appearance representation as well as further improvements in tracking performance. Motivated by recent progress in vision Transformers and MLPs, we explore a global, nonlinear, and scale-invariant similarity-measuring mechanism called Dynamic Cross-Attention (DCA). Specifically, template features are first decomposed along the spatial and channel dimensions, and Transformer Encoders are then applied to adaptively excavate long-range feature interdependencies, producing reinforced kernels. As the kernels are successively multiplied with the search feature map, similarity scores between all pixels on the feature maps are estimated at once while the spatial scale of the search features remains constant. Furthermore, we redesign each part of our Siamese network to further remedy the framework limitation with the assistance of DCA. Comprehensive experimental results on large-scale benchmarks indicate that our Siamese method realizes efficient feature extraction, aggregation, refinement, and interaction, outperforming state-of-the-art trackers.
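To make the described mechanism concrete, the following is a minimal NumPy sketch of the kernel-multiplication idea: template features are flattened into spatial tokens, passed through a toy single-head self-attention (standing in for the paper's Transformer Encoders, with identity Q/K/V projections as a simplification), and the resulting "reinforced kernels" are multiplied against the search feature map so every search pixel is scored at once while its spatial size is preserved. All function names and shapes here are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def toy_self_attention(tokens):
    # hypothetical single-head attention with identity Q/K/V projections;
    # tokens: (N, C) -> (N, C)
    d = tokens.shape[-1]
    scores = softmax(tokens @ tokens.T / np.sqrt(d))
    return scores @ tokens

def dca_similarity(template, search):
    # template: (C, Ht, Wt), search: (C, Hs, Ws)
    C, Ht, Wt = template.shape
    Hs, Ws = search.shape[1:]
    # spatial decomposition: each template location becomes a C-dim token
    spatial_tokens = template.reshape(C, Ht * Wt).T     # (Ht*Wt, C)
    kernels = toy_self_attention(spatial_tokens)        # reinforced kernels
    # global similarity: every kernel scores every search pixel at once,
    # and the search spatial scale (Hs, Ws) is unchanged
    sim = kernels @ search.reshape(C, Hs * Ws)          # (Ht*Wt, Hs*Ws)
    return sim.reshape(Ht * Wt, Hs, Ws)

rng = np.random.default_rng(0)
template = rng.standard_normal((8, 3, 3))
search = rng.standard_normal((8, 5, 5))
sim_map = dca_similarity(template, search)
print(sim_map.shape)  # (9, 5, 5): one similarity map per template token
```

Unlike a sliding cross-correlation, each output score here relates a template token to a single search pixel directly, so the comparison is global rather than local and never shrinks the search feature map.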
Original language | English
---|---
Pages (from-to) | 204-214
Number of pages | 11
Journal | Neurocomputing
Volume | 513
DOI |
Publication status | Published - 7 Nov 2022