Unveil the potential of siamese framework for visual tracking

Xin Yang, Yong Song*, Yufei Zhao, Zishuo Zhang, Chenyang Zhao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

Most existing Siamese tracking methods follow the overall framework of SiamRPN, adopting its general network architecture and its local, linear cross-correlation operation for integrating search and template features. This restricts the introduction of more sophisticated structures for expressive appearance representation, as well as further improvements in tracking performance. Motivated by recent progress in vision Transformers and MLPs, we first explore a global, nonlinear and scale-invariant similarity measuring scheme called Dynamic Cross-Attention (DCA). Specifically, template features are first decomposed along the spatial and channel dimensions, and then Transformer Encoders are applied to adaptively excavate long-range feature interdependencies, producing reinforced kernels. As the kernels are successively multiplied with the search feature map, similarity scores between all pixels on the feature maps are estimated at once, while the spatial scale of the search features remains constant. Furthermore, we redesign each part of our Siamese network to further remedy the framework limitation with the assistance of DCA. Comprehensive experimental results on large-scale benchmarks indicate that our Siamese method realizes efficient feature extraction, aggregation, refinement and interaction, outperforming state-of-the-art trackers.
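To make the similarity-measuring idea concrete, the sketch below illustrates the general flow the abstract describes: template features are decomposed spatially into tokens, a self-attention step reinforces them into kernels, and the kernels are multiplied with the flattened search feature map so that all pairwise pixel similarities are obtained in one matrix product, leaving the spatial scale of the search map unchanged. This is a minimal NumPy illustration under assumed shapes and a single unparameterized attention head, not the authors' actual DCA implementation (which uses full Transformer Encoders and both spatial and channel decompositions).

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Simplified single-head attention with Q = K = V = tokens
    # (no learned projections); residual connection "reinforces" tokens.
    scores = softmax(tokens @ tokens.T / np.sqrt(tokens.shape[-1]))
    return tokens + scores @ tokens

def dynamic_cross_attention_sketch(template, search):
    """template: (C, Ht, Wt) feature map; search: (C, Hs, Ws) feature map."""
    C, Ht, Wt = template.shape
    _, Hs, Ws = search.shape
    # spatial decomposition: each template pixel becomes one token/kernel
    kernels = self_attention(template.reshape(C, Ht * Wt).T)   # (Ht*Wt, C)
    # global similarity: every kernel against every search pixel at once
    sim = kernels @ search.reshape(C, Hs * Ws)                 # (Ht*Wt, Hs*Ws)
    # aggregate over kernels; search spatial scale (Hs, Ws) is preserved
    return sim.mean(axis=0).reshape(Hs, Ws)

rng = np.random.default_rng(0)
resp = dynamic_cross_attention_sketch(rng.standard_normal((64, 7, 7)),
                                      rng.standard_normal((64, 31, 31)))
print(resp.shape)  # (31, 31): a dense response map over the search region
```

Unlike a sliding-window cross-correlation, the matrix product relates every template token to every search pixel globally, and the attention step makes the kernels input-dependent (dynamic) rather than fixed.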

Original language: English
Pages (from-to): 204-214
Number of pages: 11
Journal: Neurocomputing
Volume: 513
DOIs
Publication status: Published - 7 Nov 2022

Keywords

  • Siamese network
  • Similarity measuring
  • Transformer
  • Visual tracking
