Research Progress in Fundamental Architecture of Deep Learning-Based Single Object Tracking Method (深度学习单目标跟踪方法的基础架构研究进展)

Tingfa Xu*, Ying Wang, Guokai Shi, Tianhao Li, Jianan Li*

*Corresponding author for this work

Research output: Contribution to journal, article, peer-reviewed

Abstract

Significance: Single object tracking (SOT) is one of the fundamental problems in computer vision and has received extensive attention from researchers and industry professionals worldwide because of its important applications in intelligent video surveillance, human-computer interaction, autonomous driving, military target analysis, and other fields. For a given video sequence, an SOT method must predict the location and size of the target in subsequent frames, accurately and in real time, based on the initial state of the target (usually represented by the target bounding box) in the first frame. Unlike object detection, the target in a tracking task is not restricted to any specific category, and tracking scenes are complex and diverse, involving challenges such as changes in target scale, target occlusion, motion blur, and target disappearance. Therefore, tracking targets accurately, robustly, and in real time is an extremely challenging task. Mainstream object tracking methods can be divided into three categories: discriminative correlation filter (DCF)-based tracking methods, Siamese network-based tracking methods, and Transformer-based tracking methods. Among them, the accuracy and robustness of DCF methods fall far short of practical requirements. Meanwhile, with the advancement of deep learning hardware, the advantage of DCF methods, namely their ability to run in real time on mobile devices, no longer holds. In contrast, deep learning techniques have developed rapidly in recent years with the continuous improvement of computing performance and dataset capacity. In particular, deep learning theory, deep backbone networks, attention mechanisms, and self-supervised learning techniques have played a powerful role in the development of object tracking methods. Deep learning-based SOT methods can make full use of large-scale datasets for end-to-end offline training to achieve real-time, accurate, and robust tracking.
Therefore, we provide an overview of deep learning-based object tracking methods. Some reviews of tracking methods already exist, but they do not cover Transformer-based tracking methods. Building on the existing work, we introduce the latest achievements in the field. In contrast to existing work, we divide tracking methods into two categories according to the type of architecture, i.e., Siamese network-based two-stream tracking methods and Transformer-based one-stream tracking methods. We provide a comprehensive and detailed analysis of these two basic architectures, focusing on their principles, components, limitations, and development directions. In addition, because datasets are the cornerstone of method training and evaluation, we summarize the current mainstream deep learning-based SOT datasets, elaborate on the evaluation methods and metrics used on these datasets, and summarize the performance of various methods on them. Finally, we analyze the future development trends of video target tracking methods from a macro perspective, so as to provide a reference for researchers.

Progress: Deep learning-based target tracking methods can be divided into two categories according to the architecture type, namely the Siamese network-based two-stream tracking method and the Transformer-based one-stream tracking method. The essential difference between the two architectures is that the two-stream method uses a Siamese network-shaped backbone network for feature extraction and a separate feature fusion module for feature fusion, while the one-stream method uses a single-stream backbone network for both feature extraction and fusion. The Siamese network-based two-stream tracking method formulates the tracking task as a similarity matching problem between the target template and the search region and consists of three basic modules: feature extraction, feature fusion, and a tracking head.
The method proceeds as follows: the weight-shared two-stream backbone network extracts the features of the target template and the search region, respectively. The two sets of features are fused for information interaction and fed to the tracking head, which outputs the target position. In subsequent improvements of the method, the feature extraction module has evolved from shallow to deep, the feature fusion module from coarse to fine, and the tracking head from complex to simple. In addition, the performance of the method in complex backgrounds has gradually improved. The Transformer-based one-stream tracking method first splits and flattens the target template and search frame into sequences of patches. These patch features are combined with learnable position embeddings and fed into a Transformer backbone network, which performs feature extraction and feature fusion at the same time. Feature fusion continues throughout the backbone network, so the network outputs target-specific search features. Compared with two-stream networks, one-stream networks are simple in structure and do not require prior knowledge about the task. This task-independent design facilitates the construction of general-purpose neural network architectures for multiple tasks. Meanwhile, pre-training further improves the performance of the one-stream method. Experimental results demonstrate that pre-trained models based on masked image modeling improve the performance of the method.

Conclusions and Prospects: The one-stream tracking method, with its simple structure and powerful learning and modeling capability, is the trend of future target tracking research. Meanwhile, collaborative multi-task tracking, multimodal tracking, scenario-specific target tracking, unsupervised target tracking, and related directions have strong application demand.
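The contrast between the two architectures described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. All names (`embed`, `cross_correlate`, `argmax_2d`, `self_attention`, `one_stream`) are hypothetical. The mean-subtracting "backbone", the plain cross-correlation used as the fusion step, the identity attention projections, and the fixed position embeddings are simplifying assumptions standing in for learned components.

```python
import math

def embed(grid):
    """Stand-in for the weight-shared backbone: subtracts the mean (assumption)."""
    flat = [v for row in grid for v in row]
    mean = sum(flat) / len(flat)
    return [[v - mean for v in row] for row in grid]

def cross_correlate(template, search):
    """Two-stream fusion: slide template features over search features."""
    th, tw = len(template), len(template[0])
    sh, sw = len(search), len(search[0])
    response = []
    for y in range(sh - th + 1):
        row = []
        for x in range(sw - tw + 1):
            row.append(sum(template[i][j] * search[y + i][x + j]
                           for i in range(th) for j in range(tw)))
        response.append(row)
    return response

def argmax_2d(response):
    """Toy tracking head: return (row, col) of the highest response."""
    _, y, x = max((v, y, x) for y, row in enumerate(response)
                  for x, v in enumerate(row))
    return y, x

def self_attention(tokens):
    """Single-head self-attention with identity projections (assumption)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) / total
                    for d in range(dim)])
    return out

def one_stream(template_tokens, search_tokens, pos_embed):
    """One-stream: concatenate both token sequences, add position embeddings,
    and let attention mix them; return only the search-token outputs."""
    joint = [[t + p for t, p in zip(tok, pos)]
             for tok, pos in zip(template_tokens + search_tokens, pos_embed)]
    fused = self_attention(joint)
    return fused[len(template_tokens):]

# Two-stream demo: the same embed() is applied to both inputs (shared weights).
template = [[1.0, 2.0], [3.0, 4.0]]
search = [[0.0] * 6 for _ in range(6)]
for i in range(2):
    for j in range(2):
        search[3 + i][2 + j] = template[i][j]  # place the target at row 3, col 2
response = cross_correlate(embed(template), embed(search))
location = argmax_2d(response)

# One-stream demo: toy 2-dimensional tokens with fixed position embeddings.
template_tokens = [[1.0, 0.0], [0.0, 1.0]]
search_tokens = [[1.0, 1.0], [0.5, 0.0], [0.0, 0.5]]
pos_embed = [[0.01 * i, 0.0] for i in range(5)]
fused = one_stream(template_tokens, search_tokens, pos_embed)
```

In the two-stream demo the response map peaks where the template was placed in the search region, so `location` is `(3, 2)`. In the one-stream demo there is no separate fusion module: template and search tokens interact inside the same attention operation, and only the search-token outputs would be passed on to a tracking head.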

Translated title of the contribution: Research Progress in Fundamental Architecture of Deep Learning-Based Single Object Tracking Method
Original language: Traditional Chinese
Article number: 1510003
Journal: Guangxue Xuebao/Acta Optica Sinica
Volume: 43
Issue number: 15
DOI
Publication status: Published - Aug 2023

Keywords

  • Siamese network
  • Transformer
  • deep learning
  • deep learning-based object tracking
  • single object tracking
