Occlusion and Deformation Handling Visual Tracking for UAV via Attention-Based Mask Generative Network

Yashuo Bai; Yong Song; Yufei Zhao; Ya Zhou; Xiyan Wu; Yuxin He; Zishuo Zhang; Xin Yang; Qun Hao

doi:10.3390/rs14194756

Occlusion and Deformation Handling Visual Tracking for UAV via Attention-Based Mask Generative Network

Yashuo Bai, Yong Song, Yufei Zhao^*, Ya Zhou, Xiyan Wu, Yuxin He, Zishuo Zhang, Xin Yang, Qun Hao

^*此作品的通讯作者

Beijing Institute of Technology

科研成果: 期刊稿件 › 文章 › 同行评审

7 引用（Scopus）

摘要

Although the performance of unmanned aerial vehicle (UAV) tracking has benefited from the successful application of discriminative correlation filters (DCF) and convolutional neural networks (CNNs), UAV tracking under occlusion and deformation remains a challenge. The main dilemma is that challenging scenes, such as occlusion or deformation, are very complex and changeable, making it difficult to obtain training data covering all situations, resulting in trained networks that may be confused by new contexts that differ from historical information. Data-driven strategies are the main direction of current solutions, but gathering large-scale datasets with object instances under various occlusion and deformation conditions is difficult and lacks diversity. This paper proposes an attention-based mask generation network (AMGN) for UAV-specific tracking, which combines the attention mechanism and adversarial learning to improve the tracker’s ability to handle occlusion and deformation. After the base CNN extracts the deep features of the candidate region, a series of masks are determined by the spatial attention module and sent to the generator, and the generator discards some features according to these masks to simulate the occlusion and deformation of the object, producing more hard positive samples. The discriminator seeks to distinguish these hard positive samples while guiding mask generation. Such adversarial learning can effectively complement occluded and deformable positive samples in the feature space, allowing to capture more robust features to distinguish objects from backgrounds. Comparative experiments show that our AMGN-based tracker achieves the highest area under curve (AUC) of 0.490 and 0.349, and the highest precision scores of 0.742 and 0.662, on the UAV123 tracking benchmark with partial and full occlusion attributes, respectively. It also achieves the highest AUC of 0.555 and the highest precision score of 0.797 on the DTB70 tracking benchmark with the deformation attribute. On the UAVDT tracking benchmark with the large occlusion attribute, it achieves the highest AUC of 0.407 and the highest precision score of 0.582.

源语言	英语
文章编号	4756
期刊	Remote Sensing
卷	14
期	19
DOI	https://doi.org/10.3390/rs14194756
出版状态	已出版 - 10月 2022

访问文件

10.3390/rs14194756

其它文件与链接

链接到 Scopus 的出版物

引用此

@article{5c9807f273b24f0bbf796d0e68b848eb,

title = "Occlusion and Deformation Handling Visual Tracking for UAV via Attention-Based Mask Generative Network",

abstract = "Although the performance of unmanned aerial vehicle (UAV) tracking has benefited from the successful application of discriminative correlation filters (DCF) and convolutional neural networks (CNNs), UAV tracking under occlusion and deformation remains a challenge. The main dilemma is that challenging scenes, such as occlusion or deformation, are very complex and changeable, making it difficult to obtain training data covering all situations, resulting in trained networks that may be confused by new contexts that differ from historical information. Data-driven strategies are the main direction of current solutions, but gathering large-scale datasets with object instances under various occlusion and deformation conditions is difficult and lacks diversity. This paper proposes an attention-based mask generation network (AMGN) for UAV-specific tracking, which combines the attention mechanism and adversarial learning to improve the tracker{\textquoteright}s ability to handle occlusion and deformation. After the base CNN extracts the deep features of the candidate region, a series of masks are determined by the spatial attention module and sent to the generator, and the generator discards some features according to these masks to simulate the occlusion and deformation of the object, producing more hard positive samples. The discriminator seeks to distinguish these hard positive samples while guiding mask generation. Such adversarial learning can effectively complement occluded and deformable positive samples in the feature space, allowing to capture more robust features to distinguish objects from backgrounds. Comparative experiments show that our AMGN-based tracker achieves the highest area under curve (AUC) of 0.490 and 0.349, and the highest precision scores of 0.742 and 0.662, on the UAV123 tracking benchmark with partial and full occlusion attributes, respectively. It also achieves the highest AUC of 0.555 and the highest precision score of 0.797 on the DTB70 tracking benchmark with the deformation attribute. On the UAVDT tracking benchmark with the large occlusion attribute, it achieves the highest AUC of 0.407 and the highest precision score of 0.582.",

keywords = "adversarial learning, attention mechanism, convolutional neural network, unmanned aerial vehicle, visual object tracking",

author = "Yashuo Bai and Yong Song and Yufei Zhao and Ya Zhou and Xiyan Wu and Yuxin He and Zishuo Zhang and Xin Yang and Qun Hao",

note = "Publisher Copyright: {\textcopyright} 2022 by the authors.",

year = "2022",

month = oct,

doi = "10.3390/rs14194756",

language = "English",

volume = "14",

journal = "Remote Sensing",

issn = "2072-4292",

publisher = "Multidisciplinary Digital Publishing Institute (MDPI)",

number = "19",

}

TY - JOUR

T1 - Occlusion and Deformation Handling Visual Tracking for UAV via Attention-Based Mask Generative Network

AU - Bai, Yashuo

AU - Song, Yong

AU - Zhao, Yufei

AU - Zhou, Ya

AU - Wu, Xiyan

AU - He, Yuxin

AU - Zhang, Zishuo

AU - Yang, Xin

AU - Hao, Qun

PY - 2022/10

Y1 - 2022/10

N2 - Although the performance of unmanned aerial vehicle (UAV) tracking has benefited from the successful application of discriminative correlation filters (DCF) and convolutional neural networks (CNNs), UAV tracking under occlusion and deformation remains a challenge. The main dilemma is that challenging scenes, such as occlusion or deformation, are very complex and changeable, making it difficult to obtain training data covering all situations, resulting in trained networks that may be confused by new contexts that differ from historical information. Data-driven strategies are the main direction of current solutions, but gathering large-scale datasets with object instances under various occlusion and deformation conditions is difficult and lacks diversity. This paper proposes an attention-based mask generation network (AMGN) for UAV-specific tracking, which combines the attention mechanism and adversarial learning to improve the tracker’s ability to handle occlusion and deformation. After the base CNN extracts the deep features of the candidate region, a series of masks are determined by the spatial attention module and sent to the generator, and the generator discards some features according to these masks to simulate the occlusion and deformation of the object, producing more hard positive samples. The discriminator seeks to distinguish these hard positive samples while guiding mask generation. Such adversarial learning can effectively complement occluded and deformable positive samples in the feature space, allowing to capture more robust features to distinguish objects from backgrounds. Comparative experiments show that our AMGN-based tracker achieves the highest area under curve (AUC) of 0.490 and 0.349, and the highest precision scores of 0.742 and 0.662, on the UAV123 tracking benchmark with partial and full occlusion attributes, respectively. It also achieves the highest AUC of 0.555 and the highest precision score of 0.797 on the DTB70 tracking benchmark with the deformation attribute. On the UAVDT tracking benchmark with the large occlusion attribute, it achieves the highest AUC of 0.407 and the highest precision score of 0.582.

AB - Although the performance of unmanned aerial vehicle (UAV) tracking has benefited from the successful application of discriminative correlation filters (DCF) and convolutional neural networks (CNNs), UAV tracking under occlusion and deformation remains a challenge. The main dilemma is that challenging scenes, such as occlusion or deformation, are very complex and changeable, making it difficult to obtain training data covering all situations, resulting in trained networks that may be confused by new contexts that differ from historical information. Data-driven strategies are the main direction of current solutions, but gathering large-scale datasets with object instances under various occlusion and deformation conditions is difficult and lacks diversity. This paper proposes an attention-based mask generation network (AMGN) for UAV-specific tracking, which combines the attention mechanism and adversarial learning to improve the tracker’s ability to handle occlusion and deformation. After the base CNN extracts the deep features of the candidate region, a series of masks are determined by the spatial attention module and sent to the generator, and the generator discards some features according to these masks to simulate the occlusion and deformation of the object, producing more hard positive samples. The discriminator seeks to distinguish these hard positive samples while guiding mask generation. Such adversarial learning can effectively complement occluded and deformable positive samples in the feature space, allowing to capture more robust features to distinguish objects from backgrounds. Comparative experiments show that our AMGN-based tracker achieves the highest area under curve (AUC) of 0.490 and 0.349, and the highest precision scores of 0.742 and 0.662, on the UAV123 tracking benchmark with partial and full occlusion attributes, respectively. It also achieves the highest AUC of 0.555 and the highest precision score of 0.797 on the DTB70 tracking benchmark with the deformation attribute. On the UAVDT tracking benchmark with the large occlusion attribute, it achieves the highest AUC of 0.407 and the highest precision score of 0.582.

KW - adversarial learning

KW - attention mechanism

KW - convolutional neural network

KW - unmanned aerial vehicle

KW - visual object tracking

UR - http://www.scopus.com/inward/record.url?scp=85140008007&partnerID=8YFLogxK

U2 - 10.3390/rs14194756

DO - 10.3390/rs14194756

M3 - Article

AN - SCOPUS:85140008007

SN - 2072-4292

VL - 14

JO - Remote Sensing

JF - Remote Sensing

IS - 19

M1 - 4756

ER -

Occlusion and Deformation Handling Visual Tracking for UAV via Attention-Based Mask Generative Network

摘要

访问文件

其它文件与链接

指纹

引用此