Abstract
Most existing methods for video person re-identification apply spatial-temporal global average or attention pooling to aggregate frame-level features into a video-level feature. The resulting video-level feature models only the first-order statistics of the appearance features from the holistic video, which limits the representation capability of the feature network. In this paper, we propose a novel Global Statistic Pooling network (GSPnet) that takes full advantage of second-order information to enhance modeling capability. Firstly, a novel global statistic pooling module is proposed to summarize both the first- and second-order statistics across frame-level features and transform them into a compact and robust video-level feature embedding. Secondly, a statistic-based attention block is incorporated into multiple stages of the convolutional network to fully exploit second-order representations from low- to high-level features. To enhance representation learning and further boost re-identification (re-ID) performance, we also propose a multi-level self-attention distillation training scheme, which squeezes the knowledge learned in the deeper portion of the network into the shallower ones. Extensive experiments demonstrate the effectiveness and superiority of our approach on four popular video person re-ID datasets.
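As a rough illustration of the pooling idea described in the abstract (not the authors' GSPnet module, whose exact embedding is not specified here), the sketch below shows how first-order (mean) and second-order (covariance) statistics can be pooled over frame-level features into a single video-level vector; the function name and tensor shapes are hypothetical.

```python
import torch

def global_statistic_pooling(frame_feats: torch.Tensor) -> torch.Tensor:
    """Pool a sequence of frame-level features into one video-level vector.

    frame_feats: (T, C) tensor, one C-dim appearance feature per frame.
    Returns the concatenation of the first-order statistic (mean) and
    the flattened second-order statistic (covariance matrix).
    """
    mu = frame_feats.mean(dim=0)                          # first-order: (C,)
    centered = frame_feats - mu                           # (T, C)
    cov = centered.t() @ centered / frame_feats.shape[0]  # second-order: (C, C)
    return torch.cat([mu, cov.flatten()])                 # (C + C*C,)

# Toy usage: 8 frames with 16-dim features -> a 16 + 256 = 272-dim embedding.
video_feat = global_statistic_pooling(torch.randn(8, 16))
print(video_feat.shape)  # torch.Size([272])
```

In practice the paper describes a compact embedding of these statistics rather than a raw concatenation, which would grow quadratically in the channel dimension.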
Original language | English |
---|---|
Pages (from-to) | 777-789 |
Number of pages | 13 |
Journal | Neurocomputing |
Volume | 453 |
DOI | |
Publication status | Published - 17 Sep 2021 |