Abstract
Most existing methods for video person re-identification (re-ID) apply spatial-temporal global average pooling or attention pooling to aggregate frame-level features into a video-level feature. The resulting video-level feature models only the first-order statistics of the appearance features over the whole video, which limits the representation capability of the feature network. In this paper, we propose a novel Global Statistic Pooling network (GSPnet), which takes full advantage of second-order information to enhance modeling capability. Firstly, a novel global statistic pooling module is proposed to summarize both the first- and second-order statistics across frame-level features and then transform them into a compact and robust video-level feature embedding. Secondly, a statistic-based attention block is incorporated into multiple stages of the convolutional network to fully explore second-order representations from low- to high-level features. To enhance the representation learning ability and further boost re-ID performance, we also propose a multi-level self-attention distillation training scheme, which squeezes the knowledge learned in the deeper portions of the network into the shallower ones. Extensive experiments on four popular video person re-ID datasets demonstrate the effectiveness and superiority of our approach.
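The difference between first-order-only pooling and pooling that also keeps second-order statistics can be illustrated with a minimal NumPy sketch. It is not the GSPnet module itself, whose design the abstract does not specify. It assumes the second-order statistic is the per-dimension standard deviation across frames and that the two statistics are simply concatenated. The function name `global_statistic_pool` and the toy inputs are hypothetical.

```python
import numpy as np

def global_statistic_pool(frame_features, eps=1e-8):
    """Pool (T, D) frame-level features into one (2*D,) video-level vector.

    Assumption: the second-order statistic is the per-dimension standard
    deviation over the T frames. This is an illustrative choice only.
    """
    frame_features = np.asarray(frame_features, dtype=np.float64)
    mean = frame_features.mean(axis=0)   # first-order statistic over frames
    var = frame_features.var(axis=0)     # population variance over frames
    std = np.sqrt(var + eps)             # second-order statistic, eps for stability
    return np.concatenate([mean, std])

# Two toy "videos" with 2 frames and 2 feature dimensions each.
# Both have the same per-dimension mean, so average pooling alone
# cannot tell them apart.
steady = np.array([[2.0, 4.0], [2.0, 4.0]])
varying = np.array([[1.0, 2.0], [3.0, 6.0]])

pooled_steady = global_statistic_pool(steady)
pooled_varying = global_statistic_pool(varying)
```

In this toy case the first halves of `pooled_steady` and `pooled_varying` (the means) coincide, while the second halves (the standard deviations) differ, which is the kind of information that first-order pooling discards.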
Original language | English |
---|---|
Pages (from-to) | 777-789 |
Number of pages | 13 |
Journal | Neurocomputing |
Volume | 453 |
Publication status | Published - 17 Sept 2021 |
Keywords
- Attention mechanism
- Higher-order pooling
- Person re-identification
- Video re-identification