TY - JOUR
T1 - Hardest and semi-hard negative pairs mining for text-based person search with visual–textual attention
AU - Ge, Jing
AU - Wang, Qianxiang
AU - Gao, Guangyu
N1 - Publisher Copyright:
© 2022, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.
PY - 2023/10
Y1 - 2023/10
N2 - Searching for persons in large-scale image databases with a natural-language query is a practical and important application in video surveillance. Intuitively, the core issue in person search is the visual–textual association, which remains an extremely challenging task due to the contradiction between the high abstraction of textual descriptions and the intuitive expression of visual images. In this paper, aiming for more consistent visual–textual features and better inter-class discriminative ability, we propose a text-based person search approach with visual–textual attention built on hardest and semi-hard negative pair mining. First, for the visual and textual attention, we design Smoothed Global Maximum Pooling (SGMP) to extract more concentrated visual features, as well as a memory attention based on the LSTM cell unit for stricter correspondence matching. Second, since only positive pairs are labeled, more valuable negative pairs are mined by defining the cross-modality hardest and semi-hard negative pairs. We then combine a triplet loss on the single modality with the hardest negative pairs and a cross-entropy loss across modalities with both the hardest and semi-hard negative pairs to train the whole network. Finally, to evaluate the effectiveness and feasibility of the proposed approach, we conduct extensive experiments on the standard person search dataset CUHK-PEDES, on which our approach achieves satisfactory performance, e.g., a top-1 accuracy of 55.32%. Besides, we also evaluate the semi-hard pair mining method on the COCO Caption dataset and validate its effectiveness and complementarity.
AB - Searching for persons in large-scale image databases with a natural-language query is a practical and important application in video surveillance. Intuitively, the core issue in person search is the visual–textual association, which remains an extremely challenging task due to the contradiction between the high abstraction of textual descriptions and the intuitive expression of visual images. In this paper, aiming for more consistent visual–textual features and better inter-class discriminative ability, we propose a text-based person search approach with visual–textual attention built on hardest and semi-hard negative pair mining. First, for the visual and textual attention, we design Smoothed Global Maximum Pooling (SGMP) to extract more concentrated visual features, as well as a memory attention based on the LSTM cell unit for stricter correspondence matching. Second, since only positive pairs are labeled, more valuable negative pairs are mined by defining the cross-modality hardest and semi-hard negative pairs. We then combine a triplet loss on the single modality with the hardest negative pairs and a cross-entropy loss across modalities with both the hardest and semi-hard negative pairs to train the whole network. Finally, to evaluate the effectiveness and feasibility of the proposed approach, we conduct extensive experiments on the standard person search dataset CUHK-PEDES, on which our approach achieves satisfactory performance, e.g., a top-1 accuracy of 55.32%. Besides, we also evaluate the semi-hard pair mining method on the COCO Caption dataset and validate its effectiveness and complementarity.
KW - Attention
KW - Hard example mining
KW - Person search
KW - Visual–textual association
UR - http://www.scopus.com/inward/record.url?scp=85127254075&partnerID=8YFLogxK
U2 - 10.1007/s00530-022-00914-w
DO - 10.1007/s00530-022-00914-w
M3 - Article
AN - SCOPUS:85127254075
SN - 0942-4962
VL - 29
SP - 3081
EP - 3093
JO - Multimedia Systems
JF - Multimedia Systems
IS - 5
ER -