基于异构哈希网络的跨模态人脸检索方法

Zhen Dong; Ming Tao Pei

doi:10.11897/SP.J.1016.2019.00073

Translated title of the contribution: Cross-Modality Face Retrieval Based on Heterogeneous Hashing Network

Zhen Dong, Ming Tao Pei^*

^*Corresponding author for this work

School of Computer Science and Technology

Beijing Institute of Technology

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

Cross-modality face retrieval aims to retrieve faces of a particular person in one modality given his or her face information in another modality, for example retrieving video shots that contain a particular person given one image of that person (query-by-image video retrieval), or retrieving the face images of a person using his or her video clip as the query (query-by-video image retrieval). It is an important problem in computer vision with a wide range of applications. Take criminal investigation as an example. The "query-by-image video retrieval" task plays an important role in rapidly locating and tracking a suspect in large volumes of surveillance video, using the suspect's ID card, passport, or driver's license photo as the query. The "query-by-video image retrieval" task helps to determine the identity of an unknown suspect by searching a huge mug-shot image database with a video shot of the suspect taken by surveillance cameras at the crime scene. In this paper, we present a cross-modality face retrieval method that uses a heterogeneous hashing network to generate effective and compact hash representations for both face images and face videos. The network contains an image branch and a video branch, which project face images and face videos, respectively, into a common discriminative space. Each branch is equipped with two modules: a feature extractor module and a non-linear mapping module. The feature extractor modules aim to represent face images or videos via appropriate features, and the non-linear mapping modules are designed to transform the heterogeneous image and video feature spaces into the common space. In the common space, the similarity between a face image and a face video can be measured by the distance between their corresponding discriminative features, but these features are still high-dimensional vectors of floating-point numbers, which cannot satisfy the low computation and storage requirements of the retrieval task.
Non-linear hash functions are therefore learned in the common space to obtain the corresponding binary hash representations. To ensure the compatibility and effectiveness of the branches and the hash functions, the heterogeneous hashing network is trained with three loss functions: Fisher loss, softmax loss, and triplet ranking loss. Our Fisher loss uses the difference form of the inter-class and intra-class scatter, in which the mean vectors are learnable variables; this makes it suitable for mini-batch-based optimization. The Fisher loss and the softmax loss are jointly exploited to enhance the discriminative power of the common space. The triplet ranking loss is imposed on the final binary space to improve retrieval performance. Experiments on a large-scale face video dataset and two challenging TV-series datasets demonstrate the effectiveness of the proposed method. The contributions of the paper are threefold: (1) We propose an effective cross-modality face retrieval method based on the heterogeneous hashing network. Our network is able to generate isomorphic, discriminative, and compact binary representations of both face images and videos. (2) The proposed heterogeneous hashing network provides a general framework for deep-learning-based cross-modality hashing methods and can be easily adopted in many other cross-modality retrieval tasks. (3) The proposed method achieves excellent face retrieval results across image and video modalities on a large-scale face video dataset and two challenging TV-series datasets.
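The retrieval step and two of the losses named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the zero-threshold binarization (standing in for the learned non-linear hash functions), the squared Euclidean distance in the triplet loss, and the `weight` and `margin` parameters are all assumptions made for illustration. The abstract only states that the Fisher loss takes a difference form of inter-class and intra-class scatter with learnable mean vectors.

```python
import numpy as np


def binarize(features):
    """Threshold real-valued features at zero to obtain {0, 1} hash bits.

    Assumption: a plain sign threshold stands in for the learned hash functions.
    """
    return (np.asarray(features) > 0).astype(np.uint8)


def hamming_distances(query_code, database_codes):
    """Count differing bits between one query code and every database code."""
    return np.count_nonzero(database_codes != query_code, axis=1)


def retrieve(query_code, database_codes, top_k=3):
    """Return indices of the top_k database codes nearest in Hamming distance."""
    distances = hamming_distances(query_code, database_codes)
    return np.argsort(distances, kind="stable")[:top_k]


def fisher_loss(features, labels, class_means, weight=1.0):
    """Difference-form Fisher loss: intra-class scatter minus weighted inter-class scatter.

    class_means plays the role of the learnable mean vectors; here it is
    simply passed in as an array (hypothetical formulation).
    """
    global_mean = class_means.mean(axis=0)
    intra = np.mean(np.sum((features - class_means[labels]) ** 2, axis=1))
    inter = np.mean(np.sum((class_means - global_mean) ** 2, axis=1))
    return intra - weight * inter


def triplet_ranking_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss pushing the positive closer to the anchor than the negative."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, margin + d_pos - d_neg)


if __name__ == "__main__":
    # An image feature (query) and three video features (database) in the common space.
    query = binarize([0.5, -0.2, 0.1, -0.9])
    database = binarize([[0.3, -0.1, 0.7, -0.4],
                         [-0.6, 0.2, -0.3, 0.8],
                         [0.9, -0.5, -0.2, -0.1]])
    print(retrieve(query, database, top_k=2))
```

Because the codes are binary, comparing a query against the database reduces to counting differing bits, which is what makes the hash representation cheaper to store and search than the floating-point common-space features.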

Translated title of the contribution	Cross-Modality Face Retrieval Based on Heterogeneous Hashing Network
Original language	Chinese (Traditional)
Pages (from-to)	73-84
Number of pages	12
Journal	Jisuanji Xuebao/Chinese Journal of Computers
Volume	42
Issue number	1
DOIs	https://doi.org/10.11897/SP.J.1016.2019.00073
Publication status	Published - 1 Jan 2019

Cite this

@article{a960babc91aa4b62a91c722e75c88814,

title = "基于异构哈希网络的跨模态人脸检索方法",

keywords = "Cross-modality, Deep learning, Face retrieval, Heterogeneous hashing network, Loss function",

author = "Zhen Dong and Pei, {Ming Tao}",

year = "2019",

month = jan,

day = "1",

doi = "10.11897/SP.J.1016.2019.00073",

language = "Chinese (Traditional)",

volume = "42",

pages = "73--84",

journal = "Jisuanji Xuebao/Chinese Journal of Computers",

issn = "0254-4164",

publisher = "Science Press",

number = "1",

}

TY - JOUR

T1 - 基于异构哈希网络的跨模态人脸检索方法

AU - Dong, Zhen

AU - Pei, Ming Tao

PY - 2019/1/1

Y1 - 2019/1/1

KW - Cross-modality

KW - Deep learning

KW - Face retrieval

KW - Heterogeneous hashing network

KW - Loss function

UR - http://www.scopus.com/inward/record.url?scp=85064515574&partnerID=8YFLogxK

U2 - 10.11897/SP.J.1016.2019.00073

DO - 10.11897/SP.J.1016.2019.00073

M3 - Article

AN - SCOPUS:85064515574

SN - 0254-4164

VL - 42

SP - 73

EP - 84

JO - Jisuanji Xuebao/Chinese Journal of Computers

JF - Jisuanji Xuebao/Chinese Journal of Computers

IS - 1

ER -
