Abstract
Cross-modality face retrieval aims to retrieve faces of a particular person in one modality given that person's face information in another modality. Examples include retrieving video shots that contain a particular person given a single image of that person (query-by-image video retrieval) and retrieving face images of a person using a video clip of that person as the query (query-by-video image retrieval). It is an important problem in computer vision with a wide range of applications. Take criminal investigation as an example. The query-by-image video retrieval task plays an important role in rapidly locating and tracking a suspect in large volumes of surveillance video, using the suspect's ID card, passport, or driver's license photo as the query. The query-by-video image retrieval task helps determine the identity of an unknown suspect by searching a huge mug-shot image database with a video shot of the suspect captured by surveillance cameras at the crime scene.
In this paper, we present a cross-modality face retrieval method that uses a heterogeneous hashing network to generate effective and compact hash representations for both face images and face videos. The network contains an image branch and a video branch, which project face images and face videos, respectively, into a common discriminative space. Each branch is equipped with two modules: a feature extractor module and a non-linear mapping module. The feature extractor modules represent face images or videos with appropriate features, and the non-linear mapping modules transform the heterogeneous image and video feature spaces into the common space. In the common space, the similarity between a face image and a face video can be measured by the distance between their corresponding discriminative features. However, these features are still high-dimensional floating-point vectors, which cannot meet the low computation and storage requirements of the retrieval task. Non-linear hash functions are therefore learned in the common space to obtain the corresponding binary hash representations. To ensure the compatibility and effectiveness of the branches and the hash functions, the heterogeneous hashing network is trained with three loss functions: Fisher loss, softmax loss, and triplet ranking loss. Our Fisher loss uses the difference form of the inter-class and intra-class scatter, in which the mean vectors are learnable variables, making it suitable for mini-batch-based optimization. The Fisher loss and the softmax loss are jointly exploited to enhance the discriminative power of the common space. The triplet ranking loss is imposed on the final binary space to improve retrieval performance.
Experiments on a large-scale face video dataset and two challenging TV-series datasets demonstrate the effectiveness of the proposed method. The contributions of the paper are threefold: (1) We propose an effective cross-modality face retrieval method based on the heterogeneous hashing network. Our network generates isomorphic, discriminative, and compact binary representations of both face images and videos. (2) The proposed heterogeneous hashing network provides a general framework for deep-learning-based cross-modality hashing methods and can be easily adopted in many other cross-modality retrieval tasks. (3) The proposed method achieves excellent face retrieval results across the image and video modalities on a large-scale face video dataset and two challenging TV-series datasets.
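The abstract does not give the loss formulas, so the following NumPy sketch is only one plausible reading of the quantities it names. The function names, the `weight` and `margin` parameters, the squared Euclidean distances, and the sign-based binarization are all assumptions made for illustration; the paper's actual definitions may differ. The sketch shows a Fisher loss in difference form (intra-class scatter minus weighted inter-class scatter, with class means passed in as variables), a triplet ranking loss, and Hamming-distance ranking over binary codes.

```python
import numpy as np

def fisher_difference_loss(features, labels, class_means, weight=1.0):
    """Assumed difference form: intra-class scatter minus weighted inter-class scatter.

    class_means is an argument rather than a batch statistic, mirroring the idea
    that the mean vectors are learnable variables updated alongside the network.
    """
    # Intra-class scatter: squared distance of each sample to its class mean.
    intra = np.sum((features - class_means[labels]) ** 2)
    # Inter-class scatter: squared distance of each class mean to the mean of means.
    center = class_means.mean(axis=0)
    inter = np.sum((class_means - center) ** 2)
    return (intra - weight * inter) / len(features)

def triplet_ranking_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss asking each anchor to be closer to its positive than its negative."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return np.mean(np.maximum(0.0, margin + d_pos - d_neg))

def binarize(codes):
    """Turn real-valued hash outputs into bits by thresholding at zero (assumed)."""
    return (codes > 0).astype(np.uint8)

def hamming_rank(query_bits, database_bits):
    """Rank database codes by Hamming distance to one query code."""
    distances = np.count_nonzero(database_bits != query_bits, axis=1)
    return np.argsort(distances, kind="stable"), distances

# Hypothetical example: one image code queried against three video codes.
image_code = binarize(np.array([0.7, -0.1, 0.4, 0.9]))
video_codes = binarize(np.array([[0.5, -0.3, 0.2, 0.8],
                                 [-0.6, 0.4, -0.2, -0.1],
                                 [0.3, -0.9, -0.5, 0.6]]))
order, distances = hamming_rank(image_code, video_codes)
```

Because both modalities end up as bit vectors of the same length, the same `hamming_rank` call would serve query-by-image video retrieval and query-by-video image retrieval; only the roles of query and database are swapped.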
Translated title of the contribution | Cross-Modality Face Retrieval Based on Heterogeneous Hashing Network |
---|---|
Original language | Chinese (Traditional) |
Pages (from-to) | 73-84 |
Number of pages | 12 |
Journal | Jisuanji Xuebao/Chinese Journal of Computers |
Volume | 42 |
Issue number | 1 |
DOIs | |
Publication status | Published - 1 Jan 2019 |