Deep attribute-preserving metric learning for natural language object retrieval

  • Jianan Li
  • Yunchao Wei
  • Xiaodan Liang
  • Fang Zhao
  • Jianshu Li
  • Tingfa Xu*
  • Jiashi Feng

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

22 Citations (Scopus)

Abstract

Retrieving image content with a natural language expression is an emerging interdisciplinary problem at the intersection of multimedia, natural language processing and artificial intelligence. Existing methods tackle this challenging problem by learning features from the visual and linguistic domains independently, while the critical semantic correlations bridging the two domains have been under-explored in the feature learning process. In this paper, we propose to exploit sharable semantic attributes as "anchors" to ensure that the learned features are well aligned across domains for better object retrieval. We define "attributes" as the common concepts that are informative for object retrieval and can be easily learned from both visual content and language expressions. In particular, diverse and complex attributes (e.g., location, color, category, interaction between object and context) are modeled and incorporated to promote cross-domain alignment for feature learning from multiple perspectives. Based on the sharable attributes, we propose a deep Attribute-Preserving Metric learning (AP-Metric) framework that jointly generates unique query-sensitive region proposals and conducts novel cross-modal feature learning that explicitly pursues consistency over semantic attribute abstraction within both domains for deep metric learning. Benefiting from the cross-modal semantic correlations, our proposed framework can accurately localize challenging visual objects that match complex query expressions against cluttered backgrounds. The overall framework is end-to-end trainable. Extensive evaluations on popular datasets including ReferItGame [18], RefCOCO, and RefCOCO+ [43] demonstrate its superiority. Notably, it achieves state-of-the-art performance on the challenging ReferItGame dataset.

Original language: English
Title of host publication: MM 2017 - Proceedings of the 2017 ACM Multimedia Conference
Publisher: Association for Computing Machinery, Inc
Pages: 181-189
Number of pages: 9
ISBN (Electronic): 9781450349062
Publication status: Published - 23 Oct 2017
Event: 25th ACM International Conference on Multimedia, MM 2017 - Mountain View, United States
Duration: 23 Oct 2017 → 27 Oct 2017

Publication series

Name: MM 2017 - Proceedings of the 2017 ACM Multimedia Conference

Conference

Conference: 25th ACM International Conference on Multimedia, MM 2017
Country/Territory: United States
City: Mountain View
Period: 23/10/17 → 27/10/17

Keywords

  • Attribute
  • Cross-modal
  • Object retrieval
