TY - GEN
T1 - Deep attribute-preserving metric learning for natural language object retrieval
AU - Li, Jianan
AU - Wei, Yunchao
AU - Liang, Xiaodan
AU - Zhao, Fang
AU - Li, Jianshu
AU - Xu, Tingfa
AU - Feng, Jiashi
N1 - Publisher Copyright:
© 2017 Association for Computing Machinery.
PY - 2017/10/23
Y1 - 2017/10/23
N2 - Retrieving image content with a natural language expression is an emerging interdisciplinary problem at the intersection of multimedia, natural language processing, and artificial intelligence. Existing methods tackle this challenging problem by learning features from the visual and linguistic domains independently, while the critical semantic correlations bridging the two domains remain under-explored in the feature learning process. In this paper, we propose to exploit sharable semantic attributes as "anchors" to ensure that the learned features are well aligned across domains for better object retrieval. We define "attributes" as the common concepts that are informative for object retrieval and can be easily learned from both visual content and language expressions. In particular, diverse and complex attributes (e.g., location, color, category, and the interaction between object and context) are modeled and incorporated to promote cross-domain alignment for feature learning from multiple perspectives. Based on these sharable attributes, we propose a deep Attribute-Preserving Metric learning (AP-Metric) framework that jointly generates query-sensitive region proposals and conducts novel cross-modal feature learning that explicitly pursues consistency over semantic attribute abstraction within both domains for deep metric learning. Benefiting from the cross-modal semantic correlations, the proposed framework can accurately localize challenging visual objects matching complex query expressions within cluttered backgrounds. The overall framework is end-to-end trainable. Extensive evaluations on popular datasets including ReferItGame [18], RefCOCO, and RefCOCO+ [43] demonstrate its superiority. Notably, it achieves state-of-the-art performance on the challenging ReferItGame dataset.
KW - Attribute
KW - Cross-modal
KW - Object retrieval
UR - http://www.scopus.com/inward/record.url?scp=85035193259&partnerID=8YFLogxK
U2 - 10.1145/3123266.3123439
DO - 10.1145/3123266.3123439
M3 - Conference contribution
AN - SCOPUS:85035193259
T3 - MM 2017 - Proceedings of the 2017 ACM Multimedia Conference
SP - 181
EP - 189
BT - MM 2017 - Proceedings of the 2017 ACM Multimedia Conference
PB - Association for Computing Machinery, Inc
T2 - 25th ACM International Conference on Multimedia, MM 2017
Y2 - 23 October 2017 through 27 October 2017
ER -