Visual-Semantic Graph Matching for Visual Grounding

Chenchen Jing, Yuwei Wu*, Mingtao Pei, Yao Hu, Yunde Jia, Qi Wu

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

24 Citations (Scopus)

Abstract

Visual grounding is the task of associating entities in a natural language sentence with objects in an image. In this paper, we formulate visual grounding as a graph matching problem: finding node correspondences between a visual scene graph and a language scene graph. These two graphs are heterogeneous, representing the structural layouts of the image and the sentence, respectively. We learn unified contextual node representations of the two graphs by using a cross-modal graph convolutional network to reduce their discrepancy. The graph matching is thus relaxed to a linear assignment problem, because the learned node representations characterize both node information and structure information. A permutation loss and a semantic cycle-consistency loss are further introduced to solve the linear assignment problem with or without ground-truth correspondences. Experimental results on two visual grounding tasks, i.e., referring expression comprehension and phrase localization, demonstrate the effectiveness of our method.
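The relaxation described in the abstract can be illustrated with a toy sketch: once each node of the two graphs has a contextual embedding, matching reduces to picking the one-to-one assignment that maximizes total similarity. The similarity matrix below is invented for illustration (the paper's actual scores are learned); the brute-force search stands in for a proper linear assignment solver such as the Hungarian algorithm.

```python
from itertools import permutations

def match_nodes(sim):
    """Brute-force linear assignment: find the permutation of visual
    nodes that maximizes total similarity to the language nodes.
    (Exponential in n; real systems use the Hungarian algorithm.)"""
    n = len(sim)
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(n)):
        score = sum(sim[i][perm[i]] for i in range(n))
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score

# Toy similarity matrix between 3 language-graph nodes (rows) and
# 3 visual-graph nodes (columns); the values are made up.
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.7, 0.4],
]
perm, score = match_nodes(sim)
print(perm)  # perm[i] is the visual node assigned to language node i
```

Here language node 0 is matched to visual node 0, node 1 to visual node 2, and node 2 to visual node 1, since that assignment attains the largest total similarity.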

Original language: English
Title of host publication: MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 4041-4050
Number of pages: 10
ISBN (electronic): 9781450379885
DOI
Publication status: Published - 12 Oct 2020
Event: 28th ACM International Conference on Multimedia, MM 2020 - Virtual, Online, United States
Duration: 12 Oct 2020 → 16 Oct 2020

Publication series

Name: MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia

Conference

Conference: 28th ACM International Conference on Multimedia, MM 2020
Country/Territory: United States
City: Virtual, Online
Period: 12/10/20 → 16/10/20
