Visual-Semantic Graph Matching for Visual Grounding

Chenchen Jing, Yuwei Wu*, Mingtao Pei, Yao Hu, Yunde Jia, Qi Wu

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

24 Citations (Scopus)

Abstract

Visual grounding is the task of associating entities in a natural language sentence with objects in an image. In this paper, we formulate visual grounding as a graph matching problem: finding node correspondences between a visual scene graph and a language scene graph. These two graphs are heterogeneous, representing the structural layouts of the image and the sentence, respectively. We learn unified contextual node representations of the two graphs by using a cross-modal graph convolutional network to reduce their discrepancy. The graph matching is thus relaxed to a linear assignment problem, because the learned node representations characterize both node information and structure information. A permutation loss and a semantic cycle-consistency loss are further introduced to solve the linear assignment problem with or without ground-truth correspondences. Experimental results on two visual grounding tasks, i.e., referring expression comprehension and phrase localization, demonstrate the effectiveness of our method.
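The relaxation described in the abstract can be illustrated with a toy sketch: once each node of the two graphs has a contextual embedding, matching reduces to picking the one-to-one assignment that maximizes total similarity. The similarity matrix below is invented for illustration (the paper's actual scores are learned); the brute-force search stands in for a proper linear assignment solver such as the Hungarian algorithm.

```python
from itertools import permutations

def match_nodes(sim):
    """Brute-force linear assignment: find the permutation of visual
    nodes that maximizes total similarity to the language nodes.
    (Exponential in n; real systems use the Hungarian algorithm.)"""
    n = len(sim)
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(n)):
        score = sum(sim[i][perm[i]] for i in range(n))
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score

# Toy similarity matrix between 3 language-graph nodes (rows) and
# 3 visual-graph nodes (columns); the values are made up.
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.7, 0.4],
]
perm, score = match_nodes(sim)
print(perm)  # perm[i] is the visual node assigned to language node i
```

Here language node 0 is matched to visual node 0, node 1 to visual node 2, and node 2 to visual node 1, since that assignment attains the largest total similarity.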

Original language: English
Title of host publication: MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 4041-4050
Number of pages: 10
ISBN (electronic): 9781450379885
DOI
Publication status: Published - 12 Oct 2020
Event: 28th ACM International Conference on Multimedia, MM 2020 - Virtual, Online, United States
Duration: 12 Oct 2020 → 16 Oct 2020

Publication series

Name: MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia

Conference

Conference: 28th ACM International Conference on Multimedia, MM 2020
Country/Territory: United States
City: Virtual, Online
Period: 12/10/20 → 16/10/20
