TY - GEN
T1 - Visual-Semantic Graph Matching for Visual Grounding
AU - Jing, Chenchen
AU - Wu, Yuwei
AU - Pei, Mingtao
AU - Hu, Yao
AU - Jia, Yunde
AU - Wu, Qi
N1 - Publisher Copyright:
© 2020 ACM.
PY - 2020/10/12
Y1 - 2020/10/12
N2 - Visual grounding is the task of associating entities in a natural language sentence with objects in an image. In this paper, we formulate visual grounding as a graph matching problem of finding node correspondences between a visual scene graph and a language scene graph. These two graphs are heterogeneous, representing the structural layouts of the image and the sentence, respectively. We learn unified contextual node representations of the two graphs by using a cross-modal graph convolutional network to reduce their discrepancy. The graph matching is thus relaxed to a linear assignment problem because the learned node representations characterize both node information and structure information. A permutation loss and a semantic cycle-consistency loss are further introduced to solve the linear assignment problem with or without ground-truth correspondences. Experimental results on two visual grounding tasks, i.e., referring expression comprehension and phrase localization, demonstrate the effectiveness of our method.
AB - Visual grounding is the task of associating entities in a natural language sentence with objects in an image. In this paper, we formulate visual grounding as a graph matching problem of finding node correspondences between a visual scene graph and a language scene graph. These two graphs are heterogeneous, representing the structural layouts of the image and the sentence, respectively. We learn unified contextual node representations of the two graphs by using a cross-modal graph convolutional network to reduce their discrepancy. The graph matching is thus relaxed to a linear assignment problem because the learned node representations characterize both node information and structure information. A permutation loss and a semantic cycle-consistency loss are further introduced to solve the linear assignment problem with or without ground-truth correspondences. Experimental results on two visual grounding tasks, i.e., referring expression comprehension and phrase localization, demonstrate the effectiveness of our method.
KW - graph matching
KW - language scene graph
KW - visual grounding
KW - visual scene graph
UR - http://www.scopus.com/inward/record.url?scp=85106671324&partnerID=8YFLogxK
U2 - 10.1145/3394171.3413902
DO - 10.1145/3394171.3413902
M3 - Conference contribution
AN - SCOPUS:85106671324
T3 - MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia
SP - 4041
EP - 4050
BT - MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
T2 - 28th ACM International Conference on Multimedia, MM 2020
Y2 - 12 October 2020 through 16 October 2020
ER -