TY - GEN
T1 - Joint Learning of Object Graph and Relation Graph for Visual Question Answering
AU - Li, Hao
AU - Li, Xu
AU - Karimi, Belhal
AU - Chen, Jie
AU - Sun, Mingming
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Modeling visual question answering (VQA) through scene graphs can significantly improve reasoning accuracy and interpretability. However, existing models answer poorly on complex reasoning questions involving attributes or relations, leading to false attribute selection or missing relations, as illustrated in Figure 1(a). This is because these models cannot balance all kinds of information in scene graphs, neglecting relation and attribute information. In this paper, we introduce a novel Dual Message-passing enhanced Graph Neural Network (DM-GNN), which obtains a balanced representation by properly encoding multi-scale scene graph information. Specifically, we (i) transform the scene graph into two graphs with diversified focuses on objects and relations, then design a dual structure to encode them, which increases the weight of relation information; (ii) fuse the encoder output with attribute features, which increases the weight of attribute information; (iii) propose a message-passing mechanism to enhance the information transfer between objects, relations and attributes. We conduct extensive experiments on datasets including GQA, VG and motif-VG, and achieve a new state of the art.
AB - Modeling visual question answering (VQA) through scene graphs can significantly improve reasoning accuracy and interpretability. However, existing models answer poorly on complex reasoning questions involving attributes or relations, leading to false attribute selection or missing relations, as illustrated in Figure 1(a). This is because these models cannot balance all kinds of information in scene graphs, neglecting relation and attribute information. In this paper, we introduce a novel Dual Message-passing enhanced Graph Neural Network (DM-GNN), which obtains a balanced representation by properly encoding multi-scale scene graph information. Specifically, we (i) transform the scene graph into two graphs with diversified focuses on objects and relations, then design a dual structure to encode them, which increases the weight of relation information; (ii) fuse the encoder output with attribute features, which increases the weight of attribute information; (iii) propose a message-passing mechanism to enhance the information transfer between objects, relations and attributes. We conduct extensive experiments on datasets including GQA, VG and motif-VG, and achieve a new state of the art.
KW - Graph Neural Network
KW - Scene Graph
KW - Visual Question Answering
UR - http://www.scopus.com/inward/record.url?scp=85137713956&partnerID=8YFLogxK
U2 - 10.1109/ICME52920.2022.9859766
DO - 10.1109/ICME52920.2022.9859766
M3 - Conference contribution
AN - SCOPUS:85137713956
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
BT - ICME 2022 - IEEE International Conference on Multimedia and Expo 2022, Proceedings
PB - IEEE Computer Society
T2 - 2022 IEEE International Conference on Multimedia and Expo, ICME 2022
Y2 - 18 July 2022 through 22 July 2022
ER -