TY - GEN
T1 - Maintaining Reasoning Consistency in Compositional Visual Question Answering
AU - Jing, Chenchen
AU - Jia, Yunde
AU - Wu, Yuwei
AU - Liu, Xinyu
AU - Wu, Qi
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
AB - A compositional question refers to a question that contains multiple visual concepts (e.g., objects, attributes, and relationships) and requires compositional reasoning to answer. Existing VQA models can answer a compositional question well, but cannot work well in terms of reasoning consistency in answering the compositional question and its sub-questions. For example, a compositional question for an image is: 'Are there any elephants to the right of the white bird?' and one of its sub-questions is 'Is any bird visible in the scene?'. The models may answer 'yes' to the compositional question, but 'no' to the sub-question. This paper presents a dialog-like reasoning method for maintaining reasoning consistency in answering a compositional question and its sub-questions. Our method integrates the reasoning processes for the sub-questions into the reasoning process for the compositional question like a dialog task, and uses a consistency constraint to penalize inconsistent answer predictions. In order to enable quantitative evaluation of reasoning consistency, we construct a GQA-Sub dataset based on the well-organized GQA dataset. Experimental results on the GQA dataset and the GQA-Sub dataset demonstrate the effectiveness of our method.
KW - Vision + language
KW - Visual reasoning
UR - http://www.scopus.com/inward/record.url?scp=85141746478&partnerID=8YFLogxK
U2 - 10.1109/CVPR52688.2022.00504
DO - 10.1109/CVPR52688.2022.00504
M3 - Conference contribution
AN - SCOPUS:85141746478
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 5089
EP - 5098
BT - Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
PB - IEEE Computer Society
T2 - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Y2 - 19 June 2022 through 24 June 2022
ER -