TY - GEN
T1 - A Self-supervised Strategy for the Robustness of VQA Models
AU - Su, Jingyu
AU - Li, Chuanhao
AU - Jing, Chenchen
AU - Wu, Yuwei
N1 - Publisher Copyright:
© 2022, IFIP International Federation for Information Processing.
PY - 2022
Y1 - 2022
N2 - In visual question answering (VQA), most existing models suffer from language biases that make them less robust. Recently, many approaches have been proposed to alleviate language biases by generating samples for the VQA task. These methods require the model to distinguish original samples from synthetic samples, ensuring that the model fully understands both the visual and linguistic modalities rather than merely predicting answers based on language biases. However, these models are still not sufficiently sensitive to changes in the key information of questions. To make full use of the key information in questions, we design a self-supervised strategy that focuses on the nouns of questions to enhance the robustness of VQA models. Its auxiliary training process, which predicts answers for synthetic samples generated by masking the last noun in each question, alleviates the negative influence of language biases. Experiments conducted on the VQA-CP v2 and VQA v2 datasets show that our method achieves better results than other VQA models.
AB - In visual question answering (VQA), most existing models suffer from language biases that make them less robust. Recently, many approaches have been proposed to alleviate language biases by generating samples for the VQA task. These methods require the model to distinguish original samples from synthetic samples, ensuring that the model fully understands both the visual and linguistic modalities rather than merely predicting answers based on language biases. However, these models are still not sufficiently sensitive to changes in the key information of questions. To make full use of the key information in questions, we design a self-supervised strategy that focuses on the nouns of questions to enhance the robustness of VQA models. Its auxiliary training process, which predicts answers for synthetic samples generated by masking the last noun in each question, alleviates the negative influence of language biases. Experiments conducted on the VQA-CP v2 and VQA v2 datasets show that our method achieves better results than other VQA models.
KW - Language bias
KW - Self-supervised learning
KW - Visual question answering
UR - http://www.scopus.com/inward/record.url?scp=85132045039&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-03948-5_23
DO - 10.1007/978-3-031-03948-5_23
M3 - Conference contribution
AN - SCOPUS:85132045039
SN - 9783031039478
T3 - IFIP Advances in Information and Communication Technology
SP - 290
EP - 298
BT - Intelligent Information Processing XI - 12th IFIP TC 12 International Conference, IIP 2022, Proceedings
A2 - Shi, Zhongzhi
A2 - Zucker, Jean-Daniel
A2 - An, Bo
PB - Springer Science and Business Media Deutschland GmbH
T2 - 12th IFIP TC 12 International Conference on Intelligent Information Processing, IIP 2022
Y2 - 27 May 2022 through 30 May 2022
ER -