TY - JOUR
T1 - Adversarial Sample Synthesis for Visual Question Answering
AU - Li, Chuanhao
AU - Jing, Chenchen
AU - Li, Zhen
AU - Wu, Yuwei
AU - Jia, Yunde
N1 - Publisher Copyright: © 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2024/11/21
Y1 - 2024/11/21
N2 - Language priors are a major obstacle to improving the generalization of visual question answering (VQA) models. Recent work has revealed that synthesizing extra training samples to balance training sets is a promising way to alleviate language priors. However, most existing methods synthesize extra samples independently of the training process, neglecting the fact that the language priors memorized by VQA models change during training, which leaves the synthesized samples insufficient. In this article, we propose an adversarial sample synthesis method, which synthesizes different adversarial samples by adversarial masking at different training epochs to cope with the changing memorized language priors. The basic idea behind our method is to use adversarial masking to synthesize adversarial samples that cause the model to give wrong answers. To this end, we design a generative module that carries out adversarial masking by attacking the VQA model, and we introduce a bias-oriented objective to supervise the training of the generative module. We couple the sample synthesis with the training process of the VQA model, which ensures that the samples synthesized at different training epochs are beneficial to the VQA model. We incorporated the proposed method into three VQA models (UpDn, LMH, and LXMERT) and conducted experiments on three datasets (VQA-CP v1, VQA-CP v2, and VQA v2). Experimental results demonstrate a large improvement from our method, such as a 16.22% gain for LXMERT in overall accuracy on VQA-CP v2.
AB - Language priors are a major obstacle to improving the generalization of visual question answering (VQA) models. Recent work has revealed that synthesizing extra training samples to balance training sets is a promising way to alleviate language priors. However, most existing methods synthesize extra samples independently of the training process, neglecting the fact that the language priors memorized by VQA models change during training, which leaves the synthesized samples insufficient. In this article, we propose an adversarial sample synthesis method, which synthesizes different adversarial samples by adversarial masking at different training epochs to cope with the changing memorized language priors. The basic idea behind our method is to use adversarial masking to synthesize adversarial samples that cause the model to give wrong answers. To this end, we design a generative module that carries out adversarial masking by attacking the VQA model, and we introduce a bias-oriented objective to supervise the training of the generative module. We couple the sample synthesis with the training process of the VQA model, which ensures that the samples synthesized at different training epochs are beneficial to the VQA model. We incorporated the proposed method into three VQA models (UpDn, LMH, and LXMERT) and conducted experiments on three datasets (VQA-CP v1, VQA-CP v2, and VQA v2). Experimental results demonstrate a large improvement from our method, such as a 16.22% gain for LXMERT in overall accuracy on VQA-CP v2.
KW - Adversarial Masking
KW - Adversarial Sample Synthesis
KW - Language Priors
KW - Visual Question Answering
UR - http://www.scopus.com/inward/record.url?scp=85217851883&partnerID=8YFLogxK
U2 - 10.1145/3688848
DO - 10.1145/3688848
M3 - Article
AN - SCOPUS:85217851883
SN - 1551-6857
VL - 20
JO - ACM Transactions on Multimedia Computing, Communications and Applications
JF - ACM Transactions on Multimedia Computing, Communications and Applications
IS - 12
M1 - 378
ER -