Adversarial Sample Synthesis for Visual Question Answering

Chuanhao Li, Chenchen Jing, Zhen Li, Yuwei Wu*, Yunde Jia

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Language priors are a major obstacle to improving the generalization of visual question answering (VQA) models. Recent work has revealed that synthesizing extra training samples to balance training sets is a promising way to alleviate language priors. However, most existing methods synthesize extra samples independently of the training process. This neglects the fact that the language priors memorized by VQA models change during training, so the synthesized samples are insufficient. In this article, we propose an adversarial sample synthesis method, which synthesizes different adversarial samples by adversarial masking at different training epochs to cope with the changing memorized language priors. The basic idea behind our method is to use adversarial masking to synthesize adversarial samples that cause the model to give wrong answers. To this end, we design a generative module that carries out adversarial masking by attacking the VQA model, and we introduce a bias-oriented objective to supervise the training of the generative module. We couple the sample synthesis with the training process of the VQA model, which ensures that the samples synthesized at different training epochs are beneficial to the VQA model. We incorporated the proposed method into three VQA models (UpDn, LMH, and LXMERT) and conducted experiments on three datasets (VQA-CP v1, VQA-CP v2, and VQA v2). Experimental results demonstrate that our method yields large improvements, such as a gain of 16.22% for LXMERT in overall accuracy on VQA-CP v2.
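The idea of recomputing adversarial masks against the current model state can be illustrated with a small sketch. This is not the authors' implementation: the abstract describes a learned generative module trained with a bias-oriented objective, whereas the sketch below swaps in an exhaustive search over masks, assumes that masking is applied to question tokens, and uses a hypothetical toy question-only scorer. All names (`answer_probs`, `adversarial_mask`, the weight tables) are invented for illustration.

```python
import math
from itertools import combinations

MASK = "[MASK]"

def answer_probs(weights, tokens, answers):
    """Toy question-only scorer (an assumption): softmax over summed token weights."""
    scores = [sum(weights.get((t, a), 0.0) for t in tokens) for a in answers]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss(weights, tokens, answers, target):
    """Cross-entropy of the ground-truth answer under the toy scorer."""
    p = answer_probs(weights, tokens, answers)[answers.index(target)]
    return -math.log(p)

def adversarial_mask(weights, tokens, answers, target, k=1):
    """Search all k-token masks and keep the one that maximizes the loss.

    Stand-in for the paper's generative module, which is learned, not searched.
    """
    best_tokens = list(tokens)
    best_loss = loss(weights, tokens, answers, target)
    for positions in combinations(range(len(tokens)), k):
        masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
        current = loss(weights, masked, answers, target)
        if current > best_loss:
            best_tokens, best_loss = masked, current
    return best_tokens, best_loss

answers = ["yellow", "green"]
question = ["what", "color", "is", "the", "banana"]

# Hypothetical model states at two training epochs: the token the model
# relies on most shifts from "banana" to "color".
epoch_1 = {("banana", "yellow"): 2.0, ("color", "yellow"): 0.5}
epoch_2 = {("banana", "yellow"): 0.5, ("color", "yellow"): 2.0}

sample_1, _ = adversarial_mask(epoch_1, question, answers, "yellow")
sample_2, _ = adversarial_mask(epoch_2, question, answers, "yellow")
print(sample_1)
print(sample_2)
```

Because the mask is chosen against the model's current weights, the two epochs produce different synthesized questions, which is the coupling between synthesis and training that the abstract describes. How the synthesized samples are labeled and used for training is not stated in the abstract and is left out of the sketch.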

Original language: English
Article number: 378
Journal: ACM Transactions on Multimedia Computing, Communications and Applications
Volume: 20
Issue number: 12
Publication status: Published - 21 Nov 2024

Keywords

  • Adversarial Masking
  • Adversarial Sample Synthesis
  • Language Priors
  • Visual Question Answering
