Abstract
Graph neural networks have recently been widely used for vision-to-language tasks and have achieved promising results. In particular, the graph convolutional network (GCN) can capture the spatial and semantic relationships needed for visual question answering (VQA). However, applying a GCN to VQA datasets with different subtasks can lead to varying results. The training and testing set sizes, the evaluation metrics, and the hyperparameters used are further factors that affect VQA results. These factors can be subjected to similar evaluation schemes in order to obtain a fair evaluation of GCN-based results for VQA. This study proposes a GCN framework for VQA based on fine-tuned word representations to handle reasoning-type questions. The framework's performance is evaluated using various performance measures. The results obtained on the GQA and VQA 2.0 datasets slightly outperform most existing methods.
Original language | English |
---|---|
Pages (from-to) | 40361-40370 |
Number of pages | 10 |
Journal | Multimedia Tools and Applications |
Volume | 81 |
Issue number | 28 |
Publication status | Published - Nov 2022 |
Keywords
- Fine-tuned representation
- GCN
- Performance measure
- Reasoning datasets
- VQA