Abstract
In recent years, advances in deep learning have driven significant progress in computer vision and natural language processing, and establishing connections between the two fields has attracted increasing attention. The main focus areas are spatio-temporal images captured by satellites or aircraft and scene images containing people and other objects. Existing methods achieve excellent results in image–text matching, but there is still room for improvement in the effective use of coarse- and fine-grained information. We propose a method that addresses this problem using multi-scale graph convolutional neural networks. Multi-scale features of images and texts are extracted and matched separately: global matching computes the overall image–sentence similarity, while local matching computes local image–word similarity. Local matching proceeds in two stages. First, node-level matching learns the correspondence between regions and words; then, structure-level matching learns the correspondence between regions and phrases, making the matching more comprehensive. Finally, we validate our model on the Flickr30k, MSCOCO and RSICD datasets.
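The abstract's combination of global image–sentence matching with two-stage local (node-level and structure-level) matching can be sketched as follows. This is a minimal illustration under assumed cosine-similarity scoring and mean-of-max aggregation; the function names, the weighting parameter `alpha`, and the feature shapes are hypothetical and not taken from the paper:

```python
import numpy as np

def cosine_sim(a, b):
    # Pairwise cosine similarity between rows of a (m, d) and rows of b (n, d).
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def global_match(img_feat, txt_feat):
    # Overall image-sentence similarity from pooled global feature vectors.
    return float(cosine_sim(img_feat[None, :], txt_feat[None, :])[0, 0])

def local_match(regions, words, phrases):
    # Stage 1 (node level): align each word with its best-matching region.
    node_sim = cosine_sim(words, regions).max(axis=1).mean()
    # Stage 2 (structure level): align each phrase with its best-matching region.
    struct_sim = cosine_sim(phrases, regions).max(axis=1).mean()
    return float(node_sim), float(struct_sim)

def overall_score(img_feat, txt_feat, regions, words, phrases, alpha=0.5):
    # Combine global and (averaged) local similarities; alpha is an assumed weight.
    g = global_match(img_feat, txt_feat)
    n, s = local_match(regions, words, phrases)
    return alpha * g + (1 - alpha) * 0.5 * (n + s)
```

In the paper the region, word, and phrase representations come from multi-scale graph convolutional networks rather than raw features, but the matching structure above mirrors the described global/node-level/structure-level decomposition.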
Original language | English |
---|---|
Pages (from-to) | 292-300 |
Number of pages | 9 |
Journal | Future Generation Computer Systems |
Volume | 142 |
DOIs | |
Publication status | Published - May 2023 |
Keywords
- Feature extraction
- GNN
- Multi-scale
- Multimodal matching