Multi-scale image–text matching network for scene and spatio-temporal images

Runde Yu, Fusheng Jin*, Zhuang Qiao, Ye Yuan, Guoren Wang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

6 Citations (Scopus)

Abstract

In recent years, with the development of deep learning, computer vision and natural language processing have made significant progress, and establishing connections between the two fields has attracted increasing attention. This work focuses on spatio-temporal images taken by satellites or aircraft and on scene images containing people and other objects. Existing methods have achieved excellent results in image–text matching, but there is still room for improvement in the effective use of coarse- and fine-grained information. We propose a method that addresses this problem using multi-scale graph convolutional neural networks. We extract multi-scale features from images and texts separately for matching. Global matching computes the overall image–sentence similarity, while local matching computes image–word similarity. Local matching proceeds in two stages: first, node-level matching learns the correspondence between regions and words; then, structure-level matching learns the correspondence between regions and phrases, making the matching more comprehensive. Finally, we validate our model on the Flickr30k, MSCOCO and RSICD datasets.
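The abstract does not include implementation details, but the matching scheme it describes — a global image–sentence similarity combined with two-stage local matching over region and phrase graphs — can be sketched roughly as follows. This is a minimal PyTorch illustration under assumed shapes and names (GCNLayer, MultiScaleMatcher, and the adjacency inputs are all hypothetical), not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One graph-convolution layer: row-normalized A, then AXW (hypothetical)."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (batch, nodes, dim); adj: (batch, nodes, nodes)
        adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return F.relu(self.fc(torch.bmm(adj, x)))

class MultiScaleMatcher(nn.Module):
    """Sketch of global + node-level + structure-level matching (assumed design)."""
    def __init__(self, dim=512):
        super().__init__()
        self.region_gcn = GCNLayer(dim)  # propagates over the region graph
        self.phrase_gcn = GCNLayer(dim)  # propagates over the phrase graph

    @staticmethod
    def global_similarity(img_global, txt_global):
        # Global matching: cosine similarity of pooled image/sentence embeddings.
        return F.cosine_similarity(img_global, txt_global, dim=-1)

    @staticmethod
    def local_similarity(visual, textual):
        # Node-level matching: for each text unit, keep its best-matching
        # visual unit, then average over text units.
        sim = torch.bmm(F.normalize(visual, dim=-1),
                        F.normalize(textual, dim=-1).transpose(1, 2))
        return sim.max(dim=1).values.mean(dim=-1)

    def forward(self, regions, words, region_adj, phrase_adj,
                img_global, txt_global):
        g = self.global_similarity(img_global, txt_global)     # image–sentence
        node = self.local_similarity(regions, words)           # region–word
        # Structure level: refine features over the graphs, then match
        # regions against phrase-level text representations.
        r = self.region_gcn(regions, region_adj)
        p = self.phrase_gcn(words, phrase_adj)
        struct = self.local_similarity(r, p)                   # region–phrase
        return g + node + struct
```

In practice, a combined similarity score of this kind is typically trained with a hinge-based triplet ranking loss over matched and mismatched image–text pairs; the paper should be consulted for the actual objective and architecture.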

Original language: English
Pages (from-to): 292-300
Number of pages: 9
Journal: Future Generation Computer Systems
Volume: 142
Publication status: Published - May 2023

Keywords

  • Feature extraction
  • GNN
  • Multi-scale
  • Multimodal matching
