Multi-scale image–text matching network for scene and spatio-temporal images

Runde Yu, Fusheng Jin*, Zhuang Qiao, Ye Yuan, Guoren Wang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

6 Citations (Scopus)

Abstract

In recent years, with the development of deep learning, computer vision and natural language processing have made significant progress, and establishing connections between the two fields has attracted increasing attention. This work focuses on spatio-temporal images taken by satellites or aircraft and on scene images containing people and other objects. Existing methods have achieved excellent results in image–text matching, but there is still room for improvement in the effective use of coarse- and fine-grained information. We propose a method that addresses this problem using multi-scale graph convolutional neural networks. We extract multi-scale features from images and texts separately for matching. Global matching computes the overall image–sentence similarity, while local matching computes image–word similarity. Local matching proceeds in two stages: first, node-level matching learns the correspondence between regions and words; then, structure-level matching learns the correspondence between regions and phrases, making the matching more comprehensive. Finally, we validate our model on the Flickr30k, MSCOCO and RSICD datasets.
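The abstract does not include implementation details, but the matching scheme it describes — a global image–sentence similarity combined with two-stage local matching over region and phrase graphs — can be sketched roughly as follows. This is a minimal PyTorch illustration under assumed shapes and names (GCNLayer, MultiScaleMatcher, and the adjacency inputs are all hypothetical), not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One graph-convolution layer: row-normalized A, then AXW (hypothetical)."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (batch, nodes, dim); adj: (batch, nodes, nodes)
        adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return F.relu(self.fc(torch.bmm(adj, x)))

class MultiScaleMatcher(nn.Module):
    """Sketch of global + node-level + structure-level matching (assumed design)."""
    def __init__(self, dim=512):
        super().__init__()
        self.region_gcn = GCNLayer(dim)  # propagates over the region graph
        self.phrase_gcn = GCNLayer(dim)  # propagates over the phrase graph

    @staticmethod
    def global_similarity(img_global, txt_global):
        # Global matching: cosine similarity of pooled image/sentence embeddings.
        return F.cosine_similarity(img_global, txt_global, dim=-1)

    @staticmethod
    def local_similarity(visual, textual):
        # Node-level matching: for each text unit, keep its best-matching
        # visual unit, then average over text units.
        sim = torch.bmm(F.normalize(visual, dim=-1),
                        F.normalize(textual, dim=-1).transpose(1, 2))
        return sim.max(dim=1).values.mean(dim=-1)

    def forward(self, regions, words, region_adj, phrase_adj,
                img_global, txt_global):
        g = self.global_similarity(img_global, txt_global)     # image–sentence
        node = self.local_similarity(regions, words)           # region–word
        # Structure level: refine features over the graphs, then match
        # regions against phrase-level text representations.
        r = self.region_gcn(regions, region_adj)
        p = self.phrase_gcn(words, phrase_adj)
        struct = self.local_similarity(r, p)                   # region–phrase
        return g + node + struct
```

In practice, a combined similarity score of this kind is typically trained with a hinge-based triplet ranking loss over matched and mismatched image–text pairs; the paper should be consulted for the actual objective and architecture.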

Original language: English
Pages (from-to): 292-300
Number of pages: 9
Journal: Future Generation Computer Systems
Volume: 142
Publication status: Published - May 2023

Keywords

  • Feature extraction
  • GNN
  • Multi-scale
  • Multimodal matching
