Exploring Entity-Level Spatial Relationships for Image-Text Matching

Yaxian Xia; Lun Huang; Wenmin Wang; Xiao Yong Wei; Jie Chen

doi:10.1109/ICASSP40776.2020.9054758

Yaxian Xia, Lun Huang, Wenmin Wang^*, Xiao Yong Wei, Jie Chen

^*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Citations (Scopus)

Abstract

Exploring spatial relationships at the entity level (i.e., objects in an image, words in a text) contributes to understanding multimedia content precisely. Ignoring spatial information, as previous works do, probably leads to misunderstandings of image content. For instance, the sentences 'Boats are on the water' and 'Boats are under the water' describe the same objects but correspond to different scenes. To this end, we use the relative position of objects to capture entity-level spatial relationships for image-text matching. Specifically, we fuse the semantic and spatial relationships of image objects in a visual intra-modal relation module. This module improves the understanding of image content and object representation learning, and it helps to capture the entity-level latent correspondence of image-text pairs. The query (text) then serves as textual context to refine the interpretable alignments of image-text pairs in the inter-modal relation module. Our proposed method achieves state-of-the-art results on the MSCOCO and Flickr30K datasets.
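
As a rough illustration of what the relative position of objects could look like in practice, the sketch below computes pairwise geometry features from bounding boxes. It is not the paper's formulation, which the abstract does not specify. The function name `relative_position_features`, the `[x1, y1, x2, y2]` box format, the choice of features (normalized center offsets and log size ratios), and the toy boat and water boxes are all assumptions made for this example.

```python
import numpy as np


def relative_position_features(boxes):
    """Pairwise relative geometry for boxes given as [x1, y1, x2, y2].

    Hypothetical helper, not taken from the paper. Returns an (N, N, 4)
    array whose [i, j] entry describes box j relative to box i:
    (dx, dy, log width ratio, log height ratio).
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    w = boxes[:, 2] - boxes[:, 0]          # box widths
    h = boxes[:, 3] - boxes[:, 1]          # box heights
    cx = boxes[:, 0] + 0.5 * w             # box center x
    cy = boxes[:, 1] + 0.5 * h             # box center y

    # Signed center offsets, normalized by the size of the reference box i.
    dx = (cx[None, :] - cx[:, None]) / w[:, None]
    dy = (cy[None, :] - cy[:, None]) / h[:, None]
    # Scale-invariant size ratios of box j to box i.
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)


# Toy scene (assumed coordinates): a boat sitting above a water region.
boat = [40, 20, 80, 40]
water = [0, 40, 120, 100]
feats = relative_position_features([boat, water])
```

In image coordinates, y grows downward, so a positive `dy` in `feats[0, 1]` places the water below the boat, while a scene with the boat under the water would flip that sign. A signed offset of this kind is one simple way to separate the two example sentences even though they mention the same objects.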

Original language	English
Title of host publication	2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	4452-4456
Number of pages	5
ISBN (Electronic)	9781509066315
DOIs	https://doi.org/10.1109/ICASSP40776.2020.9054758
Publication status	Published - May 2020
Externally published	Yes
Event	2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Barcelona, Spain
Duration	4 May 2020 → 8 May 2020

Publication series

Name	ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume	2020-May
ISSN (Print)	1520-6149

Conference

Conference	2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020
Country/Territory	Spain
City	Barcelona
Period	4/05/20 → 8/05/20

Keywords

Deep learning
entity-level relation
image-text matching
relative position

Cite this

Xia, Y., Huang, L., Wang, W., Wei, X. Y., & Chen, J. (2020). Exploring Entity-Level Spatial Relationships for Image-Text Matching. In 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Proceedings (pp. 4452-4456). Article 9054758 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2020-May). Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/ICASSP40776.2020.9054758

@inproceedings{eb934585ced944639014aa5192650d71,

title = "Exploring Entity-Level Spatial Relationships for Image-Text Matching",

keywords = "Deep learning, entity-level relation, image-text matching, relative position",

author = "Yaxian Xia and Lun Huang and Wenmin Wang and Wei, {Xiao Yong} and Jie Chen",

note = "Publisher Copyright: {\textcopyright} 2020 IEEE.; 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 ; Conference date: 04-05-2020 Through 08-05-2020",

year = "2020",

month = may,

doi = "10.1109/ICASSP40776.2020.9054758",

language = "English",

series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "4452--4456",

booktitle = "2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Proceedings",

address = "United States",

}

TY - GEN

T1 - Exploring Entity-Level Spatial Relationships for Image-Text Matching

AU - Xia, Yaxian

AU - Huang, Lun

AU - Wang, Wenmin

AU - Wei, Xiao Yong

AU - Chen, Jie

PY - 2020/5

Y1 - 2020/5

KW - Deep learning

KW - entity-level relation

KW - image-text matching

KW - relative position

UR - http://www.scopus.com/inward/record.url?scp=85089209212&partnerID=8YFLogxK

U2 - 10.1109/ICASSP40776.2020.9054758

DO - 10.1109/ICASSP40776.2020.9054758

M3 - Conference contribution

AN - SCOPUS:85089209212

T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

SP - 4452

EP - 4456

BT - 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020

Y2 - 4 May 2020 through 8 May 2020

ER -
