Adaptive Image-to-Video Scene Graph Generation via Knowledge Reasoning and Adversarial Learning

Jin Chen, Xiaofeng Ji, Xinxiao Wu^*

^*Corresponding author for this work

School of Computer Science and Technology

Beijing Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

A scene graph of a video conveys a wealth of information about objects and their relationships in the scene, thus benefiting many downstream tasks such as video captioning and visual question answering. Existing methods of scene graph generation require large-scale training videos annotated with objects and relationships in each frame to learn a powerful model. However, such comprehensive annotation is time-consuming and labor-intensive. In contrast, annotating images with scene graphs is much easier and less costly, so we investigate leveraging annotated images to facilitate training a scene graph generation model for unannotated videos, namely image-to-video scene graph generation. This task presents two challenges: 1) inferring unseen dynamic relationships in videos from static relationships in images, given the absence of motion information in images; 2) adapting objects and static relationships from images to video frames, given the domain shift between them. To address the first challenge, we exploit external commonsense knowledge to infer the unseen dynamic relationships from the temporal evolution of static relationships. We tackle the second challenge with hierarchical adversarial learning, which reduces the data distribution discrepancy between images and video frames. Extensive experimental results on two benchmark video datasets demonstrate the effectiveness of our method.
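
The idea of reading a dynamic relationship off the temporal evolution of static relationships can be illustrated with a toy sketch. This is not the authors' model: the rule table `RULES`, the relationship labels, and the helper names `collapse` and `infer_dynamic` are all hypothetical, and a hand-written lookup stands in for the external commonsense knowledge and reasoning that the abstract mentions.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical commonsense rules (assumption, not from the paper):
# an ordered sequence of static relationships maps to one dynamic relationship.
RULES: Dict[Tuple[str, ...], str] = {
    ("far from", "near"): "move toward",
    ("near", "far from"): "move away from",
    ("behind", "beside", "in front of"): "move past",
}


def collapse(static_per_frame: List[str]) -> Tuple[str, ...]:
    """Drop consecutive repeats so only changes in the static relationship remain."""
    changes: List[str] = []
    for rel in static_per_frame:
        if not changes or changes[-1] != rel:
            changes.append(rel)
    return tuple(changes)


def infer_dynamic(static_per_frame: List[str]) -> Optional[str]:
    """Return the dynamic relationship implied by the per-frame static labels,
    or None when no rule in the toy table matches."""
    return RULES.get(collapse(static_per_frame))


# Per-frame static relationships for one subject-object pair in a short clip.
frames = ["far from", "far from", "near", "near"]
print(infer_dynamic(frames))
```

Collapsing repeated labels first means the lookup depends only on the order of changes, not on how many frames each static relationship lasts; a learned model would replace the fixed table in any realistic setting.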

Original language	English
Title of host publication	AAAI-22 Technical Tracks 1
Publisher	Association for the Advancement of Artificial Intelligence
Pages	276-284
Number of pages	9
ISBN (Electronic)	1577358767, 9781577358763
Publication status	Published - 30 Jun 2022
Event	36th AAAI Conference on Artificial Intelligence, AAAI 2022 - Virtual, Online (Duration: 22 Feb 2022 → 1 Mar 2022)

Publication series

Name	Proceedings of the 36th AAAI Conference on Artificial Intelligence, AAAI 2022
Volume	36

Conference

Conference	36th AAAI Conference on Artificial Intelligence, AAAI 2022
City	Virtual, Online
Period	22/02/22 → 1/03/22

Cite this

Chen, J., Ji, X., & Wu, X. (2022). Adaptive Image-to-Video Scene Graph Generation via Knowledge Reasoning and Adversarial Learning. In AAAI-22 Technical Tracks 1 (pp. 276-284). (Proceedings of the 36th AAAI Conference on Artificial Intelligence, AAAI 2022; Vol. 36). Association for the Advancement of Artificial Intelligence.

@inproceedings{b4eed5ac2a5d49b183cbe93861812770,

title = "Adaptive Image-to-Video Scene Graph Generation via Knowledge Reasoning and Adversarial Learning",

author = "Jin Chen and Xiaofeng Ji and Xinxiao Wu",

note = "Publisher Copyright: Copyright {\textcopyright} 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.; 36th AAAI Conference on Artificial Intelligence, AAAI 2022 ; Conference date: 22-02-2022 Through 01-03-2022",

year = "2022",

month = jun,

day = "30",

language = "English",

series = "Proceedings of the 36th AAAI Conference on Artificial Intelligence, AAAI 2022",

publisher = "Association for the Advancement of Artificial Intelligence",

pages = "276--284",

booktitle = "AAAI-22 Technical Tracks 1",

}

Chen, J, Ji, X & Wu, X 2022, Adaptive Image-to-Video Scene Graph Generation via Knowledge Reasoning and Adversarial Learning. in AAAI-22 Technical Tracks 1. Proceedings of the 36th AAAI Conference on Artificial Intelligence, AAAI 2022, vol. 36, Association for the Advancement of Artificial Intelligence, pp. 276-284, 36th AAAI Conference on Artificial Intelligence, AAAI 2022, Virtual, Online, 22/02/22.

Adaptive Image-to-Video Scene Graph Generation via Knowledge Reasoning and Adversarial Learning. / Chen, Jin; Ji, Xiaofeng; Wu, Xinxiao.
AAAI-22 Technical Tracks 1. Association for the Advancement of Artificial Intelligence, 2022. p. 276-284 (Proceedings of the 36th AAAI Conference on Artificial Intelligence, AAAI 2022; Vol. 36).

TY - GEN

T1 - Adaptive Image-to-Video Scene Graph Generation via Knowledge Reasoning and Adversarial Learning

AU - Chen, Jin

AU - Ji, Xiaofeng

AU - Wu, Xinxiao

PY - 2022/6/30

Y1 - 2022/6/30

UR - http://www.scopus.com/inward/record.url?scp=85147718362&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85147718362

T3 - Proceedings of the 36th AAAI Conference on Artificial Intelligence, AAAI 2022

SP - 276

EP - 284

BT - AAAI-22 Technical Tracks 1

PB - Association for the Advancement of Artificial Intelligence

T2 - 36th AAAI Conference on Artificial Intelligence, AAAI 2022

Y2 - 22 February 2022 through 1 March 2022

ER -
