Exploring Spatial-Temporal Instance Relationships in an Intermediate Domain for Image-to-Video Object Detection

Zihan Wen; Jin Chen; Xinxiao Wu

doi:10.1007/978-3-031-27066-6_25

Zihan Wen, Jin Chen, Xinxiao Wu^*

^*Corresponding author for this work

School of Computer Science and Technology

Beijing Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Image-to-video object detection leverages annotated images to help detect objects in unannotated videos, reducing the heavy dependency on expensive annotation of large-scale video frames. This task is challenging because of the serious domain discrepancy between images and video frames caused by appearance variance and motion blur. Previous methods perform both image-level and instance-level alignments to reduce the domain discrepancy, but false instance alignments may limit their performance in real scenarios. We propose a novel spatial-temporal graph that models the contextual relationships between instances to alleviate these false alignments. Through message propagation over the graph, the visual information from spatially and temporally neighboring object proposals is adaptively aggregated to enhance the current instance representation. Moreover, to adapt the source-biased decision boundary to the target data, we generate an intermediate domain between images and frames. Our method can be easily applied as a plug-and-play component to other image-to-video object detection models based on instance alignment. Experiments on several datasets demonstrate the effectiveness of our method. Code will be available at: https://github.com/wenzihan/STMP.
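The message propagation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name propagate, the neighbor rule (proposals within a frame window), the dot-product attention, and the residual mixing weight are all assumptions chosen only to show how features from spatial and temporal neighbors might be adaptively aggregated into the current instance representation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def propagate(features, frames, window=1, mix=0.5):
    """One round of message passing over a hypothetical spatial-temporal graph.

    features: one feature vector (list of floats) per object proposal
    frames:   frame index of each proposal
    window:   proposals at most this many frames apart are neighbors (assumption)
    mix:      weight given to the aggregated neighbor message (assumption)
    """
    updated = []
    for i, feat in enumerate(features):
        # Spatial neighbors share the frame; temporal neighbors lie in nearby frames.
        nbrs = [j for j in range(len(features))
                if j != i and abs(frames[j] - frames[i]) <= window]
        if not nbrs:
            # Isolated proposal: keep its representation unchanged.
            updated.append(list(feat))
            continue
        # Adaptive weights: more similar neighbors contribute more.
        weights = softmax([dot(feat, features[j]) for j in nbrs])
        message = [sum(w * features[j][d] for w, j in zip(weights, nbrs))
                   for d in range(len(feat))]
        # Residual-style update of the current instance representation.
        updated.append([(1 - mix) * f + mix * m for f, m in zip(feat, message)])
    return updated
```

In this sketch a proposal with no neighbors keeps its original feature, while a proposal with equally similar neighbors receives their plain average as its message.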

Original language	English
Title of host publication	Computer Vision – ACCV 2022 Workshops - 16th Asian Conference on Computer Vision, Revised Selected Papers
Editors	Yinqiang Zheng, Hacer Yalim Keleş, Piotr Koniusz
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	360-375
Number of pages	16
ISBN (Print)	9783031270659
DOIs	https://doi.org/10.1007/978-3-031-27066-6_25
Publication status	Published - 2023
Event	16th Asian Conference on Computer Vision, ACCV 2022 - Macao, China. Duration: 4 Dec 2022 → 8 Dec 2022

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	13848 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	16th Asian Conference on Computer Vision, ACCV 2022
Country/Territory	China
City	Macao
Period	4/12/22 → 8/12/22

Keywords

Deep learning
Domain adaptation
Object detection

Cite this

Wen, Z., Chen, J., & Wu, X. (2023). Exploring Spatial-Temporal Instance Relationships in an Intermediate Domain for Image-to-Video Object Detection. In Y. Zheng, H. Y. Keleş, & P. Koniusz (Eds.), Computer Vision – ACCV 2022 Workshops - 16th Asian Conference on Computer Vision, Revised Selected Papers (pp. 360-375). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 13848 LNCS). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-27066-6_25

@inproceedings{b20f23efea8546d5be54293fd8a918ec,

title = "Exploring Spatial-Temporal Instance Relationships in an Intermediate Domain for Image-to-Video Object Detection",

abstract = "Image-to-video object detection leverages annotated images to help detect objects in unannotated videos, so as to break the heavy dependency on the expensive annotation of large-scale video frames. This task is extremely challenging due to the serious domain discrepancy between images and video frames caused by appearance variance and motion blur. Previous methods perform both image-level and instance-level alignments to reduce the domain discrepancy, but the existing false instance alignments may limit their performance in real scenarios. We propose a novel spatial-temporal graph to model the contextual relationships between instances to alleviate the false alignments. Through message propagation over the graph, the visual information from the spatial and temporal neighboring object proposals are adaptively aggregated to enhance the current instance representation. Moreover, to adapt the source-biased decision boundary to the target data, we generate an intermediate domain between images and frames. It is worth mentioning that our method can be easily applied as a plug-and-play component to other image-to-video object detection models based on the instance alignment. Experiments on several datasets demonstrate the effectiveness of our method. Code will be available at: https://github.com/wenzihan/STMP.",

keywords = "Deep learning, Domain adaptation, Object detection",

author = "Zihan Wen and Jin Chen and Xinxiao Wu",

note = "Publisher Copyright: {\textcopyright} 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.; 16th Asian Conference on Computer Vision, ACCV 2022; Conference date: 04-12-2022 Through 08-12-2022",

year = "2023",

doi = "10.1007/978-3-031-27066-6_25",

language = "English",

isbn = "9783031270659",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Science and Business Media Deutschland GmbH",

pages = "360--375",

editor = "Yinqiang Zheng and Kele{\c s}, {Hacer Yalim} and Piotr Koniusz",

booktitle = "Computer Vision – ACCV 2022 Workshops - 16th Asian Conference on Computer Vision, Revised Selected Papers",

address = "Germany",

}

TY - GEN

T1 - Exploring Spatial-Temporal Instance Relationships in an Intermediate Domain for Image-to-Video Object Detection

AU - Wen, Zihan

AU - Chen, Jin

AU - Wu, Xinxiao

PY - 2023

Y1 - 2023

AB - Image-to-video object detection leverages annotated images to help detect objects in unannotated videos, so as to break the heavy dependency on the expensive annotation of large-scale video frames. This task is extremely challenging due to the serious domain discrepancy between images and video frames caused by appearance variance and motion blur. Previous methods perform both image-level and instance-level alignments to reduce the domain discrepancy, but the existing false instance alignments may limit their performance in real scenarios. We propose a novel spatial-temporal graph to model the contextual relationships between instances to alleviate the false alignments. Through message propagation over the graph, the visual information from the spatial and temporal neighboring object proposals are adaptively aggregated to enhance the current instance representation. Moreover, to adapt the source-biased decision boundary to the target data, we generate an intermediate domain between images and frames. It is worth mentioning that our method can be easily applied as a plug-and-play component to other image-to-video object detection models based on the instance alignment. Experiments on several datasets demonstrate the effectiveness of our method. Code will be available at: https://github.com/wenzihan/STMP.

KW - Deep learning

KW - Domain adaptation

KW - Object detection

UR - http://www.scopus.com/inward/record.url?scp=85151054387&partnerID=8YFLogxK

U2 - 10.1007/978-3-031-27066-6_25

DO - 10.1007/978-3-031-27066-6_25

M3 - Conference contribution

AN - SCOPUS:85151054387

SN - 9783031270659

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 360

EP - 375

BT - Computer Vision – ACCV 2022 Workshops - 16th Asian Conference on Computer Vision, Revised Selected Papers

A2 - Zheng, Yinqiang

A2 - Keleş, Hacer Yalim

A2 - Koniusz, Piotr

PB - Springer Science and Business Media Deutschland GmbH

T2 - 16th Asian Conference on Computer Vision, ACCV 2022

Y2 - 4 December 2022 through 8 December 2022

ER -
