TY - GEN
T1 - RT-VIS
T2 - 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024
AU - Cao, Tianze
AU - Zhao, Sanyuan
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025
Y1 - 2025
AB - Video Instance Segmentation (VIS) is a pivotal technology for applications such as autonomous driving and video editing. While existing approaches primarily focus on improving accuracy on benchmark datasets, they often neglect real-time performance. Current online video instance segmentation models can sequentially output instance segmentation results for each frame and associate instances across adjacent frames; however, their inference speed limits their practical application. Additionally, these models require substantial memory during training and inference, which hinders deployment. To address these issues, we introduce the RT-VIS model, which adopts a decoupled strategy for VIS, dividing the model into segmenter and tracker components that are trained independently. To improve inference speed, we develop a new lightweight instance segmentation model and employ a novel tracker to facilitate inter-frame instance association. Our model balances inference speed and accuracy, achieving 22.6 FPS and 42.2 AP on the YouTube-VIS 2019 dataset, with memory requirements during training and inference approximately half those of previous methods. The code is available at https://github.com/STOVAGtz/RT-VIS.
KW - Decoupled framework
KW - Real-time
KW - Video instance segmentation
UR - http://www.scopus.com/inward/record.url?scp=85209548701&partnerID=8YFLogxK
U2 - 10.1007/978-981-97-8792-0_34
DO - 10.1007/978-981-97-8792-0_34
M3 - Conference contribution
AN - SCOPUS:85209548701
SN - 9789819787913
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 485
EP - 499
BT - Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings
A2 - Lin, Zhouchen
A2 - Zha, Hongbin
A2 - Cheng, Ming-Ming
A2 - He, Ran
A2 - Liu, Cheng-Lin
A2 - Ubul, Kurban
A2 - Silamu, Wushouer
A2 - Zhou, Jie
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 18 October 2024 through 20 October 2024
ER -