RT-VIS: Real-Time Video Instance Segmentation with Light-Weight Decoupled Framework

Tianze Cao, Sanyuan Zhao*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Video Instance Segmentation (VIS) is a pivotal technology for various applications including autonomous driving and video editing. While existing approaches primarily focus on enhancing the accuracy on datasets, they often neglect the real-time performance of segmentation results. Current online video instance segmentation models can sequentially output instance segmentation results for each frame and associate instances across adjacent frames; however, their inference speed limits practical applications. Additionally, these models require substantial memory during training and inference, which hinders deployment. To address these issues, we introduce the RT-VIS model, which adopts a decoupled strategy for VIS, allowing the model to be divided into segmenter and tracker components, each trained independently. To enhance inference speed, we developed a new light-weight instance segmentation model and employed a novel tracker to facilitate inter-frame instance association. Our model balances inference speed and accuracy, achieving 22.6 FPS and 42.2 AP on the YouTube-VIS2019 dataset, with memory requirements during training and inference approximately half of the previous methods. The code is available at https://github.com/STOVAGtz/RT-VIS.
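
The decoupled online loop described in the abstract (segment each frame, then associate instances across adjacent frames) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify the RT-VIS tracker, so a plain greedy mask-IoU matcher stands in for it, and the names `mask_iou`, `GreedyIoUTracker`, and `run_online`, the 0.5 threshold, and the pixel-set mask representation are hypothetical rather than taken from the RT-VIS code.

```python
from typing import Callable, Dict, List, Set, Tuple

# Assumption: a mask is the set of (row, col) pixels that belong to one instance.
Mask = Set[Tuple[int, int]]


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two pixel sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


class GreedyIoUTracker:
    """Stand-in tracker (not the RT-VIS tracker): greedy IoU matching
    between the current frame and the previous frame only."""

    def __init__(self, iou_threshold: float = 0.5) -> None:
        self.iou_threshold = iou_threshold
        self.prev: Dict[int, Mask] = {}  # track id -> mask in the previous frame
        self.next_id = 0

    def update(self, masks: List[Mask]) -> List[int]:
        """Return one track id per mask of the current frame."""
        # Score every (current, previous) pair, highest overlap first.
        pairs = sorted(
            ((mask_iou(m, p), i, tid)
             for i, m in enumerate(masks)
             for tid, p in self.prev.items()),
            reverse=True,
        )
        ids = [-1] * len(masks)
        used: Set[int] = set()
        for iou, i, tid in pairs:
            if iou < self.iou_threshold:
                break  # remaining pairs overlap too little to be linked
            if ids[i] == -1 and tid not in used:
                ids[i] = tid
                used.add(tid)
        # Masks left unmatched start new tracks.
        for i in range(len(masks)):
            if ids[i] == -1:
                ids[i] = self.next_id
                self.next_id += 1
        self.prev = dict(zip(ids, masks))
        return ids


def run_online(frames: list,
               segmenter: Callable[[object], List[Mask]],
               tracker: GreedyIoUTracker) -> List[List[int]]:
    """Decoupled loop: the segmenter sees one frame at a time and the
    tracker only sees its output masks."""
    return [tracker.update(segmenter(frame)) for frame in frames]
```

Because the tracker consumes only the segmenter's output masks, either component could be trained or replaced without touching the other, which is the property the decoupled strategy relies on.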

Original language: English
Title of host publication: Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings
Editors: Zhouchen Lin, Hongbin Zha, Ming-Ming Cheng, Ran He, Cheng-Lin Liu, Kurban Ubul, Wushouer Silamu, Jie Zhou
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 485-499
Number of pages: 15
ISBN (Print): 9789819787913
DOI: https://doi.org/10.1007/978-981-97-8792-0_34
Publication status: Published - 2025
Event: 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024 - Urumqi, China
Duration: 18 Oct 2024 to 20 Oct 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 15040 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024
Country/Territory: China
City: Urumqi
Period: 18/10/24 to 20/10/24

Keywords

  • Decoupled framework
  • Real-time
  • Video instance segmentation

Cite this

Cao, T., & Zhao, S. (2025). RT-VIS: Real-Time Video Instance Segmentation with Light-Weight Decoupled Framework. In Z. Lin, H. Zha, M.-M. Cheng, R. He, C.-L. Liu, K. Ubul, W. Silamu, & J. Zhou (Eds.), Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings (pp. 485-499). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 15040 LNCS). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-97-8792-0_34