TY - GEN
T1 - RT-VIS
T2 - 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024
AU - Cao, Tianze
AU - Zhao, Sanyuan
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025
Y1 - 2025
AB - Video Instance Segmentation (VIS) is a pivotal technology for applications such as autonomous driving and video editing. While existing approaches primarily focus on improving accuracy on benchmark datasets, they often neglect real-time performance. Current online video instance segmentation models can sequentially output instance segmentation results for each frame and associate instances across adjacent frames; however, their inference speed limits their practical application. Additionally, these models require substantial memory during training and inference, which hinders deployment. To address these issues, we introduce the RT-VIS model, which adopts a decoupled strategy for VIS, dividing the model into segmenter and tracker components that are trained independently. To improve inference speed, we develop a new lightweight instance segmentation model and employ a novel tracker to facilitate inter-frame instance association. Our model balances inference speed and accuracy, achieving 22.6 FPS and 42.2 AP on the YouTube-VIS 2019 dataset, with memory requirements during training and inference approximately half those of previous methods. The code is available at https://github.com/STOVAGtz/RT-VIS.
KW - Decoupled framework
KW - Real-time
KW - Video instance segmentation
UR - http://www.scopus.com/inward/record.url?scp=85209548701&partnerID=8YFLogxK
U2 - 10.1007/978-981-97-8792-0_34
DO - 10.1007/978-981-97-8792-0_34
M3 - Conference contribution
AN - SCOPUS:85209548701
SN - 9789819787913
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 485
EP - 499
BT - Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings
A2 - Lin, Zhouchen
A2 - Zha, Hongbin
A2 - Cheng, Ming-Ming
A2 - He, Ran
A2 - Liu, Cheng-Lin
A2 - Ubul, Kurban
A2 - Silamu, Wushouer
A2 - Zhou, Jie
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 18 October 2024 through 20 October 2024
ER -