Video Decoupling Networks for Accurate, Efficient, Generalizable, and Robust Video Object Segmentation

  • Jisheng Dang
  • Huicheng Zheng*
  • Yulan Guo*
  • Jianhuang Lai
  • Bin Hu*
  • Tat Seng Chua

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

Abstract

Video object segmentation (VOS) is a fundamental task in video analysis, aiming to accurately recognize and segment objects of interest within video sequences. Conventional methods, relying on memory networks to store single-frame appearance features, face challenges in computational efficiency and capturing dynamic visual information effectively. To address these limitations, we present a Video Decoupling Network (VDN) with a per-clip memory updating mechanism. Our approach is inspired by the dual-stream hypothesis of the human visual cortex and decomposes multiple previous video frames into fundamental elements: scene, motion, and instance. We propose the Unified Prior-based Spatio-temporal Decoupler (UPSD) algorithm, which parses multiple frames into basic elements in a unified manner. UPSD continuously stores elements over time, enabling adaptive integration of different cues based on task requirements. This decomposition mechanism facilitates comprehensive spatial-temporal information capture and rapid updating, leading to notable enhancements in overall VOS performance. Extensive experiments conducted on multiple VOS benchmarks validate the state-of-the-art accuracy, efficiency, generalizability, and robustness of our approach. Remarkably, VDN demonstrates a significant performance improvement and a substantial speed-up compared to previous state-of-the-art methods on multiple VOS benchmarks. It also exhibits excellent generalizability under domain shift and robustness against various noise types.

Original language: English
Pages (from-to): 1218-1230
Number of pages: 13
Journal: IEEE Transactions on Image Processing
Volume: 35
Publication status: Published - 2026
Externally published: Yes

Keywords

  • Video object segmentation
  • adaptive mixture-of-experts reconstruction
  • unified prior-based spatio-temporal decoupler
  • video decoupling networks
