Abstract
Video object segmentation (VOS) is a fundamental task in video analysis, aiming to accurately recognize and segment objects of interest within video sequences. Conventional methods, relying on memory networks to store single-frame appearance features, face challenges in computational efficiency and capturing dynamic visual information effectively. To address these limitations, we present a Video Decoupling Network (VDN) with a per-clip memory updating mechanism. Our approach is inspired by the dual-stream hypothesis of the human visual cortex and decomposes multiple previous video frames into fundamental elements: scene, motion, and instance. We propose the Unified Prior-based Spatio-temporal Decoupler (UPSD) algorithm, which parses multiple frames into basic elements in a unified manner. UPSD continuously stores elements over time, enabling adaptive integration of different cues based on task requirements. This decomposition mechanism facilitates comprehensive spatial-temporal information capture and rapid updating, leading to notable enhancements in overall VOS performance. Extensive experiments conducted on multiple VOS benchmarks validate the state-of-the-art accuracy, efficiency, generalizability, and robustness of our approach. Remarkably, VDN demonstrates a significant performance improvement and a substantial speed-up compared to previous state-of-the-art methods on multiple VOS benchmarks. It also exhibits excellent generalizability under domain shift and robustness against various noise types.
| Original language | English |
|---|---|
| Pages (from-to) | 1218-1230 |
| Number of pages | 13 |
| Journal | IEEE Transactions on Image Processing |
| Volume | 35 |
| DOIs | |
| Publication status | Published - 2026 |
| Externally published | Yes |
Keywords
- Video object segmentation
- adaptive mixture-of-experts reconstruction
- unified prior-based spatio-temporal decoupler
- video decoupling networks
Fingerprint
Dive into the research topics of 'Video Decoupling Networks for Accurate, Efficient, Generalizable, and Robust Video Object Segmentation'. Together they form a unique fingerprint.Cite this
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver