Real-time and light-weighted unsupervised video object segmentation network

Zongji Zhao; Sanyuan Zhao; Jianbing Shen

doi:10.1016/j.patcog.2021.108120

Corresponding author: Sanyuan Zhao

School of Computer Science

Beijing Institute of Technology

Research output: Contribution to journal › Article › Peer-reviewed

57 Citations (Scopus)

Abstract

Video object segmentation is one of the most practical computer vision tasks, especially in the unsupervised case, where no manually labeled segmentation mask is available at the beginning of a video sequence. In this paper, we propose a new real-time unsupervised video object segmentation network. Based on the encoder-decoder framework, we present a Dynamic ASPP module and an RNN-Conv module. The former adds a dynamic selection mechanism to the Atrous Spatial Pyramid Pooling (ASPP) structure, so that features from the dilated convolutional kernels are adaptively selected according to scale through a channel attention mechanism. Compared with directly concatenating the dilated convolutional features, dynamically selecting feature maps reduces the number of parameters and makes the module more efficient. The RNN-Conv module combines RNN units with external convolutional blocks, aggregating the temporal features of a video sequence with the spatial information extracted by the convolutional network. We stack this module to extract deeper spatiotemporal features than a traditional RNN. This module also helps to avoid vanishing and exploding gradients during network training. We test our network on popular video object segmentation datasets. The experimental results demonstrate the effectiveness of our model.
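
The idea of selecting among dilated-convolution branches rather than concatenating them can be illustrated with a toy 1-D sketch. This is not the authors' implementation: the function names, the use of a branch's global average as its attention score, and the parameter formulas (a bias-free 1x1 projection after concatenation versus a two-layer, selective-kernel-style attention with a hypothetical `reduction` ratio) are all assumptions made for illustration.

```python
import math


def dilated_conv1d(signal, kernel, dilation):
    """Same-length 1-D dilated convolution with zero padding (odd kernel size)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * dilation  # taps are spaced `dilation` apart
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_select(branches):
    """Fuse branch outputs by a softmax-weighted sum instead of concatenation.

    Assumption: each branch is scored by its global average, standing in for
    the learned channel-attention layers that a real module would use.
    """
    scores = [sum(b) / len(b) for b in branches]
    weights = softmax(scores)
    length = len(branches[0])
    fused = [sum(w * b[i] for w, b in zip(weights, branches)) for i in range(length)]
    return fused, weights


def fusion_params(channels, num_branches, reduction):
    """Illustrative weight counts for the two fusion strategies (no biases).

    concat: 1x1 conv projecting num_branches * channels back to channels.
    select: attention layers channels -> d -> num_branches * channels.
    """
    d = max(1, channels // reduction)
    concat = num_branches * channels * channels
    select = channels * d + d * num_branches * channels
    return concat, select


signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
branches = [dilated_conv1d(signal, [1, 1, 1], rate) for rate in (1, 2)]
fused, weights = dynamic_select(branches)
print(fused, weights)
print(fusion_params(channels=256, num_branches=4, reduction=16))
```

Because the fused output keeps the channel count of a single branch, no wide projection layer is needed after fusion. Under the assumptions above, that is where the saving in weights comes from.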

Original language	English
Article number	108120
Journal	Pattern Recognition
Volume	120
DOI	https://doi.org/10.1016/j.patcog.2021.108120
Publication status	Published - Dec 2021

Cite this

Zhao, Z., Zhao, S., & Shen, J. (2021). Real-time and light-weighted unsupervised video object segmentation network. Pattern Recognition, 120, Article 108120. https://doi.org/10.1016/j.patcog.2021.108120

@article{da1a438b561a46658ca16464dcf9578c,

title = "Real-time and light-weighted unsupervised video object segmentation network",

abstract = "Video object segmentation is one of the most practical computer vision tasks, especially in the unsupervised case, where no manually labeled segmentation mask is available at the beginning of a video sequence. In this paper, we propose a new real-time unsupervised video object segmentation network. Based on the encoder-decoder framework, we present a Dynamic ASPP module and an RNN-Conv module. The former adds a dynamic selection mechanism to the Atrous Spatial Pyramid Pooling (ASPP) structure, so that features from the dilated convolutional kernels are adaptively selected according to scale through a channel attention mechanism. Compared with directly concatenating the dilated convolutional features, dynamically selecting feature maps reduces the number of parameters and makes the module more efficient. The RNN-Conv module combines RNN units with external convolutional blocks, aggregating the temporal features of a video sequence with the spatial information extracted by the convolutional network. We stack this module to extract deeper spatiotemporal features than a traditional RNN. This module also helps to avoid vanishing and exploding gradients during network training. We test our network on popular video object segmentation datasets. The experimental results demonstrate the effectiveness of our model.",

keywords = "Salient object detection, Unsupervised video object segmentation",

author = "Zongji Zhao and Sanyuan Zhao and Jianbing Shen",

note = "Publisher Copyright: {\textcopyright} 2021",

year = "2021",

month = dec,

doi = "10.1016/j.patcog.2021.108120",

language = "English",

volume = "120",

journal = "Pattern Recognition",

issn = "0031-3203",

publisher = "Elsevier Ltd.",

}

TY - JOUR

T1 - Real-time and light-weighted unsupervised video object segmentation network

AU - Zhao, Zongji

AU - Zhao, Sanyuan

AU - Shen, Jianbing

PY - 2021/12

Y1 - 2021/12

AB - Video object segmentation is one of the most practical computer vision tasks, especially in the unsupervised case, where no manually labeled segmentation mask is available at the beginning of a video sequence. In this paper, we propose a new real-time unsupervised video object segmentation network. Based on the encoder-decoder framework, we present a Dynamic ASPP module and an RNN-Conv module. The former adds a dynamic selection mechanism to the Atrous Spatial Pyramid Pooling (ASPP) structure, so that features from the dilated convolutional kernels are adaptively selected according to scale through a channel attention mechanism. Compared with directly concatenating the dilated convolutional features, dynamically selecting feature maps reduces the number of parameters and makes the module more efficient. The RNN-Conv module combines RNN units with external convolutional blocks, aggregating the temporal features of a video sequence with the spatial information extracted by the convolutional network. We stack this module to extract deeper spatiotemporal features than a traditional RNN. This module also helps to avoid vanishing and exploding gradients during network training. We test our network on popular video object segmentation datasets. The experimental results demonstrate the effectiveness of our model.

KW - Salient object detection

KW - Unsupervised video object segmentation

UR - http://www.scopus.com/inward/record.url?scp=85109559343&partnerID=8YFLogxK

U2 - 10.1016/j.patcog.2021.108120

DO - 10.1016/j.patcog.2021.108120

M3 - Article

AN - SCOPUS:85109559343

SN - 0031-3203

VL - 120

JO - Pattern Recognition

JF - Pattern Recognition

M1 - 108120

ER -
