摘要
Video object segmentation is one of the most practical computer vision tasks, especially in the unsupervised case, which has no manually labeled segmentation mask at the beginning of a video sequence. In this paper, we propose a new real-time unsupervised video object segmentation network. Based on the encoder-decoder framework, we present a Dynamic ASPP module and a RNN-Conv module. The former adds a dynamic selection mechanism into the Astrous Spatial Pyramid Pooling structure, and then the dilated convolutional kernels adaptively select appropriate features according to the scales by the channel attention mechanism. Compared with directly concatenating the dilated convolutional features, dynamically selecting feature maps reduces the amount of parameters and makes the module more efficient. The RNN-Conv module incorporates the RNN units with external convolutional blocks, aggregating the temporal features of a video sequence with the spatial information extracted by the convolutional network. We stack this module to extract deeper spatiotemporal features than the traditional RNN network. This module helps to avoid the gradient disappearance and explosion during network training. We test our network on the popular video object segmentation datasets. The experiment results demonstrate the effectiveness of our model.1
源语言 | 英语 |
---|---|
文章编号 | 108120 |
期刊 | Pattern Recognition |
卷 | 120 |
DOI | |
出版状态 | 已出版 - 12月 2021 |