Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation

Liulei Li; Wenguan Wang; Tianfei Zhou; Jianwu Li; Yi Yang

doi:10.1109/CVPR52729.2023.01794

Liulei Li, Wenguan Wang^*, Tianfei Zhou, Jianwu Li, Yi Yang

^*Corresponding author for this work

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

10 Citations (Scopus)

Abstract

The objective of this paper is self-supervised learning of video object segmentation. We develop a unified framework which simultaneously models cross-frame dense correspondence for locally discriminative feature learning and embeds object-level context for target-mask decoding. As a result, it is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos, in contrast to previous efforts usually relying on an oblique solution - cheaply 'copying' labels according to pixel-wise correlations. Concretely, our algorithm alternates between i) clustering video pixels for creating pseudo segmentation labels ex nihilo; and ii) utilizing the pseudo labels to learn mask encoding and decoding for VOS. Unsupervised correspondence learning is further incorporated into this self-taught, mask embedding scheme, so as to ensure the generic nature of the learnt representation and avoid cluster degeneracy. Our algorithm sets state-of-the-arts on two standard benchmarks (i.e., DAVIS17 and YouTube-VOS), narrowing the gap between self- and fully-supervised VOS, in terms of both performance and network architecture design.
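The two ingredients the abstract contrasts, clustering pixels into pseudo labels and "copying" labels across frames by pixel-wise correlation, can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the function names `kmeans_pseudo_labels` and `propagate_labels`, the plain k-means clustering, the softmax temperature, and the 2-D synthetic "features" are all assumptions chosen for illustration, standing in for learned embeddings of real video frames.

```python
import numpy as np


def kmeans_pseudo_labels(features, k, iters=10, seed=0):
    """Cluster per-pixel features of shape (N, D) into k groups (plain k-means).

    Hypothetical stand-in for the pseudo-label step; returns (N,) integer labels.
    """
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct pixels (fancy indexing returns a copy).
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every pixel to every center: (N, k).
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = features[labels == c]
            if len(members) > 0:  # keep the old center if a cluster empties out
                centers[c] = members.mean(axis=0)
    return labels


def propagate_labels(ref_feats, ref_labels, qry_feats, k, temperature=0.07):
    """Copy labels from reference pixels to query pixels via feature affinity."""
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    qry = qry_feats / np.linalg.norm(qry_feats, axis=1, keepdims=True)
    logits = qry @ ref.T / temperature            # (Nq, Nr) scaled cosine similarity
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability for exp
    affinity = np.exp(logits)
    affinity /= affinity.sum(axis=1, keepdims=True)  # row-wise softmax
    one_hot = np.eye(k)[ref_labels]               # (Nr, k) one-hot reference labels
    return (affinity @ one_hot).argmax(axis=1)


# Toy data: two well-separated blobs play the role of "object" and "background".
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[5.0, 0.0], scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=[0.0, 5.0], scale=0.1, size=(50, 2))
frame1 = np.concatenate([blob_a, blob_b])
frame2 = frame1 + rng.normal(scale=0.05, size=frame1.shape)  # slightly perturbed frame

pseudo = kmeans_pseudo_labels(frame1, k=2)           # labels created without annotation
copied = propagate_labels(frame1, pseudo, frame2, k=2)  # labels copied to the next frame
```

In this sketch the second function is the correlation-based "copying" baseline that the abstract describes as an oblique solution; the paper's stated contribution is to use such pseudo labels to train mask encoding and decoding instead, which is not reproduced here.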

Original language	English
Title of host publication	Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
Publisher	IEEE Computer Society
Pages	18706-18716
Number of pages	11
ISBN (Electronic)	9798350301298
DOI	https://doi.org/10.1109/CVPR52729.2023.01794
Publication status	Published - 2023
Event	2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 - Vancouver, Canada. Duration: 18 Jun 2023 → 22 Jun 2023

Publication series

Name	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume	2023-June
ISSN (Print)	1063-6919

Conference

Conference	2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
Country/Territory	Canada
City	Vancouver
Period	18/06/23 → 22/06/23

Access to document

10.1109/CVPR52729.2023.01794

Cite this

Li, L., Wang, W., Zhou, T., Li, J., & Yang, Y. (2023). Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation. In Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 (pp. 18706-18716). (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2023-June). IEEE Computer Society. https://doi.org/10.1109/CVPR52729.2023.01794

Li, Liulei ; Wang, Wenguan ; Zhou, Tianfei et al. / Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation. Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023. IEEE Computer Society, 2023. pp. 18706-18716 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition).

@inproceedings{81099c2354f84237901b8de3e9170507,
  title = "Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation",
  abstract = "The objective of this paper is self-supervised learning of video object segmentation. We develop a unified framework which simultaneously models cross-frame dense correspondence for locally discriminative feature learning and embeds object-level context for target-mask decoding. As a result, it is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos, in contrast to previous efforts usually relying on an oblique solution - cheaply 'copying' labels according to pixel-wise correlations. Concretely, our algorithm alternates between i) clustering video pixels for creating pseudo segmentation labels ex nihilo; and ii) utilizing the pseudo labels to learn mask encoding and decoding for VOS. Unsupervised correspondence learning is further incorporated into this self-taught, mask embedding scheme, so as to ensure the generic nature of the learnt representation and avoid cluster degeneracy. Our algorithm sets state-of-the-arts on two standard benchmarks (i.e., DAVIS17 and YouTube-VOS), narrowing the gap between self- and fully-supervised VOS, in terms of both performance and network architecture design.",
  keywords = "Video: Low-level analysis, motion, and tracking",
  author = "Liulei Li and Wenguan Wang and Tianfei Zhou and Jianwu Li and Yi Yang",
  note = "Publisher Copyright: {\textcopyright} 2023 IEEE.; 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 ; Conference date: 18-06-2023 Through 22-06-2023",
  year = "2023",
  doi = "10.1109/CVPR52729.2023.01794",
  language = "English",
  series = "Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition",
  publisher = "IEEE Computer Society",
  pages = "18706--18716",
  booktitle = "Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023",
  address = "United States",
}

Li, L, Wang, W, Zhou, T, Li, J & Yang, Y 2023, Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation. in Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2023-June, IEEE Computer Society, pp. 18706-18716, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, Canada, 18/06/23. https://doi.org/10.1109/CVPR52729.2023.01794

Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation. / Li, Liulei; Wang, Wenguan; Zhou, Tianfei et al.
Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023. IEEE Computer Society, 2023. pp. 18706-18716 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2023-June).

TY - GEN
T1 - Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation
AU - Li, Liulei
AU - Wang, Wenguan
AU - Zhou, Tianfei
AU - Li, Jianwu
AU - Yang, Yi
PY - 2023
Y1 - 2023
AB - The objective of this paper is self-supervised learning of video object segmentation. We develop a unified framework which simultaneously models cross-frame dense correspondence for locally discriminative feature learning and embeds object-level context for target-mask decoding. As a result, it is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos, in contrast to previous efforts usually relying on an oblique solution - cheaply 'copying' labels according to pixel-wise correlations. Concretely, our algorithm alternates between i) clustering video pixels for creating pseudo segmentation labels ex nihilo; and ii) utilizing the pseudo labels to learn mask encoding and decoding for VOS. Unsupervised correspondence learning is further incorporated into this self-taught, mask embedding scheme, so as to ensure the generic nature of the learnt representation and avoid cluster degeneracy. Our algorithm sets state-of-the-arts on two standard benchmarks (i.e., DAVIS17 and YouTube-VOS), narrowing the gap between self- and fully-supervised VOS, in terms of both performance and network architecture design.
KW - Video: Low-level analysis, motion, and tracking
UR - http://www.scopus.com/inward/record.url?scp=85173872165&partnerID=8YFLogxK
U2 - 10.1109/CVPR52729.2023.01794
DO - 10.1109/CVPR52729.2023.01794
M3 - Conference contribution
AN - SCOPUS:85173872165
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 18706
EP - 18716
BT - Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
PB - IEEE Computer Society
T2 - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
Y2 - 18 June 2023 through 22 June 2023
ER -

Li L, Wang W, Zhou T, Li J, Yang Y. Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation. In Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023. IEEE Computer Society. 2023. p. 18706-18716. (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition). doi: 10.1109/CVPR52729.2023.01794
