Audio-Visual Segmentation by Leveraging Multi-scaled Features Learning

Sze An Peter Tan, Guangyu Gao*, Jia Zhao

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Audio-visual segmentation with semantics (AVSS) enriches audio-visual segmentation (AVS) by incorporating object classification, making it an even more challenging task: sound-producing objects must not only be accurately delineated based on audio-visual cues but also classified. To achieve successful audio-visual learning, a model must accurately extract pixel-wise semantic information from images and effectively connect audio and visual data. However, the robustness of such models is often hindered by the limited availability of comprehensive examples in publicly accessible datasets. In this paper, we introduce an audio-visual attentive feature fusion module that guides the visual segmentation process by injecting audio semantics; it is seamlessly integrated into a widely adopted U-Net-like model. Meanwhile, to enhance the model's ability to capture both high- and low-level features, we implement double-skip connections. Moreover, to exploit intra- and inter-frame correspondences, we also propose an ensemble model proficient in learning two distinct tasks: frame-level and video-level segmentation. To address the task's diverse demands, we introduce two model variants, one based on the ResNet architecture and the other on the Swin Transformer. Our approach leverages transfer learning and employs data augmentation techniques. Additionally, we introduce a custom regularization function aimed at enhancing the model's robustness against unseen data while simultaneously improving segmentation boundary confidence through self-supervision. Extensive experiments demonstrate the effectiveness of our method as well as the significance of cross-modal perception and dependency modeling for this task.
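The abstract describes an attentive fusion module that injects audio semantics into a U-Net-like visual segmentation pipeline. The paper's implementation is not reproduced here; the snippet below is a minimal, hypothetical PyTorch sketch of what such a cross-modal attentive fusion block could look like, assuming a per-frame audio embedding (e.g., from a pretrained audio encoder) modulates a visual feature map via cross-attention at one decoder stage. All class and parameter names are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of an audio-visual attentive fusion block (PyTorch).
# Assumed design, not the authors' implementation.
import torch
import torch.nn as nn


class AudioVisualAttentiveFusion(nn.Module):
    """Injects an audio embedding into a visual feature map via cross-attention."""

    def __init__(self, visual_dim: int, audio_dim: int, num_heads: int = 4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, visual_dim)
        self.attn = nn.MultiheadAttention(visual_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(visual_dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, C, H, W) feature map from one stage of a U-Net-like model
        # audio:  (B, D_a)     per-frame audio embedding
        b, c, h, w = visual.shape
        tokens = visual.flatten(2).transpose(1, 2)       # (B, H*W, C)
        audio_tok = self.audio_proj(audio).unsqueeze(1)  # (B, 1, C)
        # Visual tokens attend to the audio semantics; the residual path
        # preserves the original visual detail.
        fused, _ = self.attn(query=tokens, key=audio_tok, value=audio_tok)
        tokens = self.norm(tokens + fused)
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    block = AudioVisualAttentiveFusion(visual_dim=256, audio_dim=128)
    v = torch.randn(2, 256, 28, 28)
    a = torch.randn(2, 128)
    print(block(v, a).shape)  # torch.Size([2, 256, 28, 28])
```

In this reading, one such block would sit alongside each skip connection of the U-Net-like model, so audio semantics can steer the segmentation at multiple feature scales; the exact placement and the double-skip wiring follow the paper and are not specified here.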

Original language: English
Title of host publication: MultiMedia Modeling - 30th International Conference, MMM 2024, Proceedings
Editors: Stevan Rudinac, Marcel Worring, Cynthia Liem, Alan Hanjalic, Björn Þór Jónsson, Yoko Yamakata, Bei Liu
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 156-169
Number of pages: 14
ISBN (Print): 9783031533075
DOI
Publication status: Published - 2024
Event: 30th International Conference on MultiMedia Modeling, MMM 2024 - Amsterdam, Netherlands
Duration: 29 Jan 2024 - 2 Feb 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14555 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 30th International Conference on MultiMedia Modeling, MMM 2024
Country/Territory: Netherlands
City: Amsterdam
Period: 29/01/24 - 2/02/24
