TY - GEN
T1 - Audio-Visual Segmentation by Leveraging Multi-scaled Features Learning
AU - Tan, Sze An Peter
AU - Gao, Guangyu
AU - Zhao, Jia
N1 - Publisher Copyright:
© 2024, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2024
Y1 - 2024
AB - Audio-visual segmentation with semantics (AVSS) extends audio-visual segmentation (AVS) by incorporating object classification, making it a more challenging task: sound-producing objects must be accurately delineated and labeled based on audio-visual cues. To achieve successful audio-visual learning, a model must extract pixel-wise semantic information from images and effectively link it to the audio signal. However, the robustness of such models is often hindered by the limited number of comprehensive examples in publicly available datasets. In this paper, we introduce an audio-visual attentive feature fusion module that guides the visual segmentation process by injecting audio semantics. This module is seamlessly integrated into a widely adopted U-Net-like model. To enhance the model’s ability to capture both high- and low-level features, we implement double-skip connections. In addition, to exploit intra- and inter-frame correspondences, we propose an ensemble model that learns two distinct tasks: frame-level and video-level segmentation. To address the task’s diverse demands, we present two model variants, one based on the ResNet architecture and the other on the Swin Transformer. Our approach leverages transfer learning and data augmentation, and we introduce a custom regularization function that improves robustness to unseen data while sharpening segmentation boundary confidence through self-supervision. Extensive experiments demonstrate the effectiveness of our method and the importance of cross-modal perception and dependency modeling for this task.
KW - Audio-Visual Segmentation
KW - Double-Skip Connection
KW - Multi-Scale
KW - Self-supervision
UR - http://www.scopus.com/inward/record.url?scp=85185703981&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-53308-2_12
DO - 10.1007/978-3-031-53308-2_12
M3 - Conference contribution
AN - SCOPUS:85185703981
SN - 978-3-031-53307-5
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 156
EP - 169
BT - MultiMedia Modeling - 30th International Conference, MMM 2024, Proceedings
A2 - Rudinac, Stevan
A2 - Worring, Marcel
A2 - Liem, Cynthia
A2 - Hanjalic, Alan
A2 - Jónsson, Björn Þór
A2 - Yamakata, Yoko
A2 - Liu, Bei
PB - Springer Science and Business Media Deutschland GmbH
T2 - 30th International Conference on MultiMedia Modeling, MMM 2024
Y2 - 29 January 2024 through 2 February 2024
ER -