Audio-Visual Segmentation by Leveraging Multi-scaled Features Learning

Sze An Peter Tan, Guangyu Gao*, Jia Zhao

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Audio-visual segmentation with semantics (AVSS) extends audio-visual segmentation (AVS) by adding object classification, making it an even more challenging task: sound-producing objects must not only be accurately delineated from audio-visual cues but also labeled. Successful audio-visual learning requires a model to extract pixel-wise semantic information from images and to associate it effectively with the audio signal. However, the robustness of such models is often limited by the scarcity of comprehensive examples in publicly available datasets. In this paper, we introduce an audio-visual attentive feature fusion module that guides the visual segmentation process by injecting audio semantics; the module integrates seamlessly into a widely adopted U-Net-like model. To strengthen the model's ability to capture both high- and low-level features, we implement double-skip connections. Furthermore, to exploit intra- and inter-frame correspondences, we propose an ensemble model that learns two distinct tasks: frame-level and video-level segmentation. To address the task's diverse demands, we introduce two model variants, one based on the ResNet architecture and the other on the Swin Transformer. Our approach leverages transfer learning and data augmentation, and we introduce a custom regularization function that improves robustness to unseen data while sharpening segmentation-boundary confidence through self-supervision. Extensive experiments demonstrate the effectiveness of our method and the significance of cross-modal perception and dependency modeling for this task.
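The two architectural ideas named in the abstract, audio-attentive feature fusion and double-skip connections, can be illustrated with a short sketch. The PyTorch code below is a hypothetical reading of those ideas, not the authors' implementation: the class names (AudioVisualFusion, DoubleSkipDecoderBlock), the dimensions (audio_dim=128, etc.), and the interpretation of "double skip" as routing both the raw and the audio-fused encoder feature into each decoder stage are all illustrative assumptions; the audio input is taken to be a pooled clip-level embedding such as a VGGish feature.

```python
# Hypothetical sketch, NOT the paper's released code: an audio-attentive
# fusion step plus a U-Net-like decoder stage with a double-skip connection.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualFusion(nn.Module):
    """Modulate visual features with an audio embedding via channel attention."""

    def __init__(self, visual_channels: int, audio_dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(audio_dim, visual_channels),
            nn.Sigmoid(),  # per-channel attention weights in [0, 1]
        )

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, C, H, W); audio: (B, audio_dim)
        weights = self.gate(audio).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return visual * weights + visual  # residual keeps unattended content


class DoubleSkipDecoderBlock(nn.Module):
    """Decoder stage fed by two skip paths from the same encoder scale:
    the raw encoder feature and its audio-fused counterpart (an assumed
    reading of the paper's double-skip connections)."""

    def __init__(self, enc_channels: int, dec_channels: int,
                 out_channels: int, audio_dim: int):
        super().__init__()
        self.fuse_audio = AudioVisualFusion(enc_channels, audio_dim)
        self.conv = nn.Sequential(
            nn.Conv2d(2 * enc_channels + dec_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, enc_feat, dec_feat, audio):
        # Upsample the coarser decoder feature to this scale.
        dec_feat = F.interpolate(dec_feat, size=enc_feat.shape[-2:],
                                 mode="bilinear", align_corners=False)
        fused = self.fuse_audio(enc_feat, audio)  # inject audio semantics
        # Double skip: both raw and audio-fused encoder features reach the decoder.
        return self.conv(torch.cat([enc_feat, fused, dec_feat], dim=1))


if __name__ == "__main__":
    block = DoubleSkipDecoderBlock(enc_channels=256, dec_channels=512,
                                   out_channels=256, audio_dim=128)
    enc = torch.randn(2, 256, 28, 28)   # encoder skip at this scale
    dec = torch.randn(2, 512, 14, 14)   # previous (coarser) decoder output
    aud = torch.randn(2, 128)           # pooled audio embedding, e.g. VGGish
    print(block(enc, dec, aud).shape)   # torch.Size([2, 256, 28, 28])
```

In this sketch the audio embedding only re-weights visual channels rather than replacing them, so a silent or uninformative clip degrades gracefully to plain visual segmentation; the same block can back a ResNet or Swin Transformer encoder, matching the two variants described in the abstract.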

Original language: English
Title of host publication: MultiMedia Modeling - 30th International Conference, MMM 2024, Proceedings
Editors: Stevan Rudinac, Marcel Worring, Cynthia Liem, Alan Hanjalic, Björn Þór Jónsson, Yoko Yamakata, Bei Liu
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 156-169
Number of pages: 14
ISBN (Print): 9783031533075
DOIs
Publication status: Published - 2024
Event: 30th International Conference on MultiMedia Modeling, MMM 2024 - Amsterdam, Netherlands
Duration: 29 Jan 2024 – 2 Feb 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14555 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 30th International Conference on MultiMedia Modeling, MMM 2024
Country/Territory: Netherlands
City: Amsterdam
Period: 29/01/24 – 2/02/24

Keywords

  • Audio-Visual Segmentation
  • Double-Skip Connection
  • Multi-Scale
  • Self-supervision
