Audio-Visual Segmentation by Leveraging Multi-scaled Features Learning

Sze An Peter Tan, Guangyu Gao*, Jia Zhao

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Audio-visual segmentation with semantics (AVSS) extends audio-visual segmentation (AVS) by adding object classification, making it an even more challenging task: sound-producing objects must not only be accurately delineated from audio-visual cues but also labeled. Successful audio-visual learning requires a model to extract pixel-wise semantic information from images and to associate it effectively with the audio signal. However, the robustness of such models is often limited by the scarcity of comprehensive examples in publicly available datasets. In this paper, we introduce an audio-visual attentive feature fusion module that guides the visual segmentation process by injecting audio semantics; the module integrates seamlessly into a widely adopted U-Net-like model. To strengthen the model's ability to capture both high- and low-level features, we implement double-skip connections. Furthermore, to exploit intra- and inter-frame correspondences, we propose an ensemble model that learns two distinct tasks: frame-level and video-level segmentation. To address the task's diverse demands, we introduce two model variants, one based on the ResNet architecture and the other on the Swin Transformer. Our approach leverages transfer learning and data augmentation, and we introduce a custom regularization function that improves robustness to unseen data while sharpening segmentation-boundary confidence through self-supervision. Extensive experiments demonstrate the effectiveness of our method and the significance of cross-modal perception and dependency modeling for this task.
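The two architectural ideas named in the abstract, audio-attentive feature fusion and double-skip connections, can be illustrated with a short sketch. The PyTorch code below is a hypothetical reading of those ideas, not the authors' implementation: the class names (AudioVisualFusion, DoubleSkipDecoderBlock), the dimensions (audio_dim=128, etc.), and the interpretation of "double skip" as routing both the raw and the audio-fused encoder feature into each decoder stage are all illustrative assumptions; the audio input is taken to be a pooled clip-level embedding such as a VGGish feature.

```python
# Hypothetical sketch, NOT the paper's released code: an audio-attentive
# fusion step plus a U-Net-like decoder stage with a double-skip connection.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualFusion(nn.Module):
    """Modulate visual features with an audio embedding via channel attention."""

    def __init__(self, visual_channels: int, audio_dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(audio_dim, visual_channels),
            nn.Sigmoid(),  # per-channel attention weights in [0, 1]
        )

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, C, H, W); audio: (B, audio_dim)
        weights = self.gate(audio).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return visual * weights + visual  # residual keeps unattended content


class DoubleSkipDecoderBlock(nn.Module):
    """Decoder stage fed by two skip paths from the same encoder scale:
    the raw encoder feature and its audio-fused counterpart (an assumed
    reading of the paper's double-skip connections)."""

    def __init__(self, enc_channels: int, dec_channels: int,
                 out_channels: int, audio_dim: int):
        super().__init__()
        self.fuse_audio = AudioVisualFusion(enc_channels, audio_dim)
        self.conv = nn.Sequential(
            nn.Conv2d(2 * enc_channels + dec_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, enc_feat, dec_feat, audio):
        # Upsample the coarser decoder feature to this scale.
        dec_feat = F.interpolate(dec_feat, size=enc_feat.shape[-2:],
                                 mode="bilinear", align_corners=False)
        fused = self.fuse_audio(enc_feat, audio)  # inject audio semantics
        # Double skip: both raw and audio-fused encoder features reach the decoder.
        return self.conv(torch.cat([enc_feat, fused, dec_feat], dim=1))


if __name__ == "__main__":
    block = DoubleSkipDecoderBlock(enc_channels=256, dec_channels=512,
                                   out_channels=256, audio_dim=128)
    enc = torch.randn(2, 256, 28, 28)   # encoder skip at this scale
    dec = torch.randn(2, 512, 14, 14)   # previous (coarser) decoder output
    aud = torch.randn(2, 128)           # pooled audio embedding, e.g. VGGish
    print(block(enc, dec, aud).shape)   # torch.Size([2, 256, 28, 28])
```

In this sketch the audio embedding only re-weights visual channels rather than replacing them, so a silent or uninformative clip degrades gracefully to plain visual segmentation; the same block can back a ResNet or Swin Transformer encoder, matching the two variants described in the abstract.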

Original language: English
Title of host publication: MultiMedia Modeling - 30th International Conference, MMM 2024, Proceedings
Editors: Stevan Rudinac, Marcel Worring, Cynthia Liem, Alan Hanjalic, Björn Þór Jónsson, Yoko Yamakata, Bei Liu
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 156-169
Number of pages: 14
ISBN (Print): 9783031533075
DOIs
Publication status: Published - 2024
Event: 30th International Conference on MultiMedia Modeling, MMM 2024 - Amsterdam, Netherlands
Duration: 29 Jan 2024 – 2 Feb 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14555 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 30th International Conference on MultiMedia Modeling, MMM 2024
Country/Territory: Netherlands
City: Amsterdam
Period: 29/01/24 – 2/02/24

Keywords

  • Audio-Visual Segmentation
  • Double-Skip Connection
  • Multi-Scale
  • Self-supervision
