Enhance audio-visual segmentation with hierarchical encoder and audio guidance

Cunhan Guo, Heyan Huang*, Yanghao Zhou

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

As one of the pivotal technologies on the path towards embodied intelligence, audio-visual segmentation aims to precisely segment sounding objects, with broad application prospects in scenarios such as emergency rescue and natural exploration. Nevertheless, its performance is limited by challenges in adapting and fusing cross-modal information during encoding, as well as in decoding and mask generation. To address these issues, this paper explores the adaptation of multi-modal information on top of a shared encoder, employing a neural architecture search method to design a hierarchical encoder cooperation module for enhanced cross-modal interaction. An intermediate loss is further leveraged to help the encoder retain spatial knowledge. Finally, an audio-guided class-aware decoder is devised to steer the generation of masks. Our approach yields competitive experimental results across multiple datasets, substantiating its effectiveness.
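The paper's implementation is not reproduced on this page, but to make the audio-guided, class-aware decoding idea from the abstract concrete, here is a minimal PyTorch sketch. It assumes a query-based mask decoder in which learnable mask queries are conditioned on an audio embedding before attending to fused visual features; all names, dimensions, and design choices (AudioGuidedDecoder, dim, audio_dim, num_queries) are illustrative assumptions, not the authors' architecture.

    import torch
    import torch.nn as nn

    class AudioGuidedDecoder(nn.Module):
        # Hypothetical audio-guided class-aware decoder: learnable mask queries
        # are conditioned on an audio embedding, then attend to fused visual
        # features to produce per-query class logits and segmentation masks.
        def __init__(self, dim=256, audio_dim=128, num_queries=8, num_classes=1):
            super().__init__()
            self.queries = nn.Embedding(num_queries, dim)     # learnable mask queries
            self.audio_proj = nn.Linear(audio_dim, dim)       # map audio embedding into query space
            self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            self.cls_head = nn.Linear(dim, num_classes + 1)   # class logits (+1 for "no object")
            self.mask_embed = nn.Linear(dim, dim)             # per-query mask embedding

        def forward(self, visual_feats, audio_feat):
            # visual_feats: (B, C, H, W) fused encoder output; audio_feat: (B, audio_dim)
            B, C, H, W = visual_feats.shape
            q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)   # (B, Q, C)
            q = q + self.audio_proj(audio_feat).unsqueeze(1)         # audio guidance on the queries
            kv = visual_feats.flatten(2).transpose(1, 2)             # (B, H*W, C)
            q, _ = self.cross_attn(q, kv, kv)                        # queries attend to visual features
            cls_logits = self.cls_head(q)                            # (B, Q, num_classes + 1)
            masks = torch.einsum('bqc,bchw->bqhw', self.mask_embed(q), visual_feats)
            return cls_logits, masks

    # Smoke test with random tensors (shapes are illustrative only).
    decoder = AudioGuidedDecoder()
    logits, masks = decoder(torch.randn(2, 256, 28, 28), torch.randn(2, 128))
    print(logits.shape, masks.shape)  # torch.Size([2, 8, 2]) torch.Size([2, 8, 28, 28])

Under this reading, "class-aware" decoding means each audio-conditioned query predicts both a class distribution and a mask, so the audio signal influences which object category is segmented rather than only where the mask falls.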

Original language: English
Article number: 127885
Journal: Neurocomputing
Volume: 594
DOIs
Publication status: Published - 14 Aug 2024

Keywords

  • Audio guidance
  • Audio-visual segmentation
  • Hierarchical encoder
  • Neural architecture search
