Abstract
Audio-visual segmentation, a pivotal technology on the path to embodied intelligence, aims to precisely segment sounding objects and has broad application prospects in scenarios such as emergency rescue and natural exploration. Its performance, however, is limited by two challenges: adapting and fusing cross-modal information during encoding, and decoding that information into masks. To address these issues, this paper explores multi-modal adaptation on a shared encoder, using neural architecture search to design a hierarchical encoder cooperation module that enhances information interaction. An intermediate loss helps the encoder preserve spatial knowledge. Furthermore, an audio-guided class-aware decoder is devised to guide mask generation. The approach achieves competitive results on multiple datasets, supporting its effectiveness.
| Original language | English |
|---|---|
| Article number | 127885 |
| Journal | Neurocomputing |
| Volume | 594 |
| DOIs | |
| Publication status | Published - 14 Aug 2024 |
Keywords
- Audio guidance
- Audio-visual segmentation
- Hierarchical encoder
- Neural architecture search
Fingerprint
Dive into the research topics of 'Enhance audio-visual segmentation with hierarchical encoder and audio guidance'. Together they form a unique fingerprint.