Abstract
Reference audio-visual segmentation (Ref-AVS) aims to segment target objects in videos according to textual prompts, drawing on both audio and visual information. The task demands balanced use of multimodal data and a thorough understanding of temporal information. In this paper, we propose a new model, VoCa, built on the Mamba backbone. VoCa incorporates vote mechanisms, comprising individual consideration and group discussion, to balance the contributions of different modalities, thereby improving adaptability and accuracy in complex scenarios. In addition, an audio-visual cache module improves the model's ability to perceive temporal variations and dynamics across frames. Experimental results on the Ref-AVS benchmark show that VoCa surpasses existing methods across multiple metrics, demonstrating its effectiveness in handling complex multimodal information and temporal dependencies. We also applied VoCa to several other audio-visual collaborative tasks and achieved competitive results, indicating that the method generalizes beyond Ref-AVS.
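The abstract names the vote and cache ideas without giving their internals, so the following is only an illustrative sketch under stated assumptions, not VoCa's actual design. It assumes that "individual consideration" can be approximated by each modality scoring itself against a text query, that "group discussion" can be approximated by a softmax over those scores, and that the cache can be approximated by a fixed-length buffer of recent fused features. Every function and class name here is hypothetical, and the Mamba backbone is omitted entirely.

```python
import math
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def individual_votes(features, query):
    """Assumed 'individual consideration': each modality scores itself
    against the text query with a plain dot product."""
    return {
        name: sum(f * q for f, q in zip(vec, query))
        for name, vec in features.items()
    }


def group_discussion(votes):
    """Assumed 'group discussion': normalize the individual votes jointly
    so the modality weights sum to one."""
    names = list(votes)
    weights = softmax([votes[n] for n in names])
    return dict(zip(names, weights))


def fuse(features, weights):
    """Weighted sum of the per-modality feature vectors."""
    dim = len(next(iter(features.values())))
    return [
        sum(weights[name] * features[name][i] for name in features)
        for i in range(dim)
    ]


class AudioVisualCache:
    """Hypothetical cache: keeps the last `max_frames` fused features and
    exposes their mean as temporal context."""

    def __init__(self, max_frames=4):
        self.frames = deque(maxlen=max_frames)

    def update(self, fused):
        self.frames.append(list(fused))

    def context(self):
        if not self.frames:
            return None
        dim = len(self.frames[0])
        count = len(self.frames)
        return [sum(f[i] for f in self.frames) / count for i in range(dim)]


if __name__ == "__main__":
    # Toy two-dimensional features; the values are made up for illustration.
    features = {"audio": [1.0, 0.0], "visual": [0.0, 1.0]}
    query = [0.0, 2.0]
    weights = group_discussion(individual_votes(features, query))
    cache = AudioVisualCache(max_frames=2)
    cache.update(fuse(features, weights))
    print(weights, cache.context())
```

In this toy setup the query aligns with the visual feature, so the visual modality receives the larger weight. A learned model would replace the dot-product scoring and the mean-pooled cache with trained components.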
| Original language | English |
|---|---|
| Article number | 129371 |
| Journal | Expert Systems with Applications |
| Volume | 297 |
| Publication status | Published - 1 Feb 2026 |
Keywords
- Computer vision
- Deep learning
- Mamba
- Reference audio-visual segmentation
Fingerprint
Dive into the research topics of 'Leveraging mamba for reference audio-visual segmentation with vote and cache mechanism'. Together they form a unique fingerprint.