Leveraging mamba for reference audio-visual segmentation with vote and cache mechanism

  • Cunhan Guo
  • Heyan Huang*
  • Yang Hao Zhou
  • Changsen Yuan
  • Danjie Han

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Reference audio-visual segmentation (Ref-AVS) aims to segment target objects in videos based on textual prompts, leveraging an understanding of both audio and visual information. This task imposes stringent requirements on the balanced utilization of multimodal data and the comprehensive understanding of temporal information. In this paper, we propose a new model, VoCa, built upon the Mamba backbone. VoCa incorporates a vote mechanism, comprising individual consideration and group discussion, to balance the contributions of different modalities, thereby enhancing adaptability and accuracy in complex scenarios. Furthermore, an audio-visual cache module is introduced to improve the model's ability to perceive temporal variations and dynamics across frames. Experimental results on the Ref-AVS benchmark demonstrate that VoCa surpasses existing methods across multiple metrics, showcasing its effectiveness in handling complex multimodal information and temporal dependencies. We applied VoCa to several other audio-visual collaborative tasks and achieved competitive results, demonstrating the generalization of our method.
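The abstract's two ideas can be illustrated with a minimal sketch. This is a hypothetical toy, not the paper's implementation: the per-modality confidence score, the softmax-weighted fusion standing in for "individual consideration" and "group discussion", and the fixed-size frame cache are all illustrative assumptions.

```python
import numpy as np
from collections import deque


def modality_vote(features):
    """Toy vote mechanism: each modality first scores itself
    ('individual consideration'), then the softmax-normalized scores
    weight a joint fusion ('group discussion').
    The mean-based confidence score is purely illustrative."""
    scores = np.array([f.mean() for f in features])
    weights = np.exp(scores - scores.max())  # stable softmax
    weights /= weights.sum()
    fused = sum(w * f for w, f in zip(weights, features))
    return fused, weights


class AudioVisualCache:
    """Toy audio-visual cache: keeps fused features of the last
    `size` frames and summarizes them as a temporal context."""

    def __init__(self, size=4):
        self.buf = deque(maxlen=size)  # oldest frame is evicted first

    def update(self, fused_feat):
        self.buf.append(fused_feat)

    def temporal_context(self):
        # Simple temporal summary: average over cached frames.
        return np.mean(np.stack(list(self.buf)), axis=0) if self.buf else None
```

Usage: per frame, fuse the audio, visual, and text features with `modality_vote`, push the result into the cache, and condition the segmentation head on `temporal_context()` to expose cross-frame dynamics.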

Original language: English
Article number: 129371
Journal: Expert Systems with Applications
Volume: 297
Publication status: Published - 1 Feb 2026

Keywords

  • Computer vision
  • Deep learning
  • Mamba
  • Reference audio-visual segmentation
