
Leveraging mamba for reference audio-visual segmentation with vote and cache mechanism

  • Cunhan Guo
  • Heyan Huang*
  • Yang Hao Zhou
  • Changsen Yuan
  • Danjie Han
  • *Corresponding author for this work
  • Beijing Institute of Technology
  • University of Chinese Academy of Sciences
  • Nanjing University of Science and Technology

Research output: Contribution to journal, article, peer-reviewed

Abstract

Reference audio-visual segmentation (Ref-AVS) aims to segment target objects in videos based on textual prompts, leveraging an understanding of both audio and visual information. This task imposes stringent requirements for the balanced utilization of multimodal data and a comprehensive understanding of temporal information. In this paper, we propose a new model, VoCa, built upon the Mamba backbone. VoCa incorporates vote mechanisms, including individual consideration and group discussion, to balance the contributions of different modalities, thereby enhancing adaptability and accuracy in complex scenarios. Furthermore, an audio-visual cache module is introduced to improve the model's ability to perceive temporal variations and dynamics across frames. Experimental results on the Ref-AVS benchmark demonstrate that VoCa surpasses existing methods across multiple metrics, showcasing its effectiveness in handling complex multimodal information and temporal dependencies. We also applied VoCa to several other audio-visual collaborative tasks and achieved competitive results, demonstrating the generalizability of our method.

Original language: English
Article number: 129371
Journal: Expert Systems with Applications
Volume: 297
DOI
Publication status: Published - 1 Feb 2026
