TY - JOUR
T1 - Leveraging Mamba for reference audio-visual segmentation with vote and cache mechanism
AU - Guo, Cunhan
AU - Huang, Heyan
AU - Zhou, Yang Hao
AU - Yuan, Changsen
AU - Han, Danjie
N1 - Publisher Copyright:
© 2025 Elsevier Ltd
PY - 2026/2/1
Y1 - 2026/2/1
N2 - Reference audio-visual segmentation (Ref-AVS) aims to segment target objects in videos based on textual prompts, leveraging an understanding of both audio and visual information. This task imposes stringent requirements on the balanced utilization of multimodal data and a comprehensive understanding of temporal information. In this paper, we propose a new model, VoCa, built upon the Mamba backbone. VoCa incorporates vote mechanisms, including individual consideration and group discussion, to balance the contributions of different modalities, thereby enhancing adaptability and accuracy in complex scenarios. Furthermore, an audio-visual cache module is introduced to improve the model's ability to perceive temporal variations and dynamics across frames. Experimental results on the Ref-AVS benchmark demonstrate that VoCa surpasses existing methods across multiple metrics, showcasing its effectiveness in handling complex multimodal information and temporal dependencies. We also applied VoCa to several other audio-visual collaborative tasks and achieved competitive results, demonstrating the generalizability of our method.
AB - Reference audio-visual segmentation (Ref-AVS) aims to segment target objects in videos based on textual prompts, leveraging an understanding of both audio and visual information. This task imposes stringent requirements on the balanced utilization of multimodal data and a comprehensive understanding of temporal information. In this paper, we propose a new model, VoCa, built upon the Mamba backbone. VoCa incorporates vote mechanisms, including individual consideration and group discussion, to balance the contributions of different modalities, thereby enhancing adaptability and accuracy in complex scenarios. Furthermore, an audio-visual cache module is introduced to improve the model's ability to perceive temporal variations and dynamics across frames. Experimental results on the Ref-AVS benchmark demonstrate that VoCa surpasses existing methods across multiple metrics, showcasing its effectiveness in handling complex multimodal information and temporal dependencies. We also applied VoCa to several other audio-visual collaborative tasks and achieved competitive results, demonstrating the generalizability of our method.
KW - Computer vision
KW - Deep learning
KW - Mamba
KW - Reference audio-visual segmentation
UR - https://www.scopus.com/pages/publications/105013566838
U2 - 10.1016/j.eswa.2025.129371
DO - 10.1016/j.eswa.2025.129371
M3 - Article
AN - SCOPUS:105013566838
SN - 0957-4174
VL - 297
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 129371
ER -