VISUALLY GUIDED BINAURAL AUDIO GENERATION WITH CROSS-MODAL CONSISTENCY

Miao Liu, Jing Wang, Xinyuan Qian, Xiang Xie

Research output: Contribution to journal › Conference article › peer-review


Abstract

Binaural audio delivers an immersive spatial auditory experience to human listeners, but most existing videos lack binaural audio because of the expertise required to record it. Recent studies have therefore focused on converting monaural audio into binaural audio conditioned on the visual input. In this paper, we propose a novel audio-visual spatialization network with two additional audio decoders, which rely on carefully designed visual features to generate the audio outputs for the left and right channels, respectively. In addition, we propose an audio-visual matching loss to further exploit the correlation between the binaural audio and the visual input of the scene. Experimental results show that the proposed method outperforms several state-of-the-art binaural audio generation methods on two benchmark datasets, FAIR-Play and MUSIC-Stereo. Qualitative results further demonstrate the effectiveness of the proposed method.
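To make the two-decoder idea concrete, below is a minimal PyTorch sketch of a mono-to-binaural spatialization network with a shared audio encoder, a visual conditioning branch, and separate left/right decoders. All module names, layer sizes, and the multiplicative fusion scheme are illustrative assumptions; the paper's actual architecture may differ.

```python
# A minimal sketch of the two-decoder spatialization idea from the
# abstract. Layer sizes and the fusion scheme are assumptions.
import torch
import torch.nn as nn

class TwoDecoderSpatializer(nn.Module):
    """Mono spectrogram in; separate left/right channel estimates out."""

    def __init__(self, audio_ch=2, visual_dim=512, hidden=256):
        super().__init__()
        # Shared encoder over the mono complex spectrogram (real+imag).
        self.encoder = nn.Sequential(
            nn.Conv2d(audio_ch, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, hidden, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Project the visual feature so it can modulate audio feature maps.
        self.visual_proj = nn.Linear(visual_dim, hidden)
        # One decoder per output channel, as the abstract describes.
        self.decoder_left = self._make_decoder(hidden, audio_ch)
        self.decoder_right = self._make_decoder(hidden, audio_ch)

    @staticmethod
    def _make_decoder(hidden, out_ch):
        return nn.Sequential(
            nn.ConvTranspose2d(hidden, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, out_ch, 4, stride=2, padding=1),
        )

    def forward(self, mono_spec, visual_feat):
        # mono_spec: (B, 2, F, T) real/imag; visual_feat: (B, visual_dim)
        a = self.encoder(mono_spec)
        v = self.visual_proj(visual_feat)[:, :, None, None]
        fused = a * v  # simple multiplicative conditioning (assumed)
        return self.decoder_left(fused), self.decoder_right(fused)

# Usage: left, right = TwoDecoderSpatializer()(spec, vis_feat)
# with spec of shape (B, 2, F, T) and vis_feat of shape (B, 512).
```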
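The abstract's audio-visual matching loss could plausibly take the form of a contrastive objective that pulls embeddings of the generated binaural audio toward embeddings of the corresponding video. The sketch below shows one such InfoNCE-style formulation; the choice of cosine similarity and symmetric cross-entropy is an assumption, not the paper's confirmed definition.

```python
# A hypothetical audio-visual matching loss: matched audio/video pairs
# share a batch index; mismatched pairs serve as negatives.
import torch
import torch.nn.functional as F

def audio_visual_matching_loss(audio_emb, visual_emb):
    """audio_emb, visual_emb: (B, D) embeddings from (assumed)
    audio and visual encoders."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t()  # (B, B) pairwise cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric cross-entropy: each audio should match its own video
    # and vice versa (contrastive, InfoNCE-style objective).
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```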

Original language: English
Pages (from-to): 7980-7984
Number of pages: 5
Journal: Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing
DOIs
Publication status: Published - 2024
Event: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Republic of Korea
Duration: 14 Apr 2024 – 19 Apr 2024

Keywords

  • audio-visual learning
  • binaural audio generation
  • cross-modal consistency
