Abstract
Binaural audio delivers an immersive spatial auditory experience to human listeners, but most existing videos lack binaural audio because of the specialized recording environments and expertise it requires. Recent studies have therefore focused on converting monaural audio into binaural audio conditioned on the visual input. In this paper, we propose a novel audio-visual spatialization network with two dedicated audio decoders, which rely on carefully designed visual features to generate the left- and right-channel outputs, respectively. In addition, we propose an audio-visual matching loss to further exploit the correlation between binaural audio and the visual input of the scene. Experimental results show that the proposed method outperforms several state-of-the-art binaural audio generation methods on two benchmark datasets, FAIR-Play and MUSIC-Stereo. Qualitative results further demonstrate the effectiveness of the proposed method.
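The record above gives no implementation details, so the following is a minimal PyTorch-style sketch of the two ideas named in the abstract: a shared audio encoder feeding two per-channel decoders fused with a visual feature, and an audio-visual matching loss. Every module name, layer size, the gated-fusion scheme, and the cosine-based form of the loss are assumptions made for illustration; the authors' actual architecture may differ.

```python
# Illustrative sketch only: module names, dimensions, fusion, and the
# cosine-based matching loss are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoDecoderSpatializer(nn.Module):
    """Shared audio encoder + separate left/right decoders, conditioned on visual features."""

    def __init__(self, spec_ch: int = 1, vis_dim: int = 512, hid: int = 64):
        super().__init__()
        # Shared encoder over the mono (mixture) magnitude spectrogram.
        self.encoder = nn.Sequential(
            nn.Conv2d(spec_ch, hid, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hid, hid * 2, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Project the visual feature so it can gate the audio bottleneck.
        self.vis_proj = nn.Linear(vis_dim, hid * 2)
        # One decoder per output channel (left / right).
        self.dec_left = self._make_decoder(hid)
        self.dec_right = self._make_decoder(hid)

    @staticmethod
    def _make_decoder(hid: int) -> nn.Module:
        return nn.Sequential(
            nn.ConvTranspose2d(hid * 2, hid, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hid, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, mono_spec: torch.Tensor, vis_feat: torch.Tensor):
        # mono_spec: (B, 1, F, T) magnitude spectrogram of the mono mix
        # vis_feat:  (B, vis_dim) pooled visual feature for the video clip
        a = self.encoder(mono_spec)
        v = self.vis_proj(vis_feat)[:, :, None, None]  # broadcast over (F, T)
        fused = a * torch.sigmoid(v)                   # simple gated fusion
        # Each decoder predicts a ratio mask applied to the mono input.
        left = self.dec_left(fused) * mono_spec
        right = self.dec_right(fused) * mono_spec
        return left, right


def matching_loss(audio_emb: torch.Tensor, vis_emb: torch.Tensor) -> torch.Tensor:
    """One plausible audio-visual matching loss: pull paired embeddings together."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    vis_emb = F.normalize(vis_emb, dim=-1)
    return (1.0 - (audio_emb * vis_emb).sum(dim=-1)).mean()


if __name__ == "__main__":
    net = TwoDecoderSpatializer()
    mono = torch.randn(2, 1, 256, 64).abs()  # F and T divisible by 4
    vis = torch.randn(2, 512)
    left, right = net(mono, vis)             # each (2, 1, 256, 64)
    print(left.shape, right.shape)
```

In practice such a matching loss would be weighted against a spectrogram reconstruction loss on the two channels; the weighting and the embedding networks are likewise unspecified here.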
| Field | Value |
| --- | --- |
| Original language | English |
| Pages (from-to) | 7980-7984 |
| Number of pages | 5 |
| Journal | Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing |
| Publication status | Published - 2024 |
| Event | 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024), Seoul, Republic of Korea, 14-19 Apr 2024 |
Keywords
- audio-visual learning
- binaural audio generation
- cross-modal consistency