MTDiffusion: Multi-Task Diffusion Model With Dual-Unet for Foley Sound Generation

Anbin Qi, Xiang Xie*, Jing Wang

*Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review

1 Citation (Scopus)

Abstract

A common approach in audio generation is to quantize the latent and then use a diffusion model to estimate either the noise or the data from the corrupted data in order to generate the quantized latent. Unlike this approach, we consider the targets estimated by the diffusion model to include both noise and data, rather than just one of them. Based on this idea and on multi-task learning methods, we design Dual-Unet, a network obtained by a simple modification of U-net that estimates noise and data simultaneously. Combining Dual-Unet with Variational AutoEncoders and a Residual Vector Quantizer, we propose the multi-task diffusion model (MTDiffusion), which generates foley sound audio for a given label. We validate the proposed model on the DCASE task 7B dataset. The experimental results show the effectiveness of the proposed model: both the subjective and the objective metrics of the generated audio significantly exceed those of the baseline and of the first-place system in 2023 DCASE task 7B.
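The idea of estimating noise and data together can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a standard DDPM-style forward process and an equally weighted sum of two mean-squared-error terms, neither of which is stated in the abstract. The function names, the `data_weight` parameter, and the stand-in predictions are all hypothetical; in the actual model the two estimates would come from the Dual-Unet, conditioned on the time step and the sound label.

```python
import math
import random


def forward_diffuse(x0, noise, alpha_bar):
    """Corrupt a clean latent: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * n for x, n in zip(x0, noise)]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def multitask_loss(pred_noise, pred_data, noise, x0, data_weight=1.0):
    """Assumed multi-task objective: noise-estimation MSE plus weighted data-estimation MSE."""
    return mse(pred_noise, noise) + data_weight * mse(pred_data, x0)


def data_from_noise(x_t, pred_noise, alpha_bar):
    """Data estimate implied by a noise estimate, obtained by inverting forward_diffuse."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(xt - b * n) / a for xt, n in zip(x_t, pred_noise)]


rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(8)]     # toy "clean latent"
noise = [rng.gauss(0.0, 1.0) for _ in range(8)]  # Gaussian corruption
x_t = forward_diffuse(x0, noise, alpha_bar=0.5)

# Stand-in predictions (hypothetical); a two-headed network would produce these from x_t.
pred_noise = [0.9 * n for n in noise]
pred_data = [0.9 * x for x in x0]
loss = multitask_loss(pred_noise, pred_data, noise, x0)
```

Because the corrupted latent is a fixed linear mix of data and noise, either target determines the other once `x_t` and the noise level are known, which is what `data_from_noise` shows. A two-output network in this sketch would therefore be supervised on two views of the same quantity rather than on unrelated tasks.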

Original language: English
Pages (from-to): 486-490
Number of pages: 5
Journal: Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing
DOIs
Publication status: Published - 2024
Externally published: Yes
Event: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Seoul, Korea, Republic of
Duration: 14 Apr 2024 to 19 Apr 2024

Keywords

  • diffusion
  • foley sound generation
  • Multi-task
  • U-net
