Abstract
A common approach to audio generation is to quantize the latent representation and then use a diffusion model to estimate either the noise or the data from the corrupted input, in order to generate the quantized latent. In contrast, we treat both the noise and the data as targets of the diffusion model, rather than only one of them. Based on this idea and on multi-task learning, we design Dual-Unet, a simple modification of U-net that estimates noise and data simultaneously. Combining Dual-Unet with Variational AutoEncoders with a Residual Vector Quantizer, we propose the Multitask diffusion model (MTDiffusion), which generates foley sound audio conditioned on a given label. We validate the proposed model on the DCASE Task 7B dataset. The experimental results demonstrate its effectiveness: on both subjective and objective metrics, the generated audio significantly exceeds the baseline and the first-place system of the 2023 DCASE Task 7B.
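As an illustration of the dual-target idea, the following is a minimal sketch of a training objective that penalizes errors in both the predicted noise and the predicted data. It is not the authors' implementation: the abstract does not specify the loss, the task weighting, or the noise schedule, so the standard forward process `x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps`, the summed MSE loss, and the names `forward_diffuse`, `dual_target_loss`, `oracle_dual_head`, and `data_weight` are all assumptions made for this example.

```python
import math
import random


def forward_diffuse(x0, alpha_bar, rng):
    """Corrupt a clean latent: x_t = sqrt(alpha_bar)*x0 + sqrt(1-alpha_bar)*eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x_t = [a * x + b * e for x, e in zip(x0, eps)]
    return x_t, eps


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def dual_target_loss(eps_hat, x0_hat, eps, x0, data_weight=1.0):
    """Assumed multi-task objective: noise MSE plus weighted data MSE."""
    return mse(eps_hat, eps) + data_weight * mse(x0_hat, x0)


def oracle_dual_head(x_t, eps, alpha_bar):
    """Hypothetical stand-in for a two-headed network.

    It is handed the true noise, so it only serves to check the loss:
    the data estimate is recovered by inverting the forward process.
    """
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x0_hat = [(xt - b * e) / a for xt, e in zip(x_t, eps)]
    return list(eps), x0_hat


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0, 0.0]   # toy "latent" vector (assumed values)
alpha_bar = 0.7              # assumed cumulative signal level at step t

x_t, eps = forward_diffuse(x0, alpha_bar, rng)

# A perfect pair of predictions drives both loss terms to (numerically) zero.
eps_hat, x0_hat = oracle_dual_head(x_t, eps, alpha_bar)
loss_oracle = dual_target_loss(eps_hat, x0_hat, eps, x0)

# An all-zero prediction is penalized on both the noise and the data term.
loss_zero = dual_target_loss([0.0] * 4, [0.0] * 4, eps, x0)
```

In a real system the oracle would be replaced by a network with two output heads that sees only `x_t`, the timestep, and the class label; how the two loss terms are balanced is a design choice the abstract leaves open.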
Original language | English |
---|---|
Pages (from-to) | 486-490 |
Number of pages | 5 |
Journal | Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing |
DOIs | |
Publication status | Published - 2024 |
Externally published | Yes |
Event | 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Seoul, Republic of Korea. Duration: 14 Apr 2024 → 19 Apr 2024 |
Keywords
- diffusion
- foley sound generation
- Multi-task
- U-net