SR-PFGM++ Based Consistency Model for Speech Enhancement

Xiao Cao, Shenghui Zhao*, Yajing Hu, Jing Wang

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Diffusion models and the extended Poisson flow generative model (PFGM++) have been applied to speech enhancement. They are sampled via a stochastic differential equation (SDE) or an ordinary differential equation (ODE), but usually require a large number of sampling steps. Hence, we introduce consistency models, which allow for high-quality one-step generation with non-adversarial training. Specifically, building on our previous work, SR-PFGM++ (PFGM++ combined with stochastic regeneration) is distilled to train a consistency model, resulting in the proposed consistency model for speech enhancement. Test results on the VoiceBank-DEMAND dataset show that the proposed model significantly reduces inference time relative to SR-PFGM++ while maintaining comparable performance. Moreover, mismatched-condition test results on the TIMIT+NOISEX-92 dataset demonstrate the generalization ability of the proposed model.
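The one-step generation the abstract refers to rests on the consistency-model parameterization, which forces the denoiser to act as the identity at the smallest noise level so a single network evaluation can map noise directly to a clean sample. The sketch below illustrates that boundary condition in isolation; the constants `SIGMA_MIN` and `SIGMA_DATA`, and the stand-in `dummy_net`, are illustrative assumptions, not values taken from this paper.

```python
import numpy as np

# Skip/output scalings commonly used by consistency models to enforce
# the boundary condition f(x, sigma_min) = x. Constants are illustrative.
SIGMA_MIN, SIGMA_DATA = 0.002, 0.5

def c_skip(sigma):
    # Equals 1 exactly at sigma = SIGMA_MIN.
    return SIGMA_DATA**2 / ((sigma - SIGMA_MIN)**2 + SIGMA_DATA**2)

def c_out(sigma):
    # Equals 0 exactly at sigma = SIGMA_MIN.
    return SIGMA_DATA * (sigma - SIGMA_MIN) / np.sqrt(sigma**2 + SIGMA_DATA**2)

def consistency_fn(network, x, sigma):
    """One-step denoiser: skip connection plus scaled network output."""
    return c_skip(sigma) * x + c_out(sigma) * network(x, sigma)

# At sigma = SIGMA_MIN the parameterization reduces to the identity,
# regardless of the network, so sampling needs only one evaluation
# starting from the largest noise level.
dummy_net = lambda x, s: np.tanh(x)  # stand-in for a trained model
x = np.random.default_rng(0).normal(size=4)
assert np.allclose(consistency_fn(dummy_net, x, SIGMA_MIN), x)
```

Distillation then trains the network so that `consistency_fn` gives matching outputs at adjacent noise levels along the same ODE trajectory, which is what lets the distilled SR-PFGM++ model skip the long SDE/ODE sampling chain.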

Original language: English
Title of host publication: IEEE International Conference on Signal, Information and Data Processing, ICSIDP 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798331515669
DOIs
Publication status: Published - 2024
Event: 2nd IEEE International Conference on Signal, Information and Data Processing, ICSIDP 2024 - Zhuhai, China
Duration: 22 Nov 2024 - 24 Nov 2024

Publication series

Name: IEEE International Conference on Signal, Information and Data Processing, ICSIDP 2024

Conference

Conference: 2nd IEEE International Conference on Signal, Information and Data Processing, ICSIDP 2024
Country/Territory: China
City: Zhuhai
Period: 22/11/24 - 24/11/24

Keywords

  • consistency distillation
  • consistency models
  • PFGM++
  • speech enhancement
  • stochastic regeneration
