TY - JOUR
T1 - A multi-objective learning speech enhancement algorithm based on IRM post-processing with joint estimation of SCNN and TCNN
AU - Li, Ruwei
AU - Sun, Xiaoyue
AU - Li, Tao
AU - Zhao, Fengnian
N1 - Publisher Copyright:
© 2020 Elsevier Inc.
PY - 2020/6
Y1 - 2020/6
N2 - In this study, a novel multi-objective speech enhancement algorithm is proposed. First, we construct a deep learning architecture based on a stacked and temporal convolutional neural network (STCNN). Second, the main log-power spectra (LPS) features are fed into a stacked convolutional neural network (SCNN) to extract high-level abstract features. Third, an improved power function compression Mel-frequency cepstral coefficient (PC-MFCC) feature, which is more consistent with human auditory characteristics than the conventional Mel-frequency cepstral coefficient (MFCC), is proposed. Then, a temporal convolutional neural network (TCNN) takes the PC-MFCC and the features learned by the SCNN as input and separately predicts the clean LPS, the clean PC-MFCC and the ideal ratio mask (IRM). In the training phase, the PC-MFCC constrains the LPS and IRM through the loss function to obtain the optimal network structure. Finally, IRM-based post-processing is applied to the estimated clean LPS and IRM: the weighting between them is adjusted according to voice presence information to synthesise the enhanced speech. A series of experiments shows that the PC-MFCC feature is effective and complementary to LPS in speech enhancement tasks. The proposed STCNN architecture achieves higher speech enhancement performance than the comparison neural network models owing to its good feature extraction and sequence modelling capabilities. In addition, the IRM-based post-processing further improves the listening quality of the reconstructed speech. Compared with the competing algorithms, the proposed multi-objective speech enhancement algorithm further improves the speech quality and intelligibility of the enhanced speech.
KW - Deep learning
KW - Multi-objective learning
KW - Post-processing
KW - STCNN
KW - Speech enhancement
UR - http://www.scopus.com/inward/record.url?scp=85082654446&partnerID=8YFLogxK
U2 - 10.1016/j.dsp.2020.102731
DO - 10.1016/j.dsp.2020.102731
M3 - Article
AN - SCOPUS:85082654446
SN - 1051-2004
VL - 101
JO - Digital Signal Processing: A Review Journal
JF - Digital Signal Processing: A Review Journal
M1 - 102731
ER -