TY - JOUR
T1 - A multi-objective learning speech enhancement algorithm based on IRM post-processing with joint estimation of SCNN and TCNN
AU - Li, Ruwei
AU - Sun, Xiaoyue
AU - Li, Tao
AU - Zhao, Fengnian
N1 - Publisher Copyright:
© 2020 Elsevier Inc.
PY - 2020/6
Y1 - 2020/6
N2 - In this study, a novel multi-objective speech enhancement algorithm is proposed. First, we construct a deep learning architecture based on a stacked and temporal convolutional neural network (STCNN). Second, the main log-power spectra (LPS) features are fed into a stacked convolutional neural network (SCNN) to extract high-level abstract features. Third, an improved power function compression Mel-frequency cepstral coefficient (PC-MFCC) feature, which is more consistent with human auditory characteristics than the conventional Mel-frequency cepstral coefficient (MFCC), is proposed. Then, a temporal convolutional neural network (TCNN) takes the PC-MFCC and the features learned by the SCNN as input and separately predicts the clean LPS, the clean PC-MFCC and the ideal ratio mask (IRM). In the training phase, the PC-MFCC constrains the LPS and IRM through the loss function to obtain the optimal network structure. Finally, IRM-based post-processing is applied to the estimated clean LPS and IRM: the weighting between them is adjusted according to voice presence information to synthesise the enhanced speech. A series of experiments shows that the PC-MFCC feature is effective and complementary to LPS in speech enhancement tasks. The proposed STCNN architecture achieves higher speech enhancement performance than the comparison neural network models owing to its good feature extraction and sequence modelling capabilities. In addition, the IRM-based post-processing further improves the listening quality of the reconstructed speech. Compared with the competing algorithms, the proposed multi-objective speech enhancement algorithm further improves the speech quality and intelligibility of the enhanced speech.
KW - Deep learning
KW - Multi-objective learning
KW - Post-processing
KW - STCNN
KW - Speech enhancement
UR - http://www.scopus.com/inward/record.url?scp=85082654446&partnerID=8YFLogxK
U2 - 10.1016/j.dsp.2020.102731
DO - 10.1016/j.dsp.2020.102731
M3 - Article
AN - SCOPUS:85082654446
SN - 1051-2004
VL - 101
JO - Digital Signal Processing: A Review Journal
JF - Digital Signal Processing: A Review Journal
M1 - 102731
ER -