Neural network-based non-intrusive speech quality assessment using attention pooling function

Miao Liu; Jing Wang; Weiming Yi; Fang Liu

doi:10.1186/s13636-021-00209-4

Corresponding author: Jing Wang

Beijing Institute of Technology

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

Non-intrusive speech quality assessment has recently attracted considerable attention because it does not require the original reference signals. At the same time, neural networks have begun to be applied to speech quality assessment and have achieved good performance. To improve the performance of non-intrusive speech quality assessment, this paper proposes a neural network-based assessment method using an attention pooling function. The proposed systems are based on convolutional neural network (CNN), bidirectional long short-term memory (BLSTM), and CNN-LSTM structures. Comparing four types of pooling functions both theoretically and experimentally, we find that the attention pooling function performs best. Experiments are conducted on a dataset containing various degraded speech signals with corresponding subjective quality scores. The results show that the proposed CNN-LSTM model using the attention pooling function achieves a state-of-the-art correlation coefficient (R) of 0.967 and root-mean-square error (RMSE) of 0.269, outperforming the ITU-T P.563 standard and the autoencoder-support vector regression method.
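As a rough illustration of the ideas named in the abstract, the sketch below shows one common form of attention pooling (a softmax-weighted average of frame-level scores) together with the two reported metrics, the Pearson correlation coefficient (R) and RMSE. It is a minimal sketch under stated assumptions: the function names, the softmax formulation, and the toy numbers are all hypothetical and are not taken from the paper, whose exact pooling definition is not given on this page. In a real model the frame scores and attention logits would be produced by the network rather than supplied by hand.

```python
import math


def attention_pool(frame_scores, attention_logits):
    """Pool per-frame scores into one utterance-level score.

    Assumed formulation: weights are a softmax over the attention logits,
    so they are positive and sum to 1, and the output is the weighted
    average of the frame scores.
    """
    if not frame_scores or len(frame_scores) != len(attention_logits):
        raise ValueError("inputs must be non-empty and of equal length")
    peak = max(attention_logits)  # subtract the max for numerical stability
    exps = [math.exp(a - peak) for a in attention_logits]
    total = sum(exps)
    return sum((e / total) * s for e, s in zip(exps, frame_scores))


def pearson_r(predicted, target):
    """Pearson correlation coefficient between predictions and targets."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


def rmse(predicted, target):
    """Root-mean-square error between predictions and targets."""
    n = len(predicted)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / n)


# Toy data (hypothetical, not from the paper).
uniform = attention_pool([3.0, 4.0, 2.0], [0.0, 0.0, 0.0])  # plain average
focused = attention_pool([1.0, 5.0], [0.0, 50.0])  # dominated by frame 2
```

With equal logits the pooling reduces to average pooling, and with one dominant logit it approaches selecting a single frame, which is one way to see why a learned weighting can sit between average and max pooling.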

Original language	English
Article number	20
Journal	Eurasip Journal on Audio, Speech, and Music Processing
Volume	2021
Issue number	1
DOIs	https://doi.org/10.1186/s13636-021-00209-4
Publication status	Published - Dec 2021

Keywords

Attention pooling
CNN-BLSTM
Neural network
Non-intrusive
Speech quality assessment

Cite this

@article{45021e295832451d98c4de2ffe3b7d72,
  title = "Neural network-based non-intrusive speech quality assessment using attention pooling function",
  abstract = "Recently, the non-intrusive speech quality assessment method has attracted a lot of attention since it does not require the original reference signals. At the same time, neural networks began to be applied to speech quality assessment and achieved good performance. To improve the performance of non-intrusive speech quality assessment, this paper proposes a neural network-based assessment method using attention pooling function. The proposed systems are based on the convolutional neural networks (CNNs), bidirectional long short-term memory (BLSTM), and CNN-LSTM structure. Comparing four types of pooling functions both theoretically and experimentally, we find the attention pooling function performs the best among the four. Experiments are conducted in a dataset containing various degraded speech signals with corresponding subjective quality scores. The results show that the proposed CNN-LSTM model using attention pooling function achieves state-of-the-art correlation coefficient (R) and root-mean-square error (RMSE) of 0.967 and 0.269, outperforming the performance of standardization ITU-T P.563 and autoencoder-support vector regression method.",
  keywords = "Attention pooling, CNN-BLSTM, Neural network, Non-intrusive, Speech quality assessment",
  author = "Miao Liu and Jing Wang and Weiming Yi and Fang Liu",
  note = "Publisher Copyright: {\textcopyright} 2021, The Author(s).",
  year = "2021",
  month = dec,
  doi = "10.1186/s13636-021-00209-4",
  language = "English",
  volume = "2021",
  journal = "Eurasip Journal on Audio, Speech, and Music Processing",
  issn = "1687-4714",
  publisher = "Springer Publishing Company",
  number = "1",
}

TY  - JOUR
T1  - Neural network-based non-intrusive speech quality assessment using attention pooling function
AU  - Liu, Miao
AU  - Wang, Jing
AU  - Yi, Weiming
AU  - Liu, Fang
PY  - 2021/12
Y1  - 2021/12
N2  - Recently, the non-intrusive speech quality assessment method has attracted a lot of attention since it does not require the original reference signals. At the same time, neural networks began to be applied to speech quality assessment and achieved good performance. To improve the performance of non-intrusive speech quality assessment, this paper proposes a neural network-based assessment method using attention pooling function. The proposed systems are based on the convolutional neural networks (CNNs), bidirectional long short-term memory (BLSTM), and CNN-LSTM structure. Comparing four types of pooling functions both theoretically and experimentally, we find the attention pooling function performs the best among the four. Experiments are conducted in a dataset containing various degraded speech signals with corresponding subjective quality scores. The results show that the proposed CNN-LSTM model using attention pooling function achieves state-of-the-art correlation coefficient (R) and root-mean-square error (RMSE) of 0.967 and 0.269, outperforming the performance of standardization ITU-T P.563 and autoencoder-support vector regression method.
AB  - Recently, the non-intrusive speech quality assessment method has attracted a lot of attention since it does not require the original reference signals. At the same time, neural networks began to be applied to speech quality assessment and achieved good performance. To improve the performance of non-intrusive speech quality assessment, this paper proposes a neural network-based assessment method using attention pooling function. The proposed systems are based on the convolutional neural networks (CNNs), bidirectional long short-term memory (BLSTM), and CNN-LSTM structure. Comparing four types of pooling functions both theoretically and experimentally, we find the attention pooling function performs the best among the four. Experiments are conducted in a dataset containing various degraded speech signals with corresponding subjective quality scores. The results show that the proposed CNN-LSTM model using attention pooling function achieves state-of-the-art correlation coefficient (R) and root-mean-square error (RMSE) of 0.967 and 0.269, outperforming the performance of standardization ITU-T P.563 and autoencoder-support vector regression method.
KW  - Attention pooling
KW  - CNN-BLSTM
KW  - Neural network
KW  - Non-intrusive
KW  - Speech quality assessment
UR  - http://www.scopus.com/inward/record.url?scp=85106215512&partnerID=8YFLogxK
U2  - 10.1186/s13636-021-00209-4
DO  - 10.1186/s13636-021-00209-4
M3  - Article
AN  - SCOPUS:85106215512
SN  - 1687-4714
VL  - 2021
JO  - Eurasip Journal on Audio, Speech, and Music Processing
JF  - Eurasip Journal on Audio, Speech, and Music Processing
IS  - 1
M1  - 20
ER  - 
