TY - GEN
T1 - Multi-stream gated and pyramidal temporal convolutional neural networks for audio-visual speech separation in multi-talker environments
AU - Luo, Yiyu
AU - Wang, Jing
AU - Xu, Liang
AU - Yang, Lidong
N1 - Publisher Copyright:
Copyright © 2021 ISCA.
PY - 2021
Y1 - 2021
N2 - Speech separation is the task of extracting target speech from a noisy mixture. In applications like video telephony or video conferencing, the lip movements of the target speaker are accessible and can be leveraged for speech separation. This paper proposes a time-domain audio-visual speech separation model for multi-talker environments. The model receives audio-visual inputs, including the noisy mixture and the speaker's lip embedding, and reconstructs the clean speech waveform for the target speaker. Once trained, the model can be flexibly applied to an unknown total number of speakers. This paper introduces and investigates a multi-stream gating mechanism and pyramidal convolution in temporal convolutional neural networks for the audio-visual speech separation task. Speaker- and noise-independent multi-talker separation experiments are conducted on the GRID benchmark dataset. The experimental results demonstrate that the proposed method achieves 3.9 dB and 1.0 dB SI-SNRi improvements over audio-only and audio-visual baselines respectively, showing the effectiveness of the proposed method.
AB - Speech separation is the task of extracting target speech from a noisy mixture. In applications like video telephony or video conferencing, the lip movements of the target speaker are accessible and can be leveraged for speech separation. This paper proposes a time-domain audio-visual speech separation model for multi-talker environments. The model receives audio-visual inputs, including the noisy mixture and the speaker's lip embedding, and reconstructs the clean speech waveform for the target speaker. Once trained, the model can be flexibly applied to an unknown total number of speakers. This paper introduces and investigates a multi-stream gating mechanism and pyramidal convolution in temporal convolutional neural networks for the audio-visual speech separation task. Speaker- and noise-independent multi-talker separation experiments are conducted on the GRID benchmark dataset. The experimental results demonstrate that the proposed method achieves 3.9 dB and 1.0 dB SI-SNRi improvements over audio-only and audio-visual baselines respectively, showing the effectiveness of the proposed method.
KW - Audio-visual speech separation
KW - Cocktail party problem
KW - Gating mechanism
KW - Pyramidal convolution
KW - Temporal convolutional neural networks
UR - http://www.scopus.com/inward/record.url?scp=85119186774&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-366
DO - 10.21437/Interspeech.2021-366
M3 - Conference contribution
AN - SCOPUS:85119186774
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 2448
EP - 2452
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Y2 - 30 August 2021 through 3 September 2021
ER -