Speaker-Independent Audio-Visual Speech Separation Based on Transformer in Multi-Talker Environments

Jing WANG; Yiyu LUO; Weiming YI; Xiang XIE

doi:10.1587/transinf.2021EDP7020

Corresponding author: Jing WANG

School of Information and Electronics

Research output: Contribution to journal › Article › peer-review

2 citations (Scopus)

Abstract

Speech separation is the task of extracting target speech while suppressing background interference components. In applications like video telephones, visual information about the target speaker is available, which can be leveraged for multi-speaker speech separation. Most previous multi-speaker separation methods are based on convolutional or recurrent neural networks. Recently, Transformer-based Seq2Seq models have achieved state-of-the-art performance in various tasks, such as neural machine translation (NMT) and automatic speech recognition (ASR). Transformer has shown an advantage in modeling audio-visual temporal context by multi-head attention blocks through explicitly assigning attention weights. Moreover, Transformer does not have any recurrent subnetworks, thus supporting parallelization of sequence computation. In this paper, we propose a novel speaker-independent audio-visual speech separation method based on Transformer, which can be flexibly applied to an unknown number and identity of speakers. The model receives both audio and visual streams, namely the noisy spectrogram and speaker lip embeddings, and predicts a complex time-frequency mask for the corresponding target speaker. The model consists of three main components: an audio encoder, a visual encoder and a Transformer-based mask generator. Two different encoder structures are investigated and compared: ResNet-based and Transformer-based. The performance of the proposed method is evaluated in terms of source separation and speech quality metrics. The experimental results on the benchmark GRID dataset show the effectiveness of the method on the speaker-independent separation task in multi-talker environments. The model generalizes well to unseen speaker identities and noise types. Though only trained on 2-speaker mixtures, the model achieves reasonable performance when tested on 2-speaker and 3-speaker mixtures. In addition, the model still shows an advantage compared with previous audio-visual speech separation works.
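
The abstract states that the model predicts a complex time-frequency mask for the target speaker. As a rough illustration of what applying such a mask means, the sketch below multiplies a complex mask element-wise with a mixture spectrogram. This is not the authors' implementation: the function name `apply_complex_mask`, the toy array sizes, and the use of an oracle mask (computed from the known target) in place of a network prediction are all assumptions made for the example.

```python
import numpy as np


def apply_complex_mask(mixture_spec: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Apply a complex time-frequency mask by element-wise complex multiplication.

    Hypothetical helper for illustration only; shapes are (freq_bins, frames).
    """
    if mixture_spec.shape != mask.shape:
        raise ValueError("mask and spectrogram must have the same shape")
    return mask * mixture_spec


# Toy complex spectrograms: 4 frequency bins x 5 time frames (arbitrary sizes).
rng = np.random.default_rng(0)
shape = (4, 5)
target = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
interferer = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mixture = target + interferer

# Assumption: an oracle mask stands in for the mask a trained model would predict.
oracle_mask = target / mixture

# Masking the mixture with the oracle mask recovers the target spectrogram.
estimate = apply_complex_mask(mixture, oracle_mask)
print(np.allclose(estimate, target))
```

Because the mask is complex, it can adjust both the magnitude and the phase of each time-frequency bin, which a real-valued magnitude mask cannot do.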

Original language	English
Pages (from-to)	766-777
Number of pages	12
Journal	IEICE Transactions on Information and Systems
Volume	105
Issue number	4
DOI	https://doi.org/10.1587/transinf.2021EDP7020
Publication status	Published - 2022

Cite this

@article{9826f36d169748db8637a76f5426f79e,

title = "Speaker-Independent Audio-Visual Speech Separation Based on Transformer in Multi-Talker Environments",

abstract = "Speech separation is the task of extracting target speech while suppressing background interference components. In applications like video telephones, visual information about the target speaker is available, which can be leveraged for multi-speaker speech separation. Most previous multi-speaker separation methods are based on convolutional or recurrent neural networks. Recently, Transformer-based Seq2Seq models have achieved state-of-the-art performance in various tasks, such as neural machine translation (NMT) and automatic speech recognition (ASR). Transformer has shown an advantage in modeling audio-visual temporal context by multi-head attention blocks through explicitly assigning attention weights. Moreover, Transformer does not have any recurrent subnetworks, thus supporting parallelization of sequence computation. In this paper, we propose a novel speaker-independent audio-visual speech separation method based on Transformer, which can be flexibly applied to an unknown number and identity of speakers. The model receives both audio and visual streams, namely the noisy spectrogram and speaker lip embeddings, and predicts a complex time-frequency mask for the corresponding target speaker. The model consists of three main components: an audio encoder, a visual encoder and a Transformer-based mask generator. Two different encoder structures are investigated and compared: ResNet-based and Transformer-based. The performance of the proposed method is evaluated in terms of source separation and speech quality metrics. The experimental results on the benchmark GRID dataset show the effectiveness of the method on the speaker-independent separation task in multi-talker environments. The model generalizes well to unseen speaker identities and noise types. Though only trained on 2-speaker mixtures, the model achieves reasonable performance when tested on 2-speaker and 3-speaker mixtures. In addition, the model still shows an advantage compared with previous audio-visual speech separation works.",

keywords = "audio-visual speech separation, lip embedding, multi-head attention, multi-talker, time-frequency mask, transformer",

author = "Jing WANG and Yiyu LUO and Weiming YI and Xiang XIE",

note = "Publisher Copyright: {\textcopyright} 2022 The Institute of Electronics.",

year = "2022",

doi = "10.1587/transinf.2021EDP7020",

language = "English",

volume = "105",

pages = "766--777",

journal = "IEICE Transactions on Information and Systems",

issn = "0916-8532",

publisher = "Maruzen Co., Ltd/Maruzen Kabushikikaisha",

number = "4",

}

TY - JOUR

T1 - Speaker-Independent Audio-Visual Speech Separation Based on Transformer in Multi-Talker Environments

AU - WANG, Jing

AU - LUO, Yiyu

AU - YI, Weiming

AU - XIE, Xiang

PY - 2022

Y1 - 2022

AB - Speech separation is the task of extracting target speech while suppressing background interference components. In applications like video telephones, visual information about the target speaker is available, which can be leveraged for multi-speaker speech separation. Most previous multi-speaker separation methods are based on convolutional or recurrent neural networks. Recently, Transformer-based Seq2Seq models have achieved state-of-the-art performance in various tasks, such as neural machine translation (NMT) and automatic speech recognition (ASR). Transformer has shown an advantage in modeling audio-visual temporal context by multi-head attention blocks through explicitly assigning attention weights. Moreover, Transformer does not have any recurrent subnetworks, thus supporting parallelization of sequence computation. In this paper, we propose a novel speaker-independent audio-visual speech separation method based on Transformer, which can be flexibly applied to an unknown number and identity of speakers. The model receives both audio and visual streams, namely the noisy spectrogram and speaker lip embeddings, and predicts a complex time-frequency mask for the corresponding target speaker. The model consists of three main components: an audio encoder, a visual encoder and a Transformer-based mask generator. Two different encoder structures are investigated and compared: ResNet-based and Transformer-based. The performance of the proposed method is evaluated in terms of source separation and speech quality metrics. The experimental results on the benchmark GRID dataset show the effectiveness of the method on the speaker-independent separation task in multi-talker environments. The model generalizes well to unseen speaker identities and noise types. Though only trained on 2-speaker mixtures, the model achieves reasonable performance when tested on 2-speaker and 3-speaker mixtures. In addition, the model still shows an advantage compared with previous audio-visual speech separation works.

KW - audio-visual speech separation

KW - lip embedding

KW - multi-head attention

KW - multi-talker

KW - time-frequency mask

KW - transformer

UR - http://www.scopus.com/inward/record.url?scp=85129679994&partnerID=8YFLogxK

U2 - 10.1587/transinf.2021EDP7020

DO - 10.1587/transinf.2021EDP7020

M3 - Article

AN - SCOPUS:85129679994

SN - 0916-8532

VL - 105

SP - 766

EP - 777

JO - IEICE Transactions on Information and Systems

JF - IEICE Transactions on Information and Systems

IS - 4

ER -
