Lightweight target speaker separation network based on joint training

Jing Wang; Hanyue Liu; Liang Xu; Wenjing Yang; Weiming Yi; Fang Liu

doi:10.1186/s13636-023-00317-3

Jing Wang, Hanyue Liu, Liang Xu, Wenjing Yang, Weiming Yi*, Fang Liu

*Corresponding author for this work

Beijing Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Target speaker separation aims to extract the target speaker’s speech from mixed speech and to remove extraneous components such as noise. In recent years, deep learning-based speech separation methods have made significant breakthroughs and have gradually become mainstream. However, existing methods generally face problems with system latency and limits on achievable performance because of their large model size. To address these problems, this paper proposes improvements to both the network structure and the training method. First, a lightweight target speaker separation network based on long short-term memory (LSTM) is proposed, which reduces model size and computational delay while maintaining separation performance. Building on this network, a joint training method is proposed so that the target speaker separation system is trained and optimized as a whole. Joint loss functions based on speaker registration and speaker separation are proposed for this joint training to further improve the system’s performance. Experimental results show that the proposed lightweight network achieves better performance while remaining lightweight, and that jointly training the target speaker separation network with the proposed joint loss functions further improves the separation performance of the original model.

Original language	English
Article number	53
Journal	Eurasip Journal on Audio, Speech, and Music Processing
Volume	2023
Issue number	1
DOIs	https://doi.org/10.1186/s13636-023-00317-3
Publication status	Published - Dec 2023

Keywords

Joint training
Lightweight network
Loss function
Target speaker separation

Cite this

@article{15ad4ff600a245f7bbcc9aff2941a3ea,
title = "Lightweight target speaker separation network based on joint training",
abstract = "Target speaker separation aims to separate the speech components of the target speaker from mixed speech and remove extraneous components such as noise. In recent years, deep learning-based speech separation methods have made significant breakthroughs and have gradually become mainstream. However, these existing methods generally face problems with system latency and performance upper limits due to the large model size. To solve these problems, this paper proposes improvements in the network structure and training methods to enhance the model{\textquoteright}s performance. A lightweight target speaker separation network based on long-short-term memory (LSTM) is proposed, which can reduce the model size and computational delay while maintaining the separation performance. Based on this, a target speaker separation method based on joint training is proposed to achieve the overall training and optimization of the target speaker separation system. Joint loss functions based on speaker registration and speaker separation are proposed for joint training of the network to further improve the system{\textquoteright}s performance. The experimental results show that the lightweight target speaker separation network proposed in this paper has better performance while being lightweight, and joint training of the target speaker separation network with our proposed loss function can further improve the separation performance of the original model.",
keywords = "Joint training, Lightweight network, Loss function, Target speaker separation",
author = "Jing Wang and Hanyue Liu and Liang Xu and Wenjing Yang and Weiming Yi and Fang Liu",
note = "Publisher Copyright: {\textcopyright} 2023, The Author(s).",
year = "2023",
month = dec,
doi = "10.1186/s13636-023-00317-3",
language = "English",
volume = "2023",
journal = "Eurasip Journal on Audio, Speech, and Music Processing",
issn = "1687-4714",
publisher = "Springer Publishing Company",
number = "1",
}

TY  - JOUR
T1  - Lightweight target speaker separation network based on joint training
AU  - Wang, Jing
AU  - Liu, Hanyue
AU  - Xu, Liang
AU  - Yang, Wenjing
AU  - Yi, Weiming
AU  - Liu, Fang
PY  - 2023/12
Y1  - 2023/12
AB  - Target speaker separation aims to separate the speech components of the target speaker from mixed speech and remove extraneous components such as noise. In recent years, deep learning-based speech separation methods have made significant breakthroughs and have gradually become mainstream. However, these existing methods generally face problems with system latency and performance upper limits due to the large model size. To solve these problems, this paper proposes improvements in the network structure and training methods to enhance the model’s performance. A lightweight target speaker separation network based on long-short-term memory (LSTM) is proposed, which can reduce the model size and computational delay while maintaining the separation performance. Based on this, a target speaker separation method based on joint training is proposed to achieve the overall training and optimization of the target speaker separation system. Joint loss functions based on speaker registration and speaker separation are proposed for joint training of the network to further improve the system’s performance. The experimental results show that the lightweight target speaker separation network proposed in this paper has better performance while being lightweight, and joint training of the target speaker separation network with our proposed loss function can further improve the separation performance of the original model.
KW  - Joint training
KW  - Lightweight network
KW  - Loss function
KW  - Target speaker separation
UR  - http://www.scopus.com/inward/record.url?scp=85178663973&partnerID=8YFLogxK
U2  - 10.1186/s13636-023-00317-3
DO  - 10.1186/s13636-023-00317-3
M3  - Article
AN  - SCOPUS:85178663973
SN  - 1687-4714
VL  - 2023
JO  - Eurasip Journal on Audio, Speech, and Music Processing
JF  - Eurasip Journal on Audio, Speech, and Music Processing
IS  - 1
M1  - 53
ER  - 
