TY - JOUR
T1 - JointNet
T2 - Joint Learning for Simultaneous DOA Estimation and Speech Enhancement in Noisy and Reverberant Environments
AU - Xiong, Wenmeng
AU - Jia, Maoshen
AU - Zhou, Jing
AU - Zhang, Jing
AU - Shen, Qing
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2026
Y1 - 2026
N2 - In this paper, we design a joint learning network to simultaneously address the tasks of direction-of-arrival (DOA) estimation and speech enhancement. The proposed network consists of two DOA estimation blocks, a speech enhancement block, and two interaction blocks. Specifically, cross-narrowband modules are employed in both the DOA estimation block and the speech enhancement block to learn both the frequency dependencies and the temporal correlations of time-frequency (TF) domain microphone signals. Bidirectional interaction blocks are designed to fully exploit the synergy between the two tasks by integrating DOA information of the sources into the speech enhancement block and feeding the enhanced high-quality signals from the speech enhancement block back into the DOA estimation blocks. In this way, the performance of both tasks can be improved compared with independent training. Experiments were conducted on two datasets: the first is generated by convolving simulated room impulse responses (RIRs) with clean speech from the LibriSpeech dataset, while in the second, clean speech from the DNS Challenge dataset is convolved with both simulated and real-world recorded RIRs. The experimental results demonstrate that the proposed joint learning method can significantly improve the performance of both DOA estimation and speech enhancement compared with baseline methods.
AB - In this paper, we design a joint learning network to simultaneously address the tasks of direction-of-arrival (DOA) estimation and speech enhancement. The proposed network consists of two DOA estimation blocks, a speech enhancement block, and two interaction blocks. Specifically, cross-narrowband modules are employed in both the DOA estimation block and the speech enhancement block to learn both the frequency dependencies and the temporal correlations of time-frequency (TF) domain microphone signals. Bidirectional interaction blocks are designed to fully exploit the synergy between the two tasks by integrating DOA information of the sources into the speech enhancement block and feeding the enhanced high-quality signals from the speech enhancement block back into the DOA estimation blocks. In this way, the performance of both tasks can be improved compared with independent training. Experiments were conducted on two datasets: the first is generated by convolving simulated room impulse responses (RIRs) with clean speech from the LibriSpeech dataset, while in the second, clean speech from the DNS Challenge dataset is convolved with both simulated and real-world recorded RIRs. The experimental results demonstrate that the proposed joint learning method can significantly improve the performance of both DOA estimation and speech enhancement compared with baseline methods.
KW - Direction of arrival estimation
KW - convolutional neural network
KW - joint learning
KW - long short-term memory
KW - speech enhancement
UR - https://www.scopus.com/pages/publications/105027337991
U2 - 10.1109/TASLPRO.2026.3651053
DO - 10.1109/TASLPRO.2026.3651053
M3 - Article
AN - SCOPUS:105027337991
SN - 1558-7916
VL - 34
SP - 596
EP - 611
JO - IEEE Transactions on Audio, Speech and Language Processing
JF - IEEE Transactions on Audio, Speech and Language Processing
ER -