Semantic dependency and local convolution for enhancing naturalness and tone in text-to-speech synthesis

Chenglong Jiang, Ying Gao*, Wing W.Y. Ng, Jiyong Zhou, Jinghui Zhong, Hongzhong Zhen, Xiping Hu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Self-attention-based networks have become increasingly popular due to their exceptional performance in parallel training and global context modeling. However, they may fall short of capturing local dependencies, particularly in datasets with strong local correlations. To address this challenge, we propose a novel method that utilizes semantic dependency to extract linguistic information from the original text. The semantic relationship between nodes serves as prior knowledge to refine the self-attention distribution. Additionally, to better fuse local contextual information, we introduce a one-dimensional convolutional neural network to generate the query and value matrices in the self-attention mechanism, taking advantage of the strong correlation between input characters. We apply this variant of the self-attention network to text-to-speech tasks and propose a non-autoregressive neural text-to-speech model. To enhance pronunciation accuracy, we separate tones from phonemes as independent features in model training. Experimental results show that our model achieves good performance in speech synthesis. Specifically, the proposed method significantly improves the processing of pause, stress, and intonation in speech.
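The abstract's central mechanism — generating the query and value matrices with a one-dimensional convolution so that attention carries local context, and biasing the attention scores with prior knowledge from semantic dependencies — can be illustrated with a minimal sketch. The code below is a hypothetical single-head NumPy rendering, not the paper's implementation: the function and variable names (`local_conv_attention`, `conv1d`, the additive `mask` standing in for the semantic-dependency prior) are assumptions made for illustration.

```python
import numpy as np

def conv1d(x, w):
    """'Same'-padded 1-D convolution over the time axis.
    x: (T, d_in) input sequence; w: (k, d_in, d_out) kernel."""
    T, d_in = x.shape
    k, _, d_out = w.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((T, d_out))
    for t in range(T):
        window = xp[t:t + k]                      # (k, d_in) local window
        out[t] = np.einsum('ki,kio->o', window, w)
    return out

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def local_conv_attention(x, Wq, Wv, Wk, mask=None):
    """Self-attention variant in the spirit of the abstract:
    Q and V come from 1-D convolutions (fusing local context),
    K from a plain linear projection; `mask` is an optional additive
    bias on the scores, standing in for the semantic-dependency prior."""
    Q = conv1d(x, Wq)                             # (T, d)
    V = conv1d(x, Wv)                             # (T, d)
    K = x @ Wk                                    # (T, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # scaled dot-product
    if mask is not None:
        scores = scores + mask                    # prior-knowledge refinement
    A = softmax(scores)                           # attention distribution
    return A @ V, A
```

In this sketch, characters that the semantic-dependency graph links closely would receive a positive bias in `mask`, pushing attention mass toward them before the softmax; the convolutional Q/V paths add the local correlations that plain dot-product attention tends to miss.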

Original language: English
Article number: 128430
Journal: Neurocomputing
Volume: 608
DOIs
Publication status: Published - 1 Dec 2024
Externally published: Yes

Keywords

  • Local convolution
  • Naturalness
  • Semantic dependency
  • Text-to-speech synthesis
  • Tone
