
Swahili Speech Dataset Development and Improved Pre-training Method for Spoken Digit Recognition

  • Beijing Institute of Technology
  • University of Dar Es Salaam

Research output: Contribution to journal › Article › peer-review

Abstract

A speech dataset is an essential component in building commercial speech applications. However, low-resource languages such as Swahili lack such datasets, which are vital for spoken digit recognition. Even for languages where such resources exist, they are usually insufficient. Pre-training methods have therefore been used with external resources to improve continuous speech recognition. However, to the best of our knowledge, no study has investigated the effect of pre-training methods specifically on spoken digit recognition. This study aims to address these problems. First, we developed a Swahili spoken digit dataset for Swahili spoken digit recognition. Then, we investigated the effect of cross-lingual and multi-lingual pre-training methods on spoken digit recognition. Finally, we proposed an effective language-independent pre-training method for spoken digit recognition. The proposed method has the advantage of incorporating target-language data during the pre-training stage, which leads to an optimal solution when using less training data. Experiments on Swahili (the dataset developed in this study), English, and Gujarati datasets show that our method outperforms all the baselines considered in this study.

Original language: English
Article number: 3597494
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Volume: 22
Issue number: 7
Publication status: Published - 20 Jul 2023

Keywords

  • Swahili language
  • convolutional neural network
  • cross-lingual
  • low-resource language
  • multi-lingual
  • pre-training
  • spoken digit recognition
