Assembling Chinese-Mongolian speech corpus via crowdsourcing

Rihai Su, Shumin Shi*, Meng Zhao, Heyan Huang

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The Chinese-Mongolian Speech Corpus (CMSC) has been used in many practical applications in recent years, yet it remains a low-resource corpus because it is costly to construct. We describe a crowdsourcing method for building a bilingual speech corpus through the messaging app WeChat, in which followers can freely send voice and text messages to our Official Account Platform. Because most followers are fluent in both Chinese and Mongolian, we gathered natural speech recordings from daily life and constructed a parallel speech corpus of 20547 utterances from 296 speakers, totalling 21.43 h of speech, during the first 25 days after the collection notification was pushed. Moreover, we present a quality control measure in the evaluation part, in which independent subscribers voted on the translations of each source sentence; this markedly improves the quality of the corpus. We show that the WeChat Official Account Platform can be used to assemble a speech corpus quickly and cheaply, with near-expert accuracy. As basic research in natural language processing (NLP), the construction of a bilingual speech corpus via crowdsourcing has reference value for similar studies.

Original language: English
Title of host publication: Advances in Swarm Intelligence - 8th International Conference, ICSI 2017, Proceedings
Editors: Ben Niu, Hideyuki Takagi, Yuhui Shi, Ying Tan
Publisher: Springer Verlag
Pages: 547-555
Number of pages: 9
ISBN (Print): 9783319618326
Publication status: Published - 2017
Event: 8th International Conference on Swarm Intelligence, ICSI 2017 - Fukuoka, Japan
Duration: 27 Jul 2017 → 1 Aug 2017

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10386 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 8th International Conference on Swarm Intelligence, ICSI 2017
Country/Territory: Japan
City: Fukuoka
Period: 27/07/17 → 1/08/17

Keywords

  • Crowdsourcing
  • Mongolian
  • Speech corpus
  • WeChat
