Assembling Chinese-Mongolian speech corpus via crowdsourcing

Rihai Su, Shumin Shi*, Meng Zhao, Heyan Huang

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Chinese-Mongolian speech corpora (CMSC) have been used in many practical applications in recent years, yet they remain a low-resource type of corpus because they are costly to construct. We describe a crowdsourcing method for building a bilingual speech corpus through the messaging app WeChat, in which followers can freely send voice and text messages to our Official Account Platform. Because most followers are fluent in both Chinese and Mongolian, we gathered natural speech recordings from daily life and constructed a parallel speech corpus of 20547 utterances from 296 speakers, totalling 21.43 h of speech, during the first 25 days after the collection notification was pushed. Moreover, in the evaluation part we present a quality control measure in which independent subscribers vote on the translations of each source sentence; it markedly improves the quality of the corpus. We show that the WeChat Official Account Platform can be used to assemble a speech corpus quickly and cheaply, with near-expert accuracy. Because corpus construction is a basic research topic in natural language processing (NLP), building a bilingual speech corpus via crowdsourcing offers a useful reference for similar studies.
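
The abstract does not say how subscriber votes are aggregated, so the following is only a minimal sketch of one possible vote-based filter, not the authors' procedure. The function name `select_translation`, the thresholds `min_votes` and `min_share`, and the sample vote data are all hypothetical.

```python
from collections import Counter

def select_translation(votes, min_votes=3, min_share=0.5):
    """Pick the most-voted candidate translation for one source sentence.

    Returns None when there are too few votes or when the top candidate
    does not hold a strict majority share (both thresholds are assumptions).
    """
    if len(votes) < min_votes:
        return None
    candidate, count = Counter(votes).most_common(1)[0]
    if count / len(votes) <= min_share:
        return None
    return candidate

# Hypothetical votes: each list holds the candidate IDs chosen by subscribers.
votes_by_sentence = {
    "s1": ["t_a", "t_a", "t_b", "t_a"],  # clear majority for t_a
    "s2": ["t_c", "t_d"],                # too few votes
    "s3": ["t_e", "t_f", "t_e", "t_f"],  # tie, no strict majority
}

accepted = {sid: select_translation(v) for sid, v in votes_by_sentence.items()}
print(accepted)
```

Sentences that map to `None` would need more votes or manual review before entering the corpus; a stricter `min_share` trades coverage for accuracy.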

Original language: English
Title of host publication: Advances in Swarm Intelligence - 8th International Conference, ICSI 2017, Proceedings
Editors: Ben Niu, Hideyuki Takagi, Yuhui Shi, Ying Tan
Publisher: Springer Verlag
Pages: 547-5555
Number of pages: 5009
ISBN (Print): 9783319618326
DOI
Publication status: Published - 2017
Event: 8th International Conference on Swarm Intelligence, ICSI 2017 - Fukuoka, Japan
Duration: 27 Jul 2017 to 1 Aug 2017

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10386 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 8th International Conference on Swarm Intelligence, ICSI 2017
Country/Territory: Japan
City: Fukuoka
Period: 27/07/17 to 1/08/17
