Assembling Chinese-Mongolian speech corpus via crowdsourcing

Rihai Su; Shumin Shi; Meng Zhao; Heyan Huang

doi:10.1007/978-3-319-61833-3_58

Rihai Su, Shumin Shi^*, Meng Zhao, Heyan Huang

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The Chinese-Mongolian Speech Corpus (CMSC) has been used in many practical applications in recent years, yet it remains a low-resource corpus because it is costly to construct. We describe a crowdsourcing method for building a bilingual speech corpus through the messaging app WeChat, in which followers can freely send voice and text messages to our Official Account Platform. Because most followers are fluent in both Chinese and Mongolian, we were able to gather natural, everyday speech recordings and construct a parallel speech corpus of 20547 utterances from 296 speakers, totalling 21.43 h of speech, during the first 25 days after the collection notification was pushed. We also present a quality control measure, applied in the evaluation stage, in which independent subscribers vote on the translations of each source sentence; this markedly improves the quality of the corpus. We show that the WeChat Official Account Platform can be used to assemble a speech corpus quickly and cheaply, with near-expert accuracy. Because corpus construction is a basic research topic in natural language processing (NLP), this crowdsourced approach to building a bilingual speech corpus offers a useful reference for similar studies.

Original language	English
Title of host publication	Advances in Swarm Intelligence - 8th International Conference, ICSI 2017, Proceedings
Editors	Ben Niu, Hideyuki Takagi, Yuhui Shi, Ying Tan
Publisher	Springer Verlag
Pages	547-555
Number of pages	9
ISBN (Print)	9783319618326
DOIs	https://doi.org/10.1007/978-3-319-61833-3_58
Publication status	Published - 2017
Event	8th International Conference on Swarm Intelligence, ICSI 2017 - Fukuoka, Japan Duration: 27 Jul 2017 → 1 Aug 2017

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	10386 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	8th International Conference on Swarm Intelligence, ICSI 2017
Country/Territory	Japan
City	Fukuoka
Period	27/07/17 → 1/08/17

Keywords

Crowdsourcing
Mongolian
Speech corpus
WeChat

Access to Document

10.1007/978-3-319-61833-3_58

Cite this

Su, R., Shi, S., Zhao, M., & Huang, H. (2017). Assembling Chinese-Mongolian speech corpus via crowdsourcing. In B. Niu, H. Takagi, Y. Shi, & Y. Tan (Eds.), Advances in Swarm Intelligence - 8th International Conference, ICSI 2017, Proceedings (pp. 547-555). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10386 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-319-61833-3_58

Su, Rihai ; Shi, Shumin ; Zhao, Meng et al. / Assembling Chinese-Mongolian speech corpus via crowdsourcing. Advances in Swarm Intelligence - 8th International Conference, ICSI 2017, Proceedings. editor / Ben Niu ; Hideyuki Takagi ; Yuhui Shi ; Ying Tan. Springer Verlag, 2017. pp. 547-555 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{3a87affe8d194423a12e7399f68b74d1,

title = "Assembling Chinese-Mongolian speech corpus via crowdsourcing",

abstract = "The Chinese-Mongolian Speech Corpus (CMSC) has been used in many practical applications in recent years, yet it remains a low-resource corpus because it is costly to construct. We describe a crowdsourcing method for building a bilingual speech corpus through the messaging app WeChat, in which followers can freely send voice and text messages to our Official Account Platform. Because most followers are fluent in both Chinese and Mongolian, we were able to gather natural, everyday speech recordings and construct a parallel speech corpus of 20547 utterances from 296 speakers, totalling 21.43 h of speech, during the first 25 days after the collection notification was pushed. We also present a quality control measure, applied in the evaluation stage, in which independent subscribers vote on the translations of each source sentence; this markedly improves the quality of the corpus. We show that the WeChat Official Account Platform can be used to assemble a speech corpus quickly and cheaply, with near-expert accuracy. Because corpus construction is a basic research topic in natural language processing (NLP), this crowdsourced approach to building a bilingual speech corpus offers a useful reference for similar studies.",

keywords = "Crowdsourcing, Mongolian, Speech corpus, WeChat",

author = "Rihai Su and Shumin Shi and Meng Zhao and Heyan Huang",

note = "Publisher Copyright: {\textcopyright} Springer International Publishing AG 2017.; 8th International Conference on Swarm Intelligence, ICSI 2017 ; Conference date: 27-07-2017 Through 01-08-2017",

year = "2017",

doi = "10.1007/978-3-319-61833-3_58",

language = "English",

isbn = "9783319618326",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Verlag",

pages = "547--555",

editor = "Ben Niu and Hideyuki Takagi and Yuhui Shi and Ying Tan",

booktitle = "Advances in Swarm Intelligence - 8th International Conference, ICSI 2017, Proceedings",

address = "Germany",

}

Su, R, Shi, S, Zhao, M & Huang, H 2017, Assembling Chinese-Mongolian speech corpus via crowdsourcing. in B Niu, H Takagi, Y Shi & Y Tan (eds), Advances in Swarm Intelligence - 8th International Conference, ICSI 2017, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 10386 LNCS, Springer Verlag, pp. 547-555, 8th International Conference on Swarm Intelligence, ICSI 2017, Fukuoka, Japan, 27/07/17. https://doi.org/10.1007/978-3-319-61833-3_58

Assembling Chinese-Mongolian speech corpus via crowdsourcing. / Su, Rihai; Shi, Shumin; Zhao, Meng et al.
Advances in Swarm Intelligence - 8th International Conference, ICSI 2017, Proceedings. ed. / Ben Niu; Hideyuki Takagi; Yuhui Shi; Ying Tan. Springer Verlag, 2017. p. 547-555 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10386 LNCS).

TY - GEN

T1 - Assembling Chinese-Mongolian speech corpus via crowdsourcing

AU - Su, Rihai

AU - Shi, Shumin

AU - Zhao, Meng

AU - Huang, Heyan

N1 - Publisher Copyright: © Springer International Publishing AG 2017.

PY - 2017

Y1 - 2017

AB - The Chinese-Mongolian Speech Corpus (CMSC) has been used in many practical applications in recent years, yet it remains a low-resource corpus because it is costly to construct. We describe a crowdsourcing method for building a bilingual speech corpus through the messaging app WeChat, in which followers can freely send voice and text messages to our Official Account Platform. Because most followers are fluent in both Chinese and Mongolian, we were able to gather natural, everyday speech recordings and construct a parallel speech corpus of 20547 utterances from 296 speakers, totalling 21.43 h of speech, during the first 25 days after the collection notification was pushed. We also present a quality control measure, applied in the evaluation stage, in which independent subscribers vote on the translations of each source sentence; this markedly improves the quality of the corpus. We show that the WeChat Official Account Platform can be used to assemble a speech corpus quickly and cheaply, with near-expert accuracy. Because corpus construction is a basic research topic in natural language processing (NLP), this crowdsourced approach to building a bilingual speech corpus offers a useful reference for similar studies.

KW - Crowdsourcing

KW - Mongolian

KW - Speech corpus

KW - WeChat

UR - http://www.scopus.com/inward/record.url?scp=85026739996&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-61833-3_58

DO - 10.1007/978-3-319-61833-3_58

M3 - Conference contribution

AN - SCOPUS:85026739996

SN - 9783319618326

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 547

EP - 555

BT - Advances in Swarm Intelligence - 8th International Conference, ICSI 2017, Proceedings

A2 - Niu, Ben

A2 - Takagi, Hideyuki

A2 - Shi, Yuhui

A2 - Tan, Ying

PB - Springer Verlag

T2 - 8th International Conference on Swarm Intelligence, ICSI 2017

Y2 - 27 July 2017 through 1 August 2017

ER -

Su R, Shi S, Zhao M, Huang H. Assembling Chinese-Mongolian speech corpus via crowdsourcing. In Niu B, Takagi H, Shi Y, Tan Y, editors, Advances in Swarm Intelligence - 8th International Conference, ICSI 2017, Proceedings. Springer Verlag. 2017. p. 547-555. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-319-61833-3_58
