Construction of Uighur-Chinese parallel corpus

J. L. Song, L. Dai

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract

A Uighur-Chinese parallel corpus is an important foundation for Uighur-Chinese cross-language information processing. Because Uighur is a minority language, constructing such a corpus is comparatively difficult. In this paper, we discuss issues related to its construction. First, we introduce the selection of corpus resources. Second, to accelerate construction and improve the quality of the corpus, we develop an assistant construction system based on techniques such as webpage content extraction and text duplication removal. Using this system, we build a Uighur-Chinese parallel corpus of approximately 300,000 sentence pairs and a moderate-sized dictionary of person and place names. Finally, to evaluate the corpus, we use it to build a demo Uighur-Chinese statistical translation system. The results provide preliminary verification of its effectiveness.

Original language: English
Title of host publication: Multimedia, Communication and Computing Application - Proceedings of the International Conference on Multimedia, Communication and Computing Application, MCCA 2014
Editors: Ally Leung
Publisher: CRC Press/Balkema
Pages: 353-356
Number of pages: 4
ISBN (Print): 9781138027756
Publication status: Published - 2015
Event: International Conference on Multimedia, Communication and Computing Application, MCCA 2014 - Xiamen, China
Duration: 15 Oct 2014 → 16 Oct 2014

Publication series

Name: Multimedia, Communication and Computing Application - Proceedings of the International Conference on Multimedia, Communication and Computing Application, MCCA 2014

Conference

Conference: International Conference on Multimedia, Communication and Computing Application, MCCA 2014
Country/Territory: China
City: Xiamen
Period: 15/10/14 → 16/10/14
