Abstract
To date, Tibetan-Chinese (Ti-Zh) machine translation in the judicial domain faces a severe data-sparsity problem. In this work, we examine the problem from two aspects: 1) Judicial Tibetan demands more rigorous logical expression and more specialized terminology than general-domain Tibetan, yet there is hardly any high-quality judicial-domain Ti-Zh corpus that covers this terminology and syntactic structure. 2) Constructing a Ti-Zh parallel corpus is challenging because of the unique lexical expressions and specific syntactic structures involved. To this end, we propose a lightweight method for constructing a judicial-domain Ti-Zh parallel corpus. First, we build a medium-scale Tibetan-Chinese glossary of judicial terminology to serve as prior knowledge, which avoids the loss of logical expressions and domain terminology caused by out-of-domain data. Second, we collect instance data, such as judgment documents, from the official websites of Chinese courts in various regions. To avoid losing Tibetan lexical expressions and syntactic structures, we first search for Tibetan case data and then for the corresponding Chinese data. Based on these principles, we build a high-quality Tibetan-Chinese parallel corpus through the following steps: corpus crawling, document segmentation and alignment detection, sentence boundary recognition, and automatic corpus cleaning. Finally, we construct a judicial-domain Ti-Zh parallel corpus of 60,000 pairs and evaluate its quality and robustness with a variety of translation models and cross-validation experiments. The corpus will be released as open source for other researchers to use in related work.
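The sentence boundary recognition and automatic cleaning steps named in the abstract could look roughly like the following Python sketch. It is an illustration only, not the authors' implementation: the abstract does not give the actual rules, so the terminator sets (Tibetan shad `།` and double shad `༎`, Chinese `。！？；`), the function names `split_sentences` and `clean_pairs`, and the length-ratio thresholds are all assumptions. In particular, the shad also marks clause boundaries in Tibetan, so splitting on it alone will over-segment real text.

```python
import re

# Assumed terminator sets; the paper's real boundary rules are not stated.
TIBETAN_TERMINATORS = "།༎"
CHINESE_TERMINATORS = "。！？；"


def split_sentences(text, terminators):
    """Naively split text after each run of terminator characters."""
    chars = re.escape(terminators)
    # One or more non-terminators, followed by any trailing terminators.
    pattern = "[^" + chars + "]+[" + chars + "]*"
    pieces = (m.group().strip() for m in re.finditer(pattern, text))
    return [p for p in pieces if p]


def clean_pairs(pairs, min_ratio=0.5, max_ratio=10.0):
    """Drop empty, duplicate, and length-mismatched (Tibetan, Chinese) pairs.

    The ratio bounds are placeholder values (an assumption); they compare
    the number of Unicode code points on each side.
    """
    seen = set()
    kept = []
    for ti, zh in pairs:
        ti, zh = ti.strip(), zh.strip()
        if not ti or not zh:
            continue  # one side is missing
        ratio = len(ti) / len(zh)
        if not (min_ratio <= ratio <= max_ratio):
            continue  # lengths too different to be a plausible translation
        if (ti, zh) in seen:
            continue  # exact duplicate
        seen.add((ti, zh))
        kept.append((ti, zh))
    return kept
```

A character-length ratio is only a coarse filter. Tibetan text typically uses more code points than its Chinese equivalent, so any thresholds would need tuning on real aligned data.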
Translated title of the contribution | A High-quality Open Source Tibetan-Chinese Parallel Corpus Construction of Judicial Domain |
---|---|
Original language | Chinese (Traditional) |
Pages | 499-508 |
Number of pages | 10 |
Publication status | Published - 2020 |
Event | 19th Chinese National Conference on Computational Linguistics, CCL 2020 - Haikou, China; Duration: 30 Oct 2020 → 1 Nov 2020 |
Conference
Conference | 19th Chinese National Conference on Computational Linguistics, CCL 2020 |
---|---|
Country/Territory | China |
City | Haikou |
Period | 30/10/20 → 1/11/20 |