面向司法领域的高质量开源藏汉平行语料库构建

Translated title of the contribution: A High-quality Open Source Tibetan-Chinese Parallel Corpus Construction of Judicial Domain

Jiu Sha, Luqin Zhou, Chong Feng, Hongzheng Li, Tianfu Zhang, Hui Hui

Research output: Contribution to conference › Paper › peer-review

Abstract

Tibetan-Chinese (Ti-Zh) machine translation in the judicial domain currently faces a severe data sparsity problem. We approach the problem from two aspects. 1) Judicial Tibetan requires more rigorous logical expression and more specialized terminology than general-domain text, yet there is hardly any high-quality Ti-Zh corpus in the judicial domain that covers this terminology and syntactic structure. 2) Constructing a Ti-Zh parallel corpus is challenging because of the unique lexical expressions and specific syntactic structures involved. To this end, we propose a lightweight method for constructing a Ti-Zh parallel corpus for the judicial domain. First, we build a medium-scale Tibetan-Chinese terminology glossary for the judicial domain as prior knowledge, which avoids the loss of logical expressions and domain terminology caused by out-of-domain data. Second, we collect instance data, such as judgment documents, from the official websites of Chinese courts in various regions. To avoid losing Tibetan lexical expressions and syntactic structures, we search for the Tibetan case data first and then for the Chinese. Following these principles, we build a high-quality Tibetan-Chinese parallel corpus through the following steps: corpus crawling, document segmentation and alignment detection, sentence boundary recognition, and automatic corpus cleaning. Finally, we construct a f 60,000 Ti-Zh parallel corpus of the judicial domain and evaluate its quality and robustness with a variety of translation models and cross-validation experiments. The corpus will be released as open source for other researchers to use in related work.
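The sentence boundary recognition, alignment detection, and cleaning steps named in the abstract might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the terminator sets, the equal-sentence-count alignment check, and the length-ratio thresholds are all assumptions made for the example.

```python
import re

# Assumption: Tibetan sentences end with a shad; Chinese with full-width punctuation.
TI_TERMINATORS = "།"
ZH_TERMINATORS = "。！？；"


def split_sentences(text, terminators):
    """Split text into sentences, keeping the trailing terminator(s)."""
    chars = re.escape(terminators)
    pattern = "[^" + chars + "]+[" + chars + "]*"
    return [s.strip() for s in re.findall(pattern, text) if s.strip()]


def align_document(ti_doc, zh_doc):
    """Pair sentences one-to-one, only if both sides have the same count.

    Hypothetical alignment check: documents whose sentence counts differ
    are treated as misaligned and yield no pairs.
    """
    ti_sents = split_sentences(ti_doc, TI_TERMINATORS)
    zh_sents = split_sentences(zh_doc, ZH_TERMINATORS)
    if len(ti_sents) != len(zh_sents):
        return []
    return list(zip(ti_sents, zh_sents))


def clean_pairs(pairs, min_ratio=0.5, max_ratio=10.0):
    """Drop empty, duplicate, and length-mismatched pairs.

    The ratio bounds are illustrative placeholders, not values from the paper.
    """
    seen = set()
    kept = []
    for ti, zh in pairs:
        ti, zh = ti.strip(), zh.strip()
        if not ti or not zh:
            continue
        ratio = len(ti) / len(zh)
        if not (min_ratio <= ratio <= max_ratio):
            continue
        if (ti, zh) in seen:
            continue
        seen.add((ti, zh))
        kept.append((ti, zh))
    return kept
```

In such a sketch, each crawled Tibetan document and its Chinese counterpart would pass through `align_document`, and the pooled pairs through `clean_pairs`. A production pipeline would likely need a more tolerant aligner than the equal-count check used here.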

Original language: Chinese (Traditional)
Pages: 499-508
Number of pages: 10
Publication status: Published - 2020
Event: 19th Chinese National Conference on Computational Linguistic, CCL 2020 - Haikou, China
Duration: 30 Oct 2020 to 1 Nov 2020

Conference

Conference: 19th Chinese National Conference on Computational Linguistic, CCL 2020
Country/Territory: China
City: Haikou
Period: 30/10/20 to 1/11/20
