Abstract
To date, Tibetan-Chinese (Ti-Zh) machine translation in the judicial domain faces a severe data-sparsity problem. In this work, we examine the problem from two aspects: 1) Judicial Tibetan requires more rigorous logical expression and more specialized terminology than general-domain Tibetan, yet there is hardly any high-quality judicial-domain Ti-Zh corpus that contains such terminology and syntactic structures. 2) Constructing a Ti-Zh parallel corpus is challenging because of the unique lexical expressions and specific syntactic structures involved. To this end, we propose a lightweight method for constructing a Ti-Zh parallel corpus for the judicial domain. First, we build a medium-scale Tibetan-Chinese glossary of judicial terminology as prior knowledge, which helps avoid the loss of logical expressions and domain terminology caused by the out-of-domain phenomenon. Second, we collect instance data, such as judgment documents, from the official websites of Chinese courts in various regions. To avoid losing Tibetan lexical expressions and syntactic structures, we first search for Tibetan case data and then for Chinese case data. Based on these principles, we build a high-quality Tibetan-Chinese parallel corpus through the following steps: corpus crawling, document segmentation and alignment detection, sentence boundary recognition, and automatic corpus cleaning. Finally, we construct a judicial-domain Ti-Zh parallel corpus at the 60,000 scale and evaluate its quality and robustness using a variety of translation models and cross-validation experiments. The corpus will be released as open source for other researchers to use in related research.
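The sentence boundary recognition and automatic cleaning steps named in the abstract could look roughly like the sketch below. It is illustrative only and is not the authors' implementation: the function names `split_sentences` and `clean_pairs`, the terminator sets, and the length-ratio thresholds are all assumptions. In particular, treating every Tibetan shad (།, U+0F0D) or nyis shad (༎, U+0F0E) as a sentence boundary is a simplification, since the shad also marks clause boundaries.

```python
import re

# Assumed terminator sets (not taken from the paper).
TIBETAN_TERMINATORS = "\u0F0D\u0F0E"   # shad and nyis shad
CHINESE_TERMINATORS = "。！？；"


def split_sentences(text, terminators):
    """Split text into sentences, keeping the trailing terminator(s)."""
    cls = re.escape(terminators)
    # One or more non-terminator characters, then any run of terminators.
    pattern = re.compile(f"[^{cls}]+[{cls}]*")
    return [m.strip() for m in pattern.findall(text) if m.strip()]


def clean_pairs(pairs, min_ratio=0.2, max_ratio=5.0):
    """Drop empty, badly length-mismatched, and duplicate (ti, zh) pairs.

    The ratio bounds are hypothetical defaults chosen for illustration.
    """
    seen = set()
    kept = []
    for ti, zh in pairs:
        ti, zh = ti.strip(), zh.strip()
        if not ti or not zh:
            continue  # one side is missing
        ratio = len(ti) / len(zh)
        if not (min_ratio <= ratio <= max_ratio):
            continue  # lengths differ too much to be a plausible translation
        if (ti, zh) in seen:
            continue  # exact duplicate
        seen.add((ti, zh))
        kept.append((ti, zh))
    return kept
```

In practice the candidate pairs passed to `clean_pairs` would come from an alignment step, and a judicial terminology glossary could serve as an additional filter; both are outside the scope of this sketch.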
Translated title of the contribution | A High-quality Open Source Tibetan-Chinese Parallel Corpus Construction of Judicial Domain |
---|---|
Original language | Traditional Chinese |
Pages | 499-508 |
Number of pages | 10 |
Publication status | Published - 2020 |
Event | 19th Chinese National Conference on Computational Linguistic, CCL 2020 - Haikou, China. Duration: 30 Oct 2020 → 1 Nov 2020 |
Conference
Conference | 19th Chinese National Conference on Computational Linguistic, CCL 2020 |
---|---|
Country/Territory | China |
City | Haikou |
Period | 30/10/20 → 1/11/20 |
Keywords
- Data-sparse
- Judicial domain
- Tibetan-Chinese parallel corpus