面向司法领域的高质量开源藏汉平行语料库构建 (A High-quality Open Source Tibetan-Chinese Parallel Corpus Construction of Judicial Domain)

Jiu Sha, Luqin Zhou, Chong Feng, Hongzheng Li, Tianfu Zhang, Hui Hui

Research output: Contribution to conference, Paper, peer-reviewed

Abstract

To date, Tibetan-Chinese (Ti-Zh) machine translation in the judicial domain faces a severe data sparsity problem. In this work, we examine the problem from two aspects. 1) Judicial Tibetan requires more rigorous logical expression and more specialized terminology than general-domain Tibetan, yet there is hardly any high-quality Ti-Zh corpus in the judicial domain that covers this terminology and syntactic structure. 2) Constructing a Ti-Zh parallel corpus is challenging because of the unique lexical expressions and specific syntactic structures involved. To this end, we propose a lightweight Ti-Zh parallel corpus construction method for the judicial domain. First, we construct a medium-scale Tibetan-Chinese terminology glossary for the judicial domain as prior knowledge, which avoids the loss of logical expressions and domain terminology caused by the out-of-domain phenomenon. Second, we collect instance data, such as judgment documents, from the official websites of Chinese courts in various regions. To avoid losing Tibetan lexical expressions and syntactic structures, we search for Tibetan case data first and then for the corresponding Chinese data. Based on these principles, we build a high-quality Tibetan-Chinese parallel corpus through the following steps: corpus crawling, document segmentation and alignment detection, sentence boundary recognition, and automatic corpus cleaning. Lastly, we construct a f 60,000 Ti-Zh parallel corpus of the judicial domain, and we evaluate its quality and robustness with a variety of translation models and cross-validation experiments. The corpus will also be released as open source for other researchers to use in related work.
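
The abstract names sentence boundary recognition and automatic corpus cleaning as pipeline steps but does not describe how they are implemented. The following is only a minimal sketch of what such steps could look like, not the authors' method. It assumes that sentences end at the Tibetan shad marks (U+0F0D, U+0F0E) or at common Chinese terminators, and that cleaning means dropping empty, duplicate, or badly length-mismatched pairs. The function names `split_sentences` and `clean_pairs` and the ratio thresholds are hypothetical.

```python
import re

# Assumed terminators; the abstract does not state the actual rules.
# Note: the shad can also mark clause breaks, so this split is approximate.
TIBETAN_END = re.compile(r"[\u0F0D\u0F0E]+")  # shad and nyis shad
CHINESE_END = re.compile(r"[。！？；]+")


def split_sentences(text, pattern):
    """Split text after each run of terminators, keeping the terminator."""
    sentences = []
    start = 0
    for match in pattern.finditer(text):
        sentence = text[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def clean_pairs(pairs, min_ratio=0.2, max_ratio=5.0):
    """Drop empty, duplicate, or length-mismatched (Tibetan, Chinese) pairs.

    The character-length ratio bounds are illustrative defaults only.
    """
    seen = set()
    kept = []
    for ti, zh in pairs:
        ti, zh = ti.strip(), zh.strip()
        if not ti or not zh:
            continue
        ratio = len(ti) / len(zh)
        if not (min_ratio <= ratio <= max_ratio):
            continue
        if (ti, zh) in seen:
            continue
        seen.add((ti, zh))
        kept.append((ti, zh))
    return kept
```

A real pipeline would likely need language-aware rules on top of this, since punctuation alone does not reliably separate sentences in legal text.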

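Translated title of the contribution: A High-quality Open Source Tibetan-Chinese Parallel Corpus Construction of Judicial Domain
Original language: Chinese (Traditional)
Pages (from-to): 499-508
Number of pages: 10
Publication status: Published - 2020
Event: 19th Chinese National Conference on Computational Linguistics, CCL 2020 - Haikou, China
Duration: 30 Oct 2020 to 1 Nov 2020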

Conference

Conference: 19th Chinese National Conference on Computational Linguistics, CCL 2020
Country/Territory: China
City: Haikou
Period: 30/10/20 to 1/11/20

Keywords

  • Data-sparse
  • Judicial domain
  • Tibetan-Chinese parallel corpus
