面向司法领域的高质量开源藏汉平行语料库构建

Jiu Sha; Luqin Zhou; Chong Feng; Hongzheng Li; Tianfu Zhang; Hui Hui

Research output: Conference contribution › Paper › Peer-reviewed

Abstract

To date, Tibetan-Chinese (Ti-Zh) machine translation in the judicial domain faces a severe data-sparsity problem. In this work, we address the problem from two aspects. 1) Judicial Tibetan requires more rigorous logical expression and more specialized terminology than Tibetan in the general (public) domain, yet there is hardly any high-quality judicial-domain Ti-Zh corpus that covers this terminology and syntactic structure. 2) Constructing a Ti-Zh parallel corpus is challenging because of the unique lexical expressions and specific syntactic structures involved. To this end, we propose a lightweight method for constructing a Ti-Zh parallel corpus for the judicial domain. First, we build a medium-scale Tibetan-Chinese judicial terminology glossary as prior knowledge, which avoids the loss of logical expressions and domain terminology caused by out-of-domain data. Second, we collect instance data, such as judgment documents, from the official websites of Chinese courts in various regions. To avoid losing Tibetan lexical expressions and syntactic structures, we search for the Tibetan case data first and then for the Chinese data. Based on these principles, we build a high-quality Tibetan-Chinese parallel corpus through the following steps: corpus crawling, document segmentation alignment detection, sentence boundary recognition, and automatic corpus cleaning. Finally, we construct a 60,000-scale Ti-Zh parallel corpus for the judicial domain and evaluate its quality and robustness with a variety of translation models and cross-validation experiments. The corpus will be released as open source for use by other researchers in related work.
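The abstract names sentence boundary recognition and automatic corpus cleaning without giving details. The following is a minimal sketch of what such steps might look like, not the authors' implementation. It assumes that a Tibetan sentence ends at a shad (།) or double shad (༎) followed by whitespace, which is a simplification because the shad also marks clause boundaries. It also assumes that cleaning means dropping empty pairs, duplicate pairs, and pairs with an implausible character-length ratio. The names `split_sentences` and `clean_pairs` and the ratio thresholds are hypothetical.

```python
import re

# Assumption: a shad or double shad followed by whitespace ends a Tibetan sentence.
TIBETAN_END = re.compile(r"(?<=[།༎])\s+")
# Assumption: these full-width marks end a Chinese sentence.
CHINESE_END = re.compile(r"(?<=[。！？；])")


def split_sentences(text, pattern):
    """Split text at the boundaries matched by pattern, dropping empty pieces."""
    return [piece.strip() for piece in pattern.split(text) if piece.strip()]


def clean_pairs(pairs, min_ratio=0.5, max_ratio=10.0):
    """Filter (tibetan, chinese) pairs with simple, hypothetical heuristics.

    A pair is dropped if either side is empty, if the Tibetan/Chinese
    character-length ratio falls outside [min_ratio, max_ratio], or if
    the same pair has already been kept.
    """
    seen = set()
    kept = []
    for ti, zh in pairs:
        ti, zh = ti.strip(), zh.strip()
        if not ti or not zh:
            continue
        ratio = len(ti) / len(zh)
        if ratio < min_ratio or ratio > max_ratio:
            continue
        if (ti, zh) in seen:
            continue
        seen.add((ti, zh))
        kept.append((ti, zh))
    return kept
```

A real pipeline would likely tune the ratio bounds on held-out aligned data and could use the terminology glossary as an additional consistency check. Both are left out here because the abstract does not describe them.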

Translated title of the contribution	A High-quality Open Source Tibetan-Chinese Parallel Corpus Construction of Judicial Domain
Original language	Chinese (Traditional)
Pages	499-508
Number of pages	10
Publication status	Published - 2020
Event	19th Chinese National Conference on Computational Linguistics, CCL 2020 - Haikou, China. Duration: 30 Oct 2020 → 1 Nov 2020

Conference

Conference	19th Chinese National Conference on Computational Linguistics, CCL 2020
Country/Territory	China
City	Haikou
Period	30/10/20 → 1/11/20

Keywords

Data-sparse
Judicial domain
Tibetan-Chinese parallel corpus

Cite this

@conference{8dafb15823c646cfbe39746a6fa33a0b,

title = "面向司法领域的高质量开源藏汉平行语料库构建",

keywords = "Data-sparse, Judicial domain, Tibetan-Chinese parallel corpus",

author = "Jiu Sha and Luqin Zhou and Chong Feng and Hongzheng Li and Tianfu Zhang and Hui Hui",

note = "Publisher Copyright: {\textcopyright} 2020 China National Conference on Computational Linguistics Published under Creative Commons Attribution 4.0 International License; 19th Chinese National Conference on Computational Linguistics, CCL 2020 ; Conference date: 30-10-2020 Through 01-11-2020",

year = "2020",

language = "Chinese (Traditional)",

pages = "499--508",

}

TY - CONF

T1 - 面向司法领域的高质量开源藏汉平行语料库构建

AU - Sha, Jiu

AU - Zhou, Luqin

AU - Feng, Chong

AU - Li, Hongzheng

AU - Zhang, Tianfu

AU - Hui, Hui

PY - 2020

Y1 - 2020

KW - Data-sparse

KW - Judicial domain

KW - Tibetan-Chinese parallel corpus

UR - http://www.scopus.com/inward/record.url?scp=85123953894&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:85123953894

SP - 499

EP - 508

T2 - 19th Chinese National Conference on Computational Linguistics, CCL 2020

Y2 - 30 October 2020 through 1 November 2020

ER -
