Research on Uyghur morphological segmentation based on long sequence labeling method

Ruohao Yan, Huaping Zhang, Wushour Silamu, Askar Hamdulla

Research output: Chapter in book/report/conference proceeding, conference contribution, peer-reviewed

Abstract

With the steady progress of the "One Belt, One Road" cooperation initiative, the intelligent processing of languages along the route has become increasingly important for communication, and Uyghur is a representative agglutinative language. Uyghur words are composed of stems and affixes, so the data are sparse. Morphological segmentation separates Uyghur roots from affixes to alleviate this data sparseness. First, this paper studies the characteristics of the Uyghur morphological segmentation task and proposes a long sequence labeling method. Second, a BiLSTM network learns word-formation features, and a CRF model then learns label features. Finally, the paper proposes a new evaluation method. It reproduces related work and conducts experiments on the public THUUyMorph corpus, where the model reaches an F1 score of 98.60%. The experiments show that these results are better than those of current advanced Uyghur morphological segmentation models, and experiments on the downstream task of Uyghur-Chinese translation demonstrate the method's effectiveness. The scheme can be transferred to other languages along the route, such as Turkish, which provides a new research direction for morphological segmentation.
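As a rough illustration of how morphological segmentation can be cast as sequence labeling, the sketch below converts a segmented word into one tag per character and back again. It is a minimal sketch under stated assumptions: the B/M/E/S tag set is a generic character-tagging scheme, not necessarily the "long sequence labeling" scheme the paper proposes, and the function names `encode_labels` and `decode_labels` are hypothetical. In a BiLSTM-CRF setup, tags of this kind would be what the network is trained to predict.

```python
from typing import List


def encode_labels(morphemes: List[str]) -> List[str]:
    """Turn a list of morphemes into one B/M/E/S tag per character.

    Assumed generic scheme (not taken from the paper):
    B = first character of a morpheme, M = middle, E = last,
    S = a morpheme that is a single character.
    """
    tags: List[str] = []
    for morpheme in morphemes:
        if len(morpheme) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(morpheme) - 2) + ["E"])
    return tags


def decode_labels(word: str, tags: List[str]) -> List[str]:
    """Recover the morphemes from a word and its per-character tags."""
    if len(word) != len(tags):
        raise ValueError("word and tag sequence must have the same length")
    morphemes: List[str] = []
    current = ""
    for char, tag in zip(word, tags):
        current += char
        # E and S both close the current morpheme.
        if tag in ("E", "S"):
            morphemes.append(current)
            current = ""
    if current:
        # Tolerate a tag sequence that ends without a closing tag.
        morphemes.append(current)
    return morphemes


# Turkish example: "evlerde" = ev (house) + ler (plural) + de (locative).
segments = ["ev", "ler", "de"]
tags = encode_labels(segments)
restored = decode_labels("".join(segments), tags)
```

Here `tags` holds one label per character of "evlerde", and `restored` recovers the original three morphemes, so the tagging loses no segmentation information.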

Original language: English
Title of host publication: SPML 2022 - Proceedings of 2022 5th International Conference on Signal Processing and Machine Learning
Publisher: Association for Computing Machinery
Pages: 268-274
Number of pages: 7
ISBN (electronic): 9781450396912
Publication status: Published - 4 Aug 2022
Event: 5th International Conference on Signal Processing and Machine Learning, SPML 2022 - Dalian, China
Duration: 4 Aug 2022 to 6 Aug 2022

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 5th International Conference on Signal Processing and Machine Learning, SPML 2022
Country/Territory: China
City: Dalian
Period: 4/08/22 to 6/08/22
