TY - GEN
T1 - Research on Uyghur morphological segmentation based on long sequence labeling method
AU - Yan, Ruohao
AU - Zhang, Huaping
AU - Silamu, Wushour
AU - Hamdulla, Askar
N1 - Publisher Copyright: © 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2022/8/4
Y1 - 2022/8/4
N2 - With the steady progress of the "One Belt, One Road" national cooperation initiative, the intelligent processing of languages along the route has become increasingly important for communication, and Uyghur is a representative agglutinative language. Uyghur words consist of stems and affixes, and the resulting data is sparse. Morphological segmentation separates Uyghur roots and affixes to address this data sparsity. First, this paper studies the characteristics of the Uyghur morphological segmentation task and proposes a long sequence labeling method. Second, BiLSTM networks learn word-formation features, and a CRF model then learns label features. Finally, the paper proposes a new evaluation method. The paper reproduces related research and conducts experiments on the public THUUyMorph corpus, where the model achieves an F1 score of 98.60%. Experiments show that these results are better than those of current advanced Uyghur morphological segmentation models, and experiments on the downstream task of Uyghur-Chinese translation demonstrate the method's effectiveness. The scheme can be transferred to other languages along the route, such as Turkish, and offers a new research direction for morphological segmentation.
AB - With the steady progress of the "One Belt, One Road" national cooperation initiative, the intelligent processing of languages along the route has become increasingly important for communication, and Uyghur is a representative agglutinative language. Uyghur words consist of stems and affixes, and the resulting data is sparse. Morphological segmentation separates Uyghur roots and affixes to address this data sparsity. First, this paper studies the characteristics of the Uyghur morphological segmentation task and proposes a long sequence labeling method. Second, BiLSTM networks learn word-formation features, and a CRF model then learns label features. Finally, the paper proposes a new evaluation method. The paper reproduces related research and conducts experiments on the public THUUyMorph corpus, where the model achieves an F1 score of 98.60%. Experiments show that these results are better than those of current advanced Uyghur morphological segmentation models, and experiments on the downstream task of Uyghur-Chinese translation demonstrate the method's effectiveness. The scheme can be transferred to other languages along the route, such as Turkish, and offers a new research direction for morphological segmentation.
KW - Uyghur
KW - data sparsity
KW - long sequence labeling method
KW - morphological segmentation
UR - http://www.scopus.com/inward/record.url?scp=85141360830&partnerID=8YFLogxK
U2 - 10.1145/3556384.3556425
DO - 10.1145/3556384.3556425
M3 - Conference contribution
AN - SCOPUS:85141360830
T3 - ACM International Conference Proceeding Series
SP - 268
EP - 274
BT - SPML 2022 - Proceedings of 2022 5th International Conference on Signal Processing and Machine Learning
PB - Association for Computing Machinery
T2 - 5th International Conference on Signal Processing and Machine Learning, SPML 2022
Y2 - 4 August 2022 through 6 August 2022
ER -