Research on Uyghur morphological segmentation based on long sequence labeling method

Ruohao Yan, Huaping Zhang, Wushour Silamu, Askar Hamdulla

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

With the steady progress of the "One Belt, One Road" national cooperation initiative, the intelligent processing of languages along the route has become increasingly important for communication, and Uyghur is a representative agglutinative language. Uyghur words consist of stems and affixes, so the data is sparse. Morphological segmentation separates Uyghur roots from affixes to address this data sparsity. First, this paper studies the characteristics of the Uyghur morphological segmentation task and proposes a long sequence labeling method. Second, a BiLSTM network learns word-formation features, and a CRF model then learns label features. Finally, the paper proposes a new evaluation method. It reproduces relevant research and conducts experiments on the public THUUyMorph corpus, where the model achieves an F1 score of 98.60%. Experiments show that these results are better than those of current advanced Uyghur morphological segmentation models, and experiments on a downstream task, Uyghur-Chinese translation, confirm the method's effectiveness. The scheme can be transferred to other languages along the route, such as Turkish, and provides a new research idea for morphological segmentation.
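To make the idea of treating segmentation as sequence labeling concrete, the following is a minimal sketch and not the authors' method: the abstract does not spell out the long sequence labeling scheme, so a generic character-level B/I tagging is assumed here. The function names, the Latin-script example word, and the boundary-level F1 are all illustrative assumptions; in particular, the F1 shown is a common boundary metric, not the new evaluation method the paper proposes.

```python
from typing import List, Set


def morphemes_to_labels(morphemes: List[str]) -> List[str]:
    """Give every character a label: B starts a morpheme, I continues it.

    Assumed generic tagging scheme, not the paper's label set.
    """
    labels: List[str] = []
    for morpheme in morphemes:
        if not morpheme:
            continue  # skip empty pieces rather than emit a stray label
        labels.append("B")
        labels.extend(["I"] * (len(morpheme) - 1))
    return labels


def labels_to_morphemes(word: str, labels: List[str]) -> List[str]:
    """Rebuild the segmentation from per-character labels."""
    if len(word) != len(labels):
        raise ValueError("word and labels must have the same length")
    morphemes: List[str] = []
    for char, label in zip(word, labels):
        if label == "B" or not morphemes:
            morphemes.append(char)  # open a new morpheme
        else:
            morphemes[-1] += char   # extend the current morpheme
    return morphemes


def boundary_f1(gold: List[str], pred: List[str]) -> float:
    """F1 over internal morpheme boundaries (illustrative metric only)."""
    def boundaries(morphemes: List[str]) -> Set[int]:
        position, found = 0, set()
        for morpheme in morphemes[:-1]:
            position += len(morpheme)
            found.add(position)
        return found

    gold_b, pred_b = boundaries(gold), boundaries(pred)
    if not gold_b and not pred_b:
        return 1.0  # both agree the word is unsegmented
    true_pos = len(gold_b & pred_b)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_b)
    recall = true_pos / len(gold_b)
    return 2 * precision * recall / (precision + recall)


# Hypothetical Latin-script example: "kitablar" (books) = kitab + lar.
gold = ["kitab", "lar"]
labels = morphemes_to_labels(gold)
restored = labels_to_morphemes("kitablar", labels)
```

In a setup like the one the abstract describes, a BiLSTM-CRF tagger would predict the label sequence, and a decoding step such as `labels_to_morphemes` would turn it back into stems and affixes.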

Original language: English
Title of host publication: SPML 2022 - Proceedings of 2022 5th International Conference on Signal Processing and Machine Learning
Publisher: Association for Computing Machinery
Pages: 268-274
Number of pages: 7
ISBN (Electronic): 9781450396912
Publication status: Published - 4 Aug 2022
Event: 5th International Conference on Signal Processing and Machine Learning, SPML 2022 - Dalian, China
Duration: 4 Aug 2022 → 6 Aug 2022

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 5th International Conference on Signal Processing and Machine Learning, SPML 2022
Country/Territory: China
City: Dalian
Period: 4/08/22 → 6/08/22

Keywords

  • Uyghur
  • data sparsity
  • long sequence labeling method
  • morphological segmentation
