A pragmatic model for new Chinese word extraction

Haijun Zhang*, Heyan Huang, Chaoyong Zhu, Shumin Shi

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Citations (Scopus)

Abstract

This paper proposes a pragmatic model for repeat-based Chinese New Word Extraction (NWE). It contains two innovations. The first is a formal description of the NWE process, which gives theoretical guidance on feature selection. On this basis, the Conditional Random Fields (CRF) model is selected as the statistical framework for solving the formal description. The second is an improved algorithm for computing left (right) entropy, which improves the efficiency of NWE. Compared with the baseline algorithm, the improved algorithm markedly speeds up entropy computation. Overall, experiments show that the proposed model is effective: the F-score is 49.72% in the open test and 69.83% in word extraction, an evident improvement over previous similar work.
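To make the "left (right) entropy" feature concrete, here is a minimal Python sketch of the plain, unoptimized computation: the entropy of the characters that appear immediately to the left (or right) of each occurrence of a candidate string. It is an illustration only, not the improved algorithm the abstract refers to, whose details are not given here. The function name `branching_entropy` and its parameters are hypothetical.

```python
import math
from collections import Counter


def branching_entropy(text, candidate, side="left"):
    """Entropy (in bits) of the characters adjacent to each occurrence
    of `candidate` in `text`. Hypothetical helper for illustration;
    this is a naive scan, not the paper's improved algorithm."""
    neighbours = Counter()
    start = text.find(candidate)
    while start != -1:
        # Index of the neighbouring character on the requested side.
        idx = start - 1 if side == "left" else start + len(candidate)
        if 0 <= idx < len(text):
            neighbours[text[idx]] += 1
        # Advance by one so overlapping occurrences are also counted.
        start = text.find(candidate, start + 1)
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in neighbours.values())


# "bx" occurs three times, preceded by three different characters.
left = branching_entropy("abxcbxdbx", "bx", side="left")
right = branching_entropy("abxcbxdbx", "bx", side="right")
```

A candidate whose neighbours vary widely on both sides (high left and right entropy) is more likely to be a free-standing word, which is why the feature is useful for repeat-based extraction. The naive scan above revisits the text for every candidate, which is the kind of cost an improved algorithm would aim to reduce.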

Original language: English
Title of host publication: Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2010
Publication status: Published - 2010
Event: 6th International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2010 - Beijing, China
Duration: 21 Aug 2010 to 23 Aug 2010

Publication series

Name: Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2010

Conference

Conference: 6th International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2010
Country/Territory: China
City: Beijing
Period: 21/08/10 to 23/08/10

Keywords

  • Computational efficiency
  • Formal description
  • Left (right) entropy
  • New words extraction
  • Repeat
