Application of Conditional Random Fields model in Unknown Words Identification

Hai Jun Zhang*, Wei Min Pan, Shu Min Shi, Chao Yong Zhu

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Citations (Scopus)

Abstract

This paper proposes a method for Unknown Words Identification (UWI) based on repeats. To ground the identification of unknown words in theory, we put forward a formal model of the UWI process, which guides the selection of features used in UWI. We propose employing the Conditional Random Fields (CRF) model as the statistical framework for solving this formal model. Under this framework, UWI becomes the task of finding effective features that capture the essential properties of unknown words. Experiments show that the method is effective and that a reasonable combination of the features used in the CRF clearly improves UWI results. The final F scores of the method are 47.81% in the open test and 69.83% in word extraction, both better than the best results reported in previous work.
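As a rough illustration of two ideas the abstract names, repeats as candidate unknown words and the F score used for evaluation, the following is a minimal Python sketch. The abstract does not say how repeats are extracted, so the function `extract_repeats`, its length and count thresholds, and the sample string are all assumptions made for illustration, not the authors' procedure. `f_score` is the standard harmonic mean of precision and recall.

```python
from collections import Counter


def extract_repeats(text, min_len=2, max_len=4, min_count=2):
    """Return substrings that occur at least `min_count` times.

    Hypothetical helper: the thresholds are illustrative assumptions,
    not values taken from the paper.
    """
    # Count every substring whose length is between min_len and max_len.
    counts = Counter(
        text[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    # Keep only the substrings that repeat often enough.
    return {s: c for s, c in counts.items() if c >= min_count}


def f_score(precision, recall):
    """Balanced F score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy input; real use would pass unsegmented Chinese text.
    candidates = extract_repeats("abcabc", min_len=2, max_len=3)
    print(sorted(candidates))
    print(f_score(0.5, 0.5))
```

In a CRF-based pipeline of the kind the abstract describes, candidates like these would only be a starting point: the features computed for each candidate and the model that labels it do the actual identification.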

Original language: English
Title of host publication: 2010 International Conference on Machine Learning and Cybernetics, ICMLC 2010
Pages: 1839-1843
Number of pages: 5
DOIs
Publication status: Published - 2010
Event: 2010 International Conference on Machine Learning and Cybernetics, ICMLC 2010 - Qingdao, China
Duration: 11 Jul 2010 → 14 Jul 2010

Publication series

Name: 2010 International Conference on Machine Learning and Cybernetics, ICMLC 2010
Volume: 4

Conference

Conference: 2010 International Conference on Machine Learning and Cybernetics, ICMLC 2010
Country/Territory: China
City: Qingdao
Period: 11/07/10 → 14/07/10

Keywords

  • CRF
  • Chinese word segmentation
  • Feature combination
  • Repeats
  • Unknown Words Identification
