TY - JOUR
T1 - Chinese lexical analysis using cascaded hidden Markov model
AU - Liu, Qun
AU - Zhang, Hua Ping
AU - Yu, Hong Kui
AU - Cheng, Xue Qi
PY - 2004/8
Y1 - 2004/8
N2 - This paper presents an approach to Chinese lexical analysis based on a cascaded hidden Markov model (CHMM), which aims to integrate Chinese word segmentation, part-of-speech tagging, disambiguation, and unknown word recognition into a single theoretical framework. A class-based HMM is applied to word segmentation; in this model, unknown words are treated in the same way as common words listed in the lexicon. Unknown words are recognized reliably from role sequences tagged with the Viterbi algorithm in a role HMM. For disambiguation, the authors propose an n-shortest-path strategy that, at an early stage, retains the top N segmentation results as candidates and thereby covers more ambiguity. Experiments show that each level of the CHMM contributes to Chinese lexical analysis. A CHMM-based system, ICTCLAS, has been implemented. The system took the top rank in the official open evaluation held by the 973 project in 2002, and it achieved two first-place rankings and one second-place ranking in the first international word segmentation bakeoff held by SIGHAN (the ACL Special Interest Group on Chinese Language Processing) in 2003. These results indicate that ICTCLAS is one of the best Chinese lexical analyzers. In short, CHMM is effective for Chinese lexical analysis.
AB - This paper presents an approach to Chinese lexical analysis based on a cascaded hidden Markov model (CHMM), which aims to integrate Chinese word segmentation, part-of-speech tagging, disambiguation, and unknown word recognition into a single theoretical framework. A class-based HMM is applied to word segmentation; in this model, unknown words are treated in the same way as common words listed in the lexicon. Unknown words are recognized reliably from role sequences tagged with the Viterbi algorithm in a role HMM. For disambiguation, the authors propose an n-shortest-path strategy that, at an early stage, retains the top N segmentation results as candidates and thereby covers more ambiguity. Experiments show that each level of the CHMM contributes to Chinese lexical analysis. A CHMM-based system, ICTCLAS, has been implemented. The system took the top rank in the official open evaluation held by the 973 project in 2002, and it achieved two first-place rankings and one second-place ranking in the first international word segmentation bakeoff held by SIGHAN (the ACL Special Interest Group on Chinese Language Processing) in 2003. These results indicate that ICTCLAS is one of the best Chinese lexical analyzers. In short, CHMM is effective for Chinese lexical analysis.
KW - Cascaded hidden Markov model
KW - Chinese lexical analysis
KW - ICTCLAS
KW - POS tagging
KW - Unknown words recognition
KW - Word segmentation
UR - http://www.scopus.com/inward/record.url?scp=5644272866&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:5644272866
SN - 1000-1239
VL - 41
SP - 1421
EP - 1429
JO - Jisuanji Yanjiu yu Fazhan/Computer Research and Development
JF - Jisuanji Yanjiu yu Fazhan/Computer Research and Development
IS - 8
ER -