TY - JOUR
T1 - A novel unsupervised method for new word extraction
AU - Mei, Lili
AU - Huang, Heyan
AU - Wei, Xiaochi
AU - Mao, Xianling
N1 - Publisher Copyright: © 2016, Science China Press and Springer-Verlag Berlin Heidelberg.
PY - 2016/9/1
Y1 - 2016/9/1
N2 - New words can benefit many NLP tasks, such as sentence chunking and sentiment analysis. However, automatic new word extraction is challenging because new words usually follow no fixed linguistic pattern and may even appear as new meanings of existing words. To tackle these problems, this paper proposes a novel method for extracting new words. It not only considers domain specificity but also combines multiple kinds of statistical language knowledge. First, we apply a filtering algorithm to obtain a candidate list of new words. Then, we use the statistical language knowledge to extract the top-ranked new words. Experimental results show that the proposed method extracts a large number of new words from both Chinese and English corpora and notably outperforms state-of-the-art methods. Moreover, we demonstrate that our method increases the accuracy of Chinese word segmentation by 10% on a corpus containing new words.
AB - New words can benefit many NLP tasks, such as sentence chunking and sentiment analysis. However, automatic new word extraction is challenging because new words usually follow no fixed linguistic pattern and may even appear as new meanings of existing words. To tackle these problems, this paper proposes a novel method for extracting new words. It not only considers domain specificity but also combines multiple kinds of statistical language knowledge. First, we apply a filtering algorithm to obtain a candidate list of new words. Then, we use the statistical language knowledge to extract the top-ranked new words. Experimental results show that the proposed method extracts a large number of new words from both Chinese and English corpora and notably outperforms state-of-the-art methods. Moreover, we demonstrate that our method increases the accuracy of Chinese word segmentation by 10% on a corpus containing new words.
KW - domain specificity
KW - domain word extraction
KW - new word extraction
KW - statistical language knowledge
KW - word segmentation
UR - http://www.scopus.com/inward/record.url?scp=84983680573&partnerID=8YFLogxK
U2 - 10.1007/s11432-015-0906-9
DO - 10.1007/s11432-015-0906-9
M3 - Article
AN - SCOPUS:84983680573
SN - 1674-733X
VL - 59
JO - Science China Information Sciences
JF - Science China Information Sciences
IS - 9
M1 - 92102
ER -