Semi-supervised text classification from unlabeled documents using class associated words

Hong Qi Han*, Dong Hua Zhu, Xue Feng Wang

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

    9 Citations (Scopus)

    Abstract

    Automatically classifying text documents is an important field in machine learning. Unsupervised text classification does not need training data but is often criticized for clustering blindly. Supervised text classification needs large quantities of labeled training data to achieve high accuracy. In practice, however, labeled samples are often difficult, expensive, or time-consuming to obtain. Meanwhile, unlabeled documents can be collected easily owing to the rapidly developing Internet. Class associated words are words that represent the subject of a class and provide prior knowledge of classification for training a classifier. A learning algorithm, based on the combination of Expectation-Maximization (EM) and a Naïve Bayes classifier, is introduced to classify documents from fully unlabeled documents using class associated words. Experimental results show that it has good classification capability with high accuracy, especially for categories with small quantities of samples. In the algorithm, class associated words are used to set classification constraints during the learning process, restricting documents to the corresponding class labels and improving classification accuracy.
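The combination of EM and Naïve Bayes described in the abstract can be illustrated with a minimal sketch. This is one plausible reading of the abstract, not the authors' implementation: the function name `em_naive_bayes`, the input format (token lists plus a dict of class associated words), the Laplace smoothing constant `alpha`, and the rule that pins a document to a class when it contains associated words of exactly one class are all assumptions made for illustration.

```python
import math
from collections import Counter


def em_naive_bayes(docs, seed_words, n_iter=20, alpha=1.0):
    """EM over a multinomial Naive Bayes model, seeded by class associated words.

    docs: list of token lists (all unlabeled).
    seed_words: dict mapping class name -> set of class associated words.
    Both formats are hypothetical; the paper may define them differently.
    """
    classes = sorted(seed_words)
    vocab = sorted({w for doc in docs for w in doc})
    counts = [Counter(doc) for doc in docs]

    # Initialisation: a document containing associated words of exactly one
    # class is pinned to that class (the assumed "classification constraint");
    # every other document starts with a uniform class distribution.
    pinned, post = [], []
    for cnt in counts:
        hit = [c for c in classes if any(cnt[w] > 0 for w in seed_words[c])]
        pin = hit[0] if len(hit) == 1 else None
        pinned.append(pin)
        if pin is None:
            post.append({c: 1.0 / len(classes) for c in classes})
        else:
            post.append({c: 1.0 if c == pin else 0.0 for c in classes})

    for _ in range(n_iter):
        # M-step: re-estimate class priors and word probabilities from the
        # current soft labels, with Laplace smoothing.
        prior, word_prob = {}, {}
        for c in classes:
            mass = sum(p[c] for p in post)
            prior[c] = (mass + alpha) / (len(docs) + alpha * len(classes))
            weighted = Counter()
            for cnt, p in zip(counts, post):
                for w, n in cnt.items():
                    weighted[w] += p[c] * n
            total = sum(weighted.values()) + alpha * len(vocab)
            word_prob[c] = {w: (weighted[w] + alpha) / total for w in vocab}

        # E-step: recompute class posteriors for unpinned documents only.
        for i, cnt in enumerate(counts):
            if pinned[i] is not None:
                continue
            log_p = {
                c: math.log(prior[c])
                + sum(n * math.log(word_prob[c][w]) for w, n in cnt.items())
                for c in classes
            }
            top = max(log_p.values())
            norm = sum(math.exp(v - top) for v in log_p.values())
            post[i] = {c: math.exp(log_p[c] - top) / norm for c in classes}

    return [max(classes, key=lambda c: p[c]) for p in post]


# Toy corpus (invented for illustration, not taken from the paper).
docs = [
    ["goal", "match", "team"],
    ["team", "player", "score"],
    ["election", "vote", "party"],
    ["party", "minister", "policy"],
    ["player", "score", "coach"],
    ["minister", "policy", "vote"],
]
seed_words = {"politics": {"election", "vote"}, "sports": {"goal", "match"}}
labels = em_naive_bayes(docs, seed_words)
print(labels)
```

In this sketch, documents without any class associated word receive their labels only through words they share with pinned documents, which is how the prior knowledge spreads across the unlabeled collection.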

    Original language: English
    Title of host publication: 2009 International Conference on Computers and Industrial Engineering, CIE 2009
    Publisher: IEEE Computer Society
    Pages: 1255-1260
    Number of pages: 6
    ISBN (Print): 9781424441365
    DOI
    Publication status: Published - 2009
    Event: 2009 International Conference on Computers and Industrial Engineering, CIE 2009 - Troyes, France
    Duration: 6 Jul 2009 to 9 Jul 2009

    Publication series

    Name: 2009 International Conference on Computers and Industrial Engineering, CIE 2009

    Conference

    Conference: 2009 International Conference on Computers and Industrial Engineering, CIE 2009
    Country/Territory: France
    City: Troyes
    Period: 6/07/09 to 9/07/09
