Semi-supervised text classification from unlabeled documents using class associated words

Hong Qi Han*, Dong Hua Zhu, Xue Feng Wang

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

    9 Citations (Scopus)

    Abstract

    Automatically classifying text documents is an important field in machine learning. Unsupervised text classification does not need training data but is often criticized for clustering blindly. Supervised text classification needs large quantities of labeled training data to achieve high accuracy. In practice, however, labeled samples are often difficult, expensive, or time-consuming to obtain. Meanwhile, unlabeled documents can be collected easily owing to the rapidly developing Internet. Class associated words are words that represent the subject of a class and provide prior knowledge of the classification for training a classifier. A learning algorithm based on the combination of Expectation-Maximization (EM) and a Naïve Bayes classifier is introduced to classify documents from fully unlabeled documents using class associated words. Experimental results show that it has good classification capability with high accuracy, especially for categories with small quantities of samples. In the algorithm, class associated words are used to set classification constraints during the learning process, restricting documents to their corresponding class labels and improving classification accuracy.
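
    The combination of EM and Naïve Bayes described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the constraints are applied, so the sketch assumes that a document whose class associated words point to exactly one class is pinned to that class throughout EM, while all other documents receive soft labels. The names seed_label, em_naive_bayes, alpha, and n_iter, and the toy documents, are hypothetical.

```python
import math
from collections import Counter


def seed_label(tokens, class_words):
    """Return the class whose associated words occur most often in tokens.

    Returns None when no associated word occurs or when classes tie.
    """
    hits = {c: sum(tokens.count(w) for w in words) for c, words in class_words.items()}
    best = max(hits.values())
    winners = [c for c, h in hits.items() if h == best]
    return winners[0] if best > 0 and len(winners) == 1 else None


def em_naive_bayes(docs, class_words, n_iter=10, alpha=1.0):
    """Label tokenized docs with multinomial Naive Bayes trained by EM.

    Assumption: seeded documents stay fixed in their class (the constraint);
    unseeded documents start with uniform class probabilities.
    """
    classes = sorted(class_words)
    vocab = sorted({w for d in docs for w in d})
    counts = [Counter(d) for d in docs]
    seeds = [seed_label(d, class_words) for d in docs]

    # Initial responsibilities: one-hot for seeded docs, uniform otherwise.
    resp = []
    for s in seeds:
        if s is not None:
            resp.append({c: 1.0 if c == s else 0.0 for c in classes})
        else:
            resp.append({c: 1.0 / len(classes) for c in classes})

    for _ in range(n_iter):
        # M-step: re-estimate class priors and word likelihoods (Laplace smoothing).
        prior = {
            c: (sum(r[c] for r in resp) + alpha) / (len(docs) + alpha * len(classes))
            for c in classes
        }
        word_tot = {c: Counter() for c in classes}
        for r, cnt in zip(resp, counts):
            for c in classes:
                for w, n in cnt.items():
                    word_tot[c][w] += r[c] * n
        log_like = {}
        for c in classes:
            denom = sum(word_tot[c].values()) + alpha * len(vocab)
            log_like[c] = {w: math.log((word_tot[c][w] + alpha) / denom) for w in vocab}

        # E-step: update soft labels of unseeded docs only.
        for i, cnt in enumerate(counts):
            if seeds[i] is not None:
                continue  # constraint: seeded docs keep their class
            scores = {
                c: math.log(prior[c]) + sum(n * log_like[c][w] for w, n in cnt.items())
                for c in classes
            }
            m = max(scores.values())
            z = sum(math.exp(s - m) for s in scores.values())
            resp[i] = {c: math.exp(scores[c] - m) / z for c in classes}

    return [max(r, key=r.get) for r in resp]


# Toy usage: two documents contain a class associated word, two do not.
docs = [
    ["goal", "match", "team"],
    ["election", "vote", "party"],
    ["team", "match", "coach"],
    ["party", "vote", "senate"],
]
class_words = {"sports": ["goal"], "politics": ["election"]}
labels = em_naive_bayes(docs, class_words)
print(labels)
```

    In this toy run, the two documents without any class associated word are labeled through the vocabulary they share with the seeded documents. Keeping seeded documents fixed is only one way to express such a constraint; a softer variant could instead bias the prior of those documents.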

    Original language: English
    Title of host publication: 2009 International Conference on Computers and Industrial Engineering, CIE 2009
    Publisher: IEEE Computer Society
    Pages: 1255-1260
    Number of pages: 6
    ISBN (Print): 9781424441365
    DOIs
    Publication status: Published - 2009
    Event: 2009 International Conference on Computers and Industrial Engineering, CIE 2009 - Troyes, France
    Duration: 6 Jul 2009 to 9 Jul 2009

    Publication series

    Name: 2009 International Conference on Computers and Industrial Engineering, CIE 2009

    Conference

    Conference: 2009 International Conference on Computers and Industrial Engineering, CIE 2009
    Country/Territory: France
    City: Troyes
    Period: 6/07/09 to 9/07/09

    Keywords

    • Class associated words
    • Expectation-maximization
    • Naïve Bayes
    • Semi-supervised
    • Text classification
