TY - GEN
T1 - A two-stage approach for generating topic models
AU - Gao, Yang
AU - Xu, Yue
AU - Li, Yuefeng
AU - Liu, Bin
PY - 2013
Y1 - 2013
AB - Topic modeling has been widely used in information retrieval, text mining, and text classification. Most existing statistical topic modeling methods, such as LDA and pLSA, generate a term-based representation of a topic by selecting single words from the multinomial word distribution over that topic. This approach has two main shortcomings: first, popular or common words occur frequently across different topics, which makes the topics ambiguous; second, single words lack the coherent semantic meaning needed to represent topics accurately. To overcome these problems, we propose a two-stage model that combines text mining and pattern mining with statistical modeling to generate more discriminative and semantically rich topic representations. Experiments show that the optimized topic representations generated by the proposed methods outperform the typical statistical topic modeling method LDA in terms of accuracy and certainty.
KW - Entropy
KW - Tf-idf
KW - Frequent pattern mining
KW - Topic modeling
KW - Topic representation
UR - http://www.scopus.com/inward/record.url?scp=84893600265&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-37456-2_19
DO - 10.1007/978-3-642-37456-2_19
M3 - Conference contribution
AN - SCOPUS:84893600265
SN - 9783642374555
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 221
EP - 232
BT - Advances in Knowledge Discovery and Data Mining - 17th Pacific-Asia Conference, PAKDD 2013, Proceedings
T2 - 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2013
Y2 - 14 April 2013 through 17 April 2013
ER -