TY - JOUR
T1 - Labeled Phrase Latent Dirichlet Allocation and its online learning algorithm
AU - Tang, Yi Kun
AU - Mao, Xian Ling
AU - Huang, Heyan
N1 - Publisher Copyright:
© 2018, The Author(s).
PY - 2018/7/1
Y1 - 2018/7/1
N2 - There is a mass of user-marked text data on the Internet, such as web pages with categories, papers with corresponding keywords, and tweets with hashtags. In recent years, supervised topic models such as Labeled Latent Dirichlet Allocation have been widely used to discover the abstract topics in labeled text corpora. However, under the bag-of-words assumption, none of these topic models takes word order into consideration, which loses substantial semantic information. In this paper, in order to jointly model semantic label information and word order, we propose a novel topic model, called Labeled Phrase Latent Dirichlet Allocation (LPLDA), which regards each document as a mixture of phrases and thus partly preserves word order. To estimate the parameters of the proposed LPLDA model, we develop a batch inference algorithm based on the Gibbs sampling technique. Moreover, to accelerate LPLDA’s processing speed on large-scale streaming data, we further propose an online inference algorithm for LPLDA. Extensive experiments were conducted comparing LPLDA with four state-of-the-art baselines. The results show that (1) batch LPLDA significantly outperforms the baselines in terms of case study, perplexity, and scalability, as well as a third-party task in most cases; and (2) the online algorithm for LPLDA is considerably more efficient than the batch method while still producing good results.
AB - There is a mass of user-marked text data on the Internet, such as web pages with categories, papers with corresponding keywords, and tweets with hashtags. In recent years, supervised topic models such as Labeled Latent Dirichlet Allocation have been widely used to discover the abstract topics in labeled text corpora. However, under the bag-of-words assumption, none of these topic models takes word order into consideration, which loses substantial semantic information. In this paper, in order to jointly model semantic label information and word order, we propose a novel topic model, called Labeled Phrase Latent Dirichlet Allocation (LPLDA), which regards each document as a mixture of phrases and thus partly preserves word order. To estimate the parameters of the proposed LPLDA model, we develop a batch inference algorithm based on the Gibbs sampling technique. Moreover, to accelerate LPLDA’s processing speed on large-scale streaming data, we further propose an online inference algorithm for LPLDA. Extensive experiments were conducted comparing LPLDA with four state-of-the-art baselines. The results show that (1) batch LPLDA significantly outperforms the baselines in terms of case study, perplexity, and scalability, as well as a third-party task in most cases; and (2) the online algorithm for LPLDA is considerably more efficient than the batch method while still producing good results.
KW - Batch Labeled Phrase LDA
KW - Labeled Phrase LDA
KW - Online Labeled Phrase LDA
KW - Topic model
UR - http://www.scopus.com/inward/record.url?scp=85042628100&partnerID=8YFLogxK
U2 - 10.1007/s10618-018-0555-0
DO - 10.1007/s10618-018-0555-0
M3 - Article
AN - SCOPUS:85042628100
SN - 1384-5810
VL - 32
SP - 885
EP - 912
JO - Data Mining and Knowledge Discovery
JF - Data Mining and Knowledge Discovery
IS - 4
ER -