基于均值漂移算法的文本聚类数目优化研究

Translated title of the contribution: Determining Best Text Clustering Number with Mean Shift Algorithm

Huaming Zhao*, Li Yu, Qiang Zhou

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

3 Citations (Scopus)

Abstract

[Objective] This paper explores a method for determining the optimal number of text clusters, aiming to improve the effectiveness of related algorithms. [Methods] First, we combined the TF-IDF and Word2Vec algorithms to extract the top-N keyword vectors as the feature representation of the texts in the corpus. Then, we determined the optimal number of text clusters with the mean shift algorithm, a clustering validity index (Silhouette) and the mean square error (MSE) index. [Results] We found that the top 4500 keyword vectors could better represent the text features. The optimal number of clusters found by the mean shift algorithm matched the manually optimized results. [Limitations] The size of the experimental data sets needs to be expanded. Our results should be compared with those of other applications. [Conclusions] The proposed method could effectively determine the optimal number of text clusters in an unsupervised way.
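As a rough illustration of how mean shift can suggest a cluster count without fixing it in advance, the following is a minimal sketch and not the authors' implementation. It assumes a flat (uniform) kernel, toy 2-D points in place of the TF-IDF/Word2Vec keyword vectors, and an arbitrarily chosen bandwidth. The function names `mean_shift_modes`, `merge_modes` and `estimate_cluster_count` are hypothetical.

```python
import math


def mean_shift_modes(points, bandwidth, max_iter=100, tol=1e-6):
    """Shift each point to the mean of its neighbours until it stops moving.

    Assumption: flat kernel, i.e. every point within `bandwidth` counts equally.
    """
    modes = []
    for p in points:
        current = list(p)
        for _ in range(max_iter):
            neighbours = [q for q in points if math.dist(current, q) <= bandwidth]
            if not neighbours:  # defensive guard against floating-point ties
                break
            shifted = [sum(coord) / len(neighbours) for coord in zip(*neighbours)]
            moved = math.dist(shifted, current)
            current = shifted
            if moved < tol:
                break
        modes.append(current)
    return modes


def merge_modes(modes, bandwidth):
    """Collapse converged positions closer than half the bandwidth into one center."""
    centers = []
    for m in modes:
        if all(math.dist(m, c) > bandwidth / 2 for c in centers):
            centers.append(m)
    return centers


def estimate_cluster_count(points, bandwidth):
    """Number of distinct density modes found by mean shift."""
    return len(merge_modes(mean_shift_modes(points, bandwidth), bandwidth))


# Toy data (assumed for illustration): three well-separated groups in 2-D.
points = [
    (0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (0.1, -0.2),
    (5.0, 5.0), (5.2, 4.9), (4.9, 5.1), (5.1, 5.2),
    (10.0, 0.0), (10.1, 0.2), (9.8, -0.1), (10.2, 0.1),
]

k = estimate_cluster_count(points, bandwidth=1.5)
print("estimated number of clusters:", k)
```

The estimate depends strongly on the bandwidth, which is why a validity index such as Silhouette or MSE, as named in the abstract, is a natural cross-check on the resulting count.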

Original language: Chinese (Traditional)
Pages (from-to): 27-35
Number of pages: 9
Journal: Data Analysis and Knowledge Discovery
Volume: 3
Issue number: 9
Publication status: Published - Sept 2019
Externally published: Yes
