Determining Best Text Clustering Number with Mean Shift Algorithm (基于均值漂移算法的文本聚类数目优化研究)

Huaming Zhao*, Li Yu, Qiang Zhou

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-reviewed

3 Citations (Scopus)

Abstract

[Objective] This paper explores a method for determining the optimal number of text clusters, aiming to improve the effectiveness of related algorithms. [Methods] First, we combined the TF-IDF and Word2Vec algorithms to extract the top-N keyword vectors as the feature representation of the texts in the corpus. Then, we determined the optimal number of text clusters with the mean shift algorithm, a clustering validity index (Silhouette), and the mean square error (MSE) index. [Results] We found that the top 4,500 keyword vectors could better represent the text features. The optimal number of clusters found by the mean shift algorithm matched the manually optimized results. [Limitations] The experimental data sets need to be expanded. Our results should be compared with those of other applications. [Conclusions] The proposed method could effectively determine the optimal number of text clusters in an unsupervised way.
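
The clustering step described in [Methods] can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a flat (uniform) kernel, uses hand-made 2-D points in place of the TF-IDF and Word2Vec keyword vectors, and the function names (`mean_shift`, `silhouette`) and the `bandwidth` value are hypothetical choices made for illustration only. It shows how mean shift yields a cluster count without that count being supplied in advance, and how a silhouette score can then be used to check the result.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift (illustrative). Returns (modes, labels)."""
    shifted = []
    for start in points:
        p = list(start)
        for _ in range(max_iter):
            # Data points inside the window centred on the current position.
            nbrs = [q for q in points if math.dist(p, q) <= bandwidth]
            if not nbrs:  # defensive guard against rounding at the window edge
                break
            new_p = [sum(c) / len(nbrs) for c in zip(*nbrs)]
            moved = math.dist(p, new_p)
            p = new_p
            if moved < tol:
                break
        shifted.append(p)
    # Converged positions that lie close together are treated as one mode.
    modes, labels = [], []
    for p in shifted:
        for k, m in enumerate(modes):
            if math.dist(p, m) <= bandwidth / 2:
                labels.append(k)
                break
        else:
            modes.append(p)
            labels.append(len(modes) - 1)
    return modes, labels


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        return 0.0
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to each cluster, excluding p itself.
        mean_dist = {}
        for k in clusters:
            d = [math.dist(p, q) for j, q in enumerate(points)
                 if labels[j] == k and j != i]
            if d:
                mean_dist[k] = sum(d) / len(d)
        if labels[i] not in mean_dist:  # singleton cluster scores 0
            scores.append(0.0)
            continue
        a = mean_dist[labels[i]]
        b = min(v for k, v in mean_dist.items() if k != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data: three well-separated groups standing in for document vectors.
offsets = [(-0.5, 0.0), (0.5, 0.0), (0.0, -0.5), (0.0, 0.5), (0.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

modes, labels = mean_shift(points, bandwidth=2.0)  # bandwidth is an assumed value
score = silhouette(points, labels)
print(len(modes), round(score, 3))
```

On this toy data the number of modes found is the cluster count. With real keyword vectors the result depends strongly on the bandwidth, which is why a validity index such as the silhouette score is a useful check.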

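Translated title of the contribution: Determining Best Text Clustering Number with Mean Shift Algorithm
Original language: Chinese (Traditional)
Pages (from-to): 27-35
Number of pages: 9
Journal: Data Analysis and Knowledge Discovery
Volume: 3
Issue number: 9
Publication status: Published - Sep 2019
Externally published: Yes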

Keywords

  • Clustering Validity
  • Mean Shift
  • Number of Clusters
  • Text Clustering
