Abstract
[Objective] This paper explores how to determine the optimal number of clusters in text clustering, aiming to improve the effectiveness of related algorithms. [Methods] First, we combined the TF-IDF and Word2Vec algorithms to extract the top-N keyword vectors as the feature representation of the texts in the corpus. Then, we determined the best number of text clusters with the Mean Shift algorithm, a clustering validity index (Silhouette), and the mean square error (MSE) index. [Results] We found that the top 4500 keyword vectors represented the text features better. The best number of text clusters found by the Mean Shift algorithm matched the manually optimized results. [Limitations] The size of the experimental data sets needs to be expanded, and our results should be compared with those of other applications. [Conclusions] The proposed method can effectively determine the best number of text clusters in an unsupervised way.
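As a rough illustration of the idea in the abstract, the sketch below runs a flat-kernel mean shift on a few toy 2-D points and reads the number of clusters off the number of modes found, then checks the grouping with a mean Silhouette score. It is not the paper's implementation: the toy points stand in for the keyword vectors, and the flat kernel, the `bandwidth` value, the merge threshold, and the function names `mean_shift` and `silhouette` are all assumptions made for this example.

```python
import math

def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift (illustrative). Returns (modes, labels)."""
    converged = []
    for start in points:
        p = list(start)
        for _ in range(max_iter):
            # Average all data points within `bandwidth` of the current position.
            near = [q for q in points if math.dist(p, q) <= bandwidth]
            new_p = [sum(c) / len(near) for c in zip(*near)]
            moved = math.dist(p, new_p)
            p = new_p
            if moved < tol:
                break
        converged.append(p)
    # Merge converged positions that lie close together into a single mode.
    # The bandwidth / 2 threshold is an assumption of this sketch.
    modes, labels = [], []
    for p in converged:
        for k, m in enumerate(modes):
            if math.dist(p, m) < bandwidth / 2:
                labels.append(k)
                break
        else:
            modes.append(p)
            labels.append(len(modes) - 1)
    return modes, labels

def silhouette(points, labels):
    """Mean Silhouette score; singleton clusters contribute 0."""
    clusters = set(labels)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:
            continue
        a = sum(same) / len(same)  # mean distance within own cluster
        b = min(                   # mean distance to the nearest other cluster
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / len(points)

# Toy stand-in for document vectors: three well-separated groups.
points = [
    (0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (0.1, -0.2),
    (5.0, 5.0), (5.2, 4.9), (4.9, 5.1), (5.1, 5.2),
    (10.0, 0.0), (10.1, 0.2), (9.8, -0.1), (10.2, 0.1),
]
modes, labels = mean_shift(points, bandwidth=1.0)
score = silhouette(points, labels)
print(len(modes), round(score, 2))
```

On this toy data the procedure finds three modes, so the cluster count does not have to be supplied in advance. The result depends on the chosen bandwidth, which is why a validity index such as Silhouette is a useful cross-check.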
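Translated title of the contribution | Determining Best Text Clustering Number with Mean Shift Algorithm |
---|---|
Original language | Traditional Chinese |
Pages (from-to) | 27-35 |
Number of pages | 9 |
Journal | Data Analysis and Knowledge Discovery |
Volume | 3 |
Issue number | 9 |
DOI | |
Publication status | Published - Sep 2019 |
Externally published | Yes |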
Keywords
- Clustering Validity
- Mean Shift
- Number of Clusters
- Text Clustering