Abstract
[Objective] This paper explores a method for determining the optimal number of text clusters, aiming to improve the effectiveness of related clustering algorithms. [Methods] First, we combined the TF-IDF and Word2Vec algorithms to extract the top-N keyword vectors as the feature representation of texts in the corpus. Then, we determined the optimal number of clusters using the Mean Shift algorithm, a clustering validity index (Silhouette), and the mean square error (MSE) index. [Results] We found that the top 4500 keyword vectors represented the text features better. The optimal number of clusters found by the Mean Shift algorithm matched the manually optimized results. [Limitations] The experimental data sets need to be expanded, and our results should be compared with those of other applications. [Conclusions] The proposed method can effectively determine the optimal number of text clusters in an unsupervised way.
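To make the idea concrete, the following is a minimal sketch of how a mean shift procedure can yield a cluster count without that count being specified in advance, followed by a Silhouette check on the resulting partition. It is not the paper's implementation: the flat kernel, the `bandwidth` value, the function names `mean_shift` and `silhouette`, and the 2-D `toy_vectors` (standing in for the TF-IDF and Word2Vec keyword vectors) are all assumptions made for illustration.

```python
import math

def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift (illustrative). Returns (modes, labels)."""
    shifted = [list(p) for p in points]
    for _ in range(max_iter):
        max_move = 0.0
        for i, p in enumerate(shifted):
            # Original points lying within `bandwidth` of the current estimate.
            near = [q for q in points if math.dist(p, q) <= bandwidth]
            if not near:  # guard against floating-point edge cases
                continue
            new_p = [sum(c) / len(near) for c in zip(*near)]
            max_move = max(max_move, math.dist(p, new_p))
            shifted[i] = new_p
        if max_move < tol:
            break
    # Merge converged estimates that lie within `bandwidth` of an existing mode.
    modes, labels = [], []
    for p in shifted:
        for k, m in enumerate(modes):
            if math.dist(p, m) < bandwidth:
                labels.append(k)
                break
        else:
            modes.append(p)
            labels.append(len(modes) - 1)
    return modes, labels

def silhouette(points, labels):
    """Mean Silhouette coefficient; singleton clusters score 0 by convention."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy data: three well-separated groups of 2-D vectors.
toy_vectors = [
    (0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
    (5.0, 5.0), (5.2, 4.9), (4.9, 5.1),
    (10.0, 0.0), (10.1, 0.2), (9.9, -0.1),
]

modes, labels = mean_shift(toy_vectors, bandwidth=1.0)
print("estimated number of clusters:", len(modes))
print("silhouette:", round(silhouette(toy_vectors, labels), 3))
```

In this sketch the number of clusters is simply the number of distinct modes the points converge to, so the result depends on the chosen bandwidth; the Silhouette value offers one way to sanity-check that choice.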
Translated title of the contribution | Determining Best Text Clustering Number with Mean Shift Algorithm |
---|---|
Original language | Chinese (Traditional) |
Pages (from-to) | 27-35 |
Number of pages | 9 |
Journal | Data Analysis and Knowledge Discovery |
Volume | 3 |
Issue number | 9 |
Publication status | Published - Sept 2019 |
Externally published | Yes |