基于均值漂移算法的文本聚类数目优化研究

Translated title of the contribution: Determining Best Text Clustering Number with Mean Shift Algorithm

Huaming Zhao^*, Li Yu, Qiang Zhou

^*Corresponding author for this work

CAS - National Science Library

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

[Objective] This paper explores a method for determining the best number of text clusters, aiming to improve the effectiveness of related algorithms. [Methods] First, we combined the TF-IDF and Word2Vec algorithms to extract the top-N keyword vectors as the feature representation of the texts in the corpus. Then, we determined the best number of text clusters with the mean shift algorithm, a clustering validity index (Silhouette), and the mean square error (MSE) index. [Results] We found that the top 4500 keyword vectors could better represent the text features. The best number of text clusters found by the mean shift algorithm matched the manually optimized results. [Limitations] The experimental data sets need to be expanded, and our results should be compared with those of other applications. [Conclusions] The proposed method could effectively determine the best number of text clusters in an unsupervised way.
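The abstract does not give implementation details, so the following is only a minimal sketch of the general idea: mean shift proposes a number of clusters without that number being supplied in advance, and a silhouette score then checks the quality of the resulting partition. It assumes a flat kernel, a hand-picked bandwidth, and toy 2-D points standing in for the TF-IDF/Word2Vec keyword vectors. All function names and values here are hypothetical and are not taken from the paper.

```python
import math

def mean_shift_modes(points, bandwidth, max_iter=100, tol=1e-6):
    """Move each point to the mean of its neighbours (flat kernel) until it settles."""
    modes = []
    for p in points:
        current = list(p)
        for _ in range(max_iter):
            neighbours = [q for q in points if math.dist(current, q) <= bandwidth]
            shifted = [sum(coord) / len(neighbours) for coord in zip(*neighbours)]
            moved = math.dist(current, shifted)
            current = shifted
            if moved < tol:
                break
        modes.append(current)
    return modes

def merge_modes(modes, bandwidth):
    """Treat modes closer than half a bandwidth as the same cluster centre."""
    centers, labels = [], []
    for m in modes:
        for idx, c in enumerate(centers):
            if math.dist(m, c) < bandwidth / 2:
                labels.append(idx)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels

def silhouette(points, labels):
    """Mean silhouette coefficient; needs at least two clusters."""
    k = max(labels) + 1
    if k < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, p in enumerate(points):
        dists = [[] for _ in range(k)]
        for j, q in enumerate(points):
            if i != j:
                dists[labels[j]].append(math.dist(p, q))
        own = dists[labels[i]]
        if not own:  # singleton cluster: score is 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance inside the point's own cluster
        b = min(sum(d) / len(d)  # mean distance to the nearest other cluster
                for c, d in enumerate(dists) if c != labels[i] and d)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy stand-in for document vectors: three well-separated groups (assumed data).
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.1),
          (10.0, 0.0), (10.1, 0.2), (9.8, 0.1)]
bandwidth = 1.0  # hand-picked for this toy data

centers, labels = merge_modes(mean_shift_modes(points, bandwidth), bandwidth)
score = silhouette(points, labels)
print("estimated number of clusters:", len(centers))
print("silhouette:", round(score, 3))
```

On this toy data the sketch finds three clusters. With real text features the bandwidth would need to be chosen or estimated from the data, and the choice strongly affects how many modes survive.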

Translated title of the contribution	Determining Best Text Clustering Number with Mean Shift Algorithm
Original language	Chinese (Traditional)
Pages (from-to)	27-35
Number of pages	9
Journal	Data Analysis and Knowledge Discovery
Volume	3
Issue number	9
DOIs	https://doi.org/10.11925/infotech.2096-3467.2018.1259
Publication status	Published - Sept 2019
Externally published	Yes

Cite this

@article{29b84db1c87446db97a3b8f614da7bf7,
title = "基于均值漂移算法的文本聚类数目优化研究",
abstract = "[Objective] This paper explores a method for determining the best number of text clusters, aiming to improve the effectiveness of related algorithms. [Methods] First, we combined the TF-IDF and Word2Vec algorithms to extract the top-N keyword vectors as the feature representation of the texts in the corpus. Then, we determined the best number of text clusters with the mean shift algorithm, a clustering validity index (Silhouette), and the mean square error (MSE) index. [Results] We found that the top 4500 keyword vectors could better represent the text features. The best number of text clusters found by the mean shift algorithm matched the manually optimized results. [Limitations] The experimental data sets need to be expanded, and our results should be compared with those of other applications. [Conclusions] The proposed method could effectively determine the best number of text clusters in an unsupervised way.",
keywords = "Clustering Validity, Mean Shift, Number of Clusters, Text Clustering",
author = "Huaming Zhao and Li Yu and Qiang Zhou",
note = "Publisher Copyright: {\textcopyright} 2019 The Author(s).",
year = "2019",
month = sep,
doi = "10.11925/infotech.2096-3467.2018.1259",
language = "Chinese (Traditional)",
volume = "3",
pages = "27--35",
journal = "Data Analysis and Knowledge Discovery",
issn = "2096-3467",
publisher = "Chinese Academy of Sciences",
number = "9",
}

TY - JOUR
T1 - 基于均值漂移算法的文本聚类数目优化研究
AU - Zhao, Huaming
AU - Yu, Li
AU - Zhou, Qiang
PY - 2019/9
Y1 - 2019/9
AB - [Objective] This paper explores a method for determining the best number of text clusters, aiming to improve the effectiveness of related algorithms. [Methods] First, we combined the TF-IDF and Word2Vec algorithms to extract the top-N keyword vectors as the feature representation of the texts in the corpus. Then, we determined the best number of text clusters with the mean shift algorithm, a clustering validity index (Silhouette), and the mean square error (MSE) index. [Results] We found that the top 4500 keyword vectors could better represent the text features. The best number of text clusters found by the mean shift algorithm matched the manually optimized results. [Limitations] The experimental data sets need to be expanded, and our results should be compared with those of other applications. [Conclusions] The proposed method could effectively determine the best number of text clusters in an unsupervised way.
KW - Clustering Validity
KW - Mean Shift
KW - Number of Clusters
KW - Text Clustering
UR - http://www.scopus.com/inward/record.url?scp=85115313761&partnerID=8YFLogxK
U2 - 10.11925/infotech.2096-3467.2018.1259
DO - 10.11925/infotech.2096-3467.2018.1259
M3 - Article
AN - SCOPUS:85115313761
SN - 2096-3467
VL - 3
SP - 27
EP - 35
JO - Data Analysis and Knowledge Discovery
JF - Data Analysis and Knowledge Discovery
IS - 9
ER -
