基于均值漂移算法的文本聚类数目优化研究

Huaming Zhao; Li Yu; Qiang Zhou

doi:10.11925/infotech.2096-3467.2018.1259

Huaming Zhao^*, Li Yu, Qiang Zhou

^*Corresponding author for this work

CAS - National Science Library

Research output: Contribution to journal › Article › Peer-reviewed

3 citations (Scopus)

Abstract

[Objective] This paper explores a method for determining the optimal number of text clusters, aiming to improve the effectiveness of related algorithms. [Methods] First, we combined the TF-IDF and Word2Vec algorithms to extract the top-N keyword vectors as the feature representation of texts in the corpus. Then, we determined the optimal number of clusters using the mean shift algorithm, a clustering validity index (Silhouette), and the mean square error (MSE) index. [Results] We found that the top 4500 keyword vectors better represented the text features. The number of clusters obtained by the mean shift algorithm matched the manually optimized results. [Limitations] The experimental data sets need to be expanded. Our results should be compared with those of other applications. [Conclusions] The proposed method could effectively determine the optimal number of text clusters in an unsupervised way.
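The core idea in the abstract, letting mean shift discover the number of clusters and then checking the result with the Silhouette index, can be illustrated with a minimal sketch. This is not the authors' implementation. The toy 2-D points stand in for document feature vectors, the bandwidth of 1.0 is an arbitrary assumption, and the helper names `mean_shift` and `silhouette` are hypothetical. The TF-IDF and Word2Vec feature extraction step is not reproduced here.

```python
import math

def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift. Returns (centers, labels).

    Each point is shifted to the mean of all points within `bandwidth`
    until it stops moving; converged positions that lie closer than
    `bandwidth` to each other are merged into one cluster center.
    """
    modes = []
    for p in points:
        center = list(p)
        for _ in range(max_iter):
            neighbours = [q for q in points if math.dist(center, q) <= bandwidth]
            if not neighbours:  # defensive guard against rounding edge cases
                break
            new_center = [sum(coord) / len(neighbours) for coord in zip(*neighbours)]
            shift = math.dist(center, new_center)
            center = new_center
            if shift < tol:
                break
        modes.append(center)

    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if math.dist(m, c) < bandwidth:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels

def silhouette(points, labels):
    """Mean silhouette coefficient; 0.0 is used for singleton or single-cluster cases."""
    scores = []
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.pop(labels[i], None)
        if not own or not dists:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                             # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in dists.values())    # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy data (assumption): three well-separated groups of 2-D "document vectors".
points = [
    (0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (0.1, -0.2),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.2),
    (0.0, 5.0), (0.1, 5.2), (-0.2, 4.9), (0.2, 4.8),
]
centers, labels = mean_shift(points, bandwidth=1.0)
score = silhouette(points, labels)
print("estimated number of clusters:", len(centers))
print("silhouette:", round(score, 3))
```

The number of clusters is never passed in: it falls out of how many distinct modes the shifted points converge to. The result depends on the bandwidth, which is why a validity index such as Silhouette is useful as a check on the estimate.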

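Translated title of the contribution	Determining Best Text Clustering Number with Mean Shift Algorithm
Original language	Chinese (Traditional)
Pages (from-to)	27-35
Number of pages	9
Journal	Data Analysis and Knowledge Discovery
Volume	3
Issue number	9
DOI	https://doi.org/10.11925/infotech.2096-3467.2018.1259
Publication status	Published - Sept 2019
Externally published	Yes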

Keywords

Clustering Validity
Mean Shift
Number of Clusters
Text Clustering

Cite this

@article{29b84db1c87446db97a3b8f614da7bf7,
  title = "基于均值漂移算法的文本聚类数目优化研究",
  abstract = "[Objective] This paper explores a method for determining the optimal number of text clusters, aiming to improve the effectiveness of related algorithms. [Methods] First, we combined the TF-IDF and Word2Vec algorithms to extract the top-N keyword vectors as the feature representation of texts in the corpus. Then, we determined the optimal number of clusters using the mean shift algorithm, a clustering validity index (Silhouette), and the mean square error (MSE) index. [Results] We found that the top 4500 keyword vectors better represented the text features. The number of clusters obtained by the mean shift algorithm matched the manually optimized results. [Limitations] The experimental data sets need to be expanded. Our results should be compared with those of other applications. [Conclusions] The proposed method could effectively determine the optimal number of text clusters in an unsupervised way.",
  keywords = "Clustering Validity, Mean Shift, Number of Clusters, Text Clustering",
  author = "Huaming Zhao and Li Yu and Qiang Zhou",
  note = "Publisher Copyright: {\textcopyright} 2019 The Author(s).",
  year = "2019",
  month = sep,
  doi = "10.11925/infotech.2096-3467.2018.1259",
  language = "Chinese (Traditional)",
  volume = "3",
  pages = "27--35",
  journal = "Data Analysis and Knowledge Discovery",
  issn = "2096-3467",
  publisher = "Chinese Academy of Sciences",
  number = "9",
}

TY - JOUR
T1 - 基于均值漂移算法的文本聚类数目优化研究
AU - Zhao, Huaming
AU - Yu, Li
AU - Zhou, Qiang
PY - 2019/9
Y1 - 2019/9
N2 - [Objective] This paper explores a method for determining the optimal number of text clusters, aiming to improve the effectiveness of related algorithms. [Methods] First, we combined the TF-IDF and Word2Vec algorithms to extract the top-N keyword vectors as the feature representation of texts in the corpus. Then, we determined the optimal number of clusters using the mean shift algorithm, a clustering validity index (Silhouette), and the mean square error (MSE) index. [Results] We found that the top 4500 keyword vectors better represented the text features. The number of clusters obtained by the mean shift algorithm matched the manually optimized results. [Limitations] The experimental data sets need to be expanded. Our results should be compared with those of other applications. [Conclusions] The proposed method could effectively determine the optimal number of text clusters in an unsupervised way.
AB - [Objective] This paper explores a method for determining the optimal number of text clusters, aiming to improve the effectiveness of related algorithms. [Methods] First, we combined the TF-IDF and Word2Vec algorithms to extract the top-N keyword vectors as the feature representation of texts in the corpus. Then, we determined the optimal number of clusters using the mean shift algorithm, a clustering validity index (Silhouette), and the mean square error (MSE) index. [Results] We found that the top 4500 keyword vectors better represented the text features. The number of clusters obtained by the mean shift algorithm matched the manually optimized results. [Limitations] The experimental data sets need to be expanded. Our results should be compared with those of other applications. [Conclusions] The proposed method could effectively determine the optimal number of text clusters in an unsupervised way.
KW - Clustering Validity
KW - Mean Shift
KW - Number of Clusters
KW - Text Clustering
UR - http://www.scopus.com/inward/record.url?scp=85115313761&partnerID=8YFLogxK
U2 - 10.11925/infotech.2096-3467.2018.1259
DO - 10.11925/infotech.2096-3467.2018.1259
M3 - Article
AN - SCOPUS:85115313761
SN - 2096-3467
VL - 3
SP - 27
EP - 35
JO - Data Analysis and Knowledge Discovery
JF - Data Analysis and Knowledge Discovery
IS - 9
ER -
