An improved KNN text categorization algorithm by adopting cluster technology

Xiao Fei Zhang^*, He Yan Huang

^*Corresponding author for this work

Chinese Academy of Sciences

Research output: Contribution to journal › Article › peer-review

17 Citations (Scopus)

Abstract

The k-nearest neighbor (KNN) algorithm offers high accuracy and stability. However, its time complexity is directly proportional to the sample size, so classification is slow and the algorithm is difficult to apply to large-scale information processing. This paper proposes an improved KNN text categorization algorithm that classifies faster than traditional KNN. First, similar sample documents are merged into a center document using automatic text clustering. Then, the large number of original samples is replaced by the small number of cluster centers. This greatly reduces the computation required by KNN and speeds up classification. Experimental results show that the time complexity of the proposed algorithm decreases by one order of magnitude, while its accuracy is approximately equal to that of SVM and traditional KNN.
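
The two steps described in the abstract (merge similar samples into center documents, then run KNN against the centers only) might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not name the clustering method, the similarity measure, or the voting rule, so the single-pass threshold clustering within each class, the cosine similarity, the unweighted vote, and all function names and the `threshold` parameter here are hypothetical choices.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two dense term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_centers(samples, threshold=0.8):
    """Merge similar same-class samples into center documents.

    Assumed scheme (not taken from the paper): a sample joins the most
    similar existing cluster of its class if the cosine similarity to that
    cluster's current mean is at least `threshold`; otherwise it starts a
    new cluster.
    """
    clusters = defaultdict(list)  # label -> list of clusters (member lists)
    for vector, label in samples:
        best, best_sim = None, threshold
        for members in clusters[label]:
            sim = cosine(vector, mean(members))
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters[label].append([vector])
        else:
            best.append(vector)
    return [(mean(members), label)
            for label, groups in clusters.items()
            for members in groups]


def knn_classify(query, centers, k=3):
    """Unweighted majority vote among the k centers most similar to query."""
    ranked = sorted(centers, key=lambda c: cosine(query, c[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy two-dimensional "documents"; real inputs would be TF-IDF vectors.
    samples = [
        ([1.0, 0.0], "sports"), ([0.9, 0.1], "sports"), ([0.8, 0.2], "sports"),
        ([0.0, 1.0], "finance"), ([0.1, 0.9], "finance"), ([0.2, 0.8], "finance"),
    ]
    centers = build_centers(samples)
    print(len(samples), "samples reduced to", len(centers), "centers")
    print(knn_classify([0.7, 0.3], centers, k=1))
```

With N original samples reduced to M centers, each query computes M similarities instead of N, which is where the speedup claimed in the abstract would come from. The trade-off is that a center stands in for several documents, so an unweighted vote as used here ignores cluster size.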

Original language	English
Pages (from-to)	936-940
Number of pages	5
Journal	Moshi Shibie yu Rengong Zhineng/Pattern Recognition and Artificial Intelligence
Volume	22
Issue number	6
Publication status	Published - Dec 2009
Externally published	Yes

Keywords

Cluster center
Natural language processing (NLP)
Text categorization
Text clustering
k-Nearest neighbor (KNN)

Cite this

@article{f98c7591347c4b25addca53a5e32747b,

title = "An improved KNN text categorization algorithm by adopting cluster technology",

abstract = "k-Nearest Neighbor (KNN) algorithm has the advantage of high accuracy and stability. But the time complexity of KNN is directly proportional to the sample size, its classification speed is low and it is problematic to be put into practice in large-scale information processing. An improved KNN text categorization algorithm is proposed which classifies faster than the traditional KNN does. Firstly, some similar sample documents are combined into a center document through adopting automatic text clustering technology. Then, a large number of original samples are replaced with the small amount of sample cluster centers. Therefore, the calculation amount of KNN is reduced greatly and the classification is speeded up. The experimental results show that the time complexity of the proposed algorithm is decreased by one order of magnitude and its accuracy is approximately equal to those of the SVM and traditional KNN.",

keywords = "Cluster center, Natural language processing (NLP), Text categorization, Text clustering, k-Nearest neighbor (KNN)",

author = "Zhang, {Xiao Fei} and Huang, {He Yan}",

year = "2009",

month = dec,

language = "English",

volume = "22",

pages = "936--940",

journal = "Moshi Shibie yu Rengong Zhineng/Pattern Recognition and Artificial Intelligence",

issn = "1003-6059",

publisher = "Science China Press",

number = "6",

}

TY - JOUR

T1 - An improved KNN text categorization algorithm by adopting cluster technology

AU - Zhang, Xiao Fei

AU - Huang, He Yan

PY - 2009/12

Y1 - 2009/12

N2 - k-Nearest Neighbor (KNN) algorithm has the advantage of high accuracy and stability. But the time complexity of KNN is directly proportional to the sample size, its classification speed is low and it is problematic to be put into practice in large-scale information processing. An improved KNN text categorization algorithm is proposed which classifies faster than the traditional KNN does. Firstly, some similar sample documents are combined into a center document through adopting automatic text clustering technology. Then, a large number of original samples are replaced with the small amount of sample cluster centers. Therefore, the calculation amount of KNN is reduced greatly and the classification is speeded up. The experimental results show that the time complexity of the proposed algorithm is decreased by one order of magnitude and its accuracy is approximately equal to those of the SVM and traditional KNN.

KW - Cluster center

KW - Natural language processing (NLP)

KW - Text categorization

KW - Text clustering

KW - k-Nearest neighbor (KNN)

UR - http://www.scopus.com/inward/record.url?scp=75349085496&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:75349085496

SN - 1003-6059

VL - 22

SP - 936

EP - 940

JO - Moshi Shibie yu Rengong Zhineng/Pattern Recognition and Artificial Intelligence

JF - Moshi Shibie yu Rengong Zhineng/Pattern Recognition and Artificial Intelligence

IS - 6

ER -
