TY - GEN
T1 - A probabilistic model based on uncertainty for data clustering
AU - Yu, Yaxin
AU - Zhu, Xinhua
AU - Li, Miao
AU - Wang, Guoren
AU - Luo, Dan
PY - 2013
Y1 - 2013
N2 - Recently, all kinds of real-life data have grown explosively. To manage these data, dataspace has become a universal platform that contains various kinds of data, such as unstructured, semi-structured, and structured data. However, how to cluster the data in a dataspace efficiently and accurately, so as to help users manage and explore them, remains an intractable problem. Previous work has not sufficiently considered the uncertain relationship between term and topic. Among the many techniques for handling this problem, probability theory provides an effective way to deal with the uncertainty of clustering. We therefore propose a novel probability model based on topic terms, the Probabilistic Term Similarity Model (PTSM), to tackle the uncertainty between term and topic. The model considers not only terms from various data but also the structure information of semi-structured and structured data. Each term is assigned a probability indicating how relevant it is to the topic. Then, according to the probability of each term, a probabilistic matrix is established for clustering the various data. Finally, extensive experimental results show that the clustering method based on this probabilistic model achieves excellent performance and outperforms some other classical algorithms.
AB - Recently, all kinds of real-life data have grown explosively. To manage these data, dataspace has become a universal platform that contains various kinds of data, such as unstructured, semi-structured, and structured data. However, how to cluster the data in a dataspace efficiently and accurately, so as to help users manage and explore them, remains an intractable problem. Previous work has not sufficiently considered the uncertain relationship between term and topic. Among the many techniques for handling this problem, probability theory provides an effective way to deal with the uncertainty of clustering. We therefore propose a novel probability model based on topic terms, the Probabilistic Term Similarity Model (PTSM), to tackle the uncertainty between term and topic. The model considers not only terms from various data but also the structure information of semi-structured and structured data. Each term is assigned a probability indicating how relevant it is to the topic. Then, according to the probability of each term, a probabilistic matrix is established for clustering the various data. Finally, extensive experimental results show that the clustering method based on this probabilistic model achieves excellent performance and outperforms some other classical algorithms.
KW - data clustering
KW - dataspace
KW - probability
KW - topic
KW - uncertainty
UR - https://www.scopus.com/pages/publications/84873840769
U2 - 10.1007/978-3-642-36288-0_12
DO - 10.1007/978-3-642-36288-0_12
M3 - Conference contribution
AN - SCOPUS:84873840769
SN - 9783642362873
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 126
EP - 138
BT - Agents and Data Mining Interaction - 8th International Workshop, ADMI 2012, Revised Selected Papers
T2 - 8th International Workshop on Agents and Data Mining Interaction, ADMI 2012
Y2 - 4 June 2012 through 5 June 2012
ER -