TY - GEN
T1 - Imbalanced text classification on host pathogen protein-protein interaction documents
AU - Xu, Guixian
AU - Niu, Zhendong
AU - Gao, Xu
AU - Liu, Hongfang
PY - 2010
Y1 - 2010
N2 - Protein-protein interactions (PPIs) are important in understanding the fundamental processes governing cell biology. However, a large number of scientific findings about PPIs are buried in the growing volume of biomedical literature. Document classification systems have been shown to have the potential to accelerate the curation process by retrieving PPI-related documents. However, it is usually the case that only a small number of positive documents can be obtained, manually or from PPI knowledge bases with literature-based evidence, while there are a large number of negative documents. In this paper, we investigate the effects of feature selection, feature weighting, and the kernel function of Support Vector Machines (SVMs) on imbalanced two-class classification, based on 1360 host-pathogen protein-protein interaction documents. The results show that a suitable feature weighting approach is an important factor in improving classification performance. Adjusting the cost-sensitive parameter of an SVM with a radial basis function (RBF) kernel can decrease the minority-class misclassification ratio and increase classification accuracy in imbalanced document classification. An automated classification system to identify MEDLINE abstracts referring to host-pathogen protein-protein interactions can be developed based on the experiments.
AB - Protein-protein interactions (PPIs) are important in understanding the fundamental processes governing cell biology. However, a large number of scientific findings about PPIs are buried in the growing volume of biomedical literature. Document classification systems have been shown to have the potential to accelerate the curation process by retrieving PPI-related documents. However, it is usually the case that only a small number of positive documents can be obtained, manually or from PPI knowledge bases with literature-based evidence, while there are a large number of negative documents. In this paper, we investigate the effects of feature selection, feature weighting, and the kernel function of Support Vector Machines (SVMs) on imbalanced two-class classification, based on 1360 host-pathogen protein-protein interaction documents. The results show that a suitable feature weighting approach is an important factor in improving classification performance. Adjusting the cost-sensitive parameter of an SVM with a radial basis function (RBF) kernel can decrease the minority-class misclassification ratio and increase classification accuracy in imbalanced document classification. An automated classification system to identify MEDLINE abstracts referring to host-pathogen protein-protein interactions can be developed based on the experiments.
KW - Imbalanced text classification
KW - Machine learning
KW - Protein-protein interaction
UR - http://www.scopus.com/inward/record.url?scp=77952594815&partnerID=8YFLogxK
U2 - 10.1109/ICCAE.2010.5451921
DO - 10.1109/ICCAE.2010.5451921
M3 - Conference contribution
AN - SCOPUS:77952594815
SN - 9781424455850
T3 - 2010 The 2nd International Conference on Computer and Automation Engineering, ICCAE 2010
SP - 418
EP - 422
BT - 2010 The 2nd International Conference on Computer and Automation Engineering, ICCAE 2010
T2 - 2nd International Conference on Computer and Automation Engineering, ICCAE 2010
Y2 - 26 February 2010 through 28 February 2010
ER -