TY - GEN
T1 - Document classification for mining host pathogen protein-protein interactions
AU - Xu, Guixian
AU - Yin, Lanlan
AU - Torii, Manabu
AU - Niu, Zhendong
AU - Wu, Cathy
AU - Hu, Zhangzhi
AU - Liu, Hongfang
PY - 2008
Y1 - 2008
N2 - Due to the heightened concern about bioterrorism and emerging/reemerging infectious diseases, a flood of molecular data about human pathogens has been generated and maintained in disparate databases. However, scientific findings regarding these pathogens and their host responses are buried in the growing volume of biomedical literature, and there is an urgent need to mine information pertaining to pathogenesis-related proteins, especially host-pathogen protein-protein interactions, from the literature. In this paper, we report our exploration of developing an automated system to identify MEDLINE abstracts referring to host-pathogen protein-protein interactions. An annotated corpus consisting of 1,360 MEDLINE abstracts was generated. With this corpus, we developed and evaluated document classification systems using support vector machines (SVMs). We also investigated the effects of feature selection using the information gain (IG) measure. Document classification systems were designed at two levels, abstract-level and sentence-level. We observed that feature selection was effective not only in reducing the dimensionality of features to build a compact system, but also in improving document classification performance. We also observed that abstract-level and sentence-level systems yielded different classifications of MEDLINE abstracts, and that combining these systems could improve overall document classification.
UR - http://www.scopus.com/inward/record.url?scp=58049158462&partnerID=8YFLogxK
U2 - 10.1109/BIBM.2008.66
DO - 10.1109/BIBM.2008.66
M3 - Conference contribution
AN - SCOPUS:58049158462
SN - 9780769534527
T3 - Proceedings - IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008
SP - 461
EP - 466
BT - Proceedings - IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008
T2 - 2008 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008
Y2 - 3 November 2008 through 5 November 2008
ER -