Document classification for mining host pathogen protein-protein interactions

Guixian Xu; Lanlan Yin; Manabu Torii; Zhendong Niu; Cathy Wu; Zhangzhi Hu; Hongfang Liu

doi:10.1109/BIBM.2008.66

Guixian Xu^*, Lanlan Yin, Manabu Torii, Zhendong Niu, Cathy Wu, Zhangzhi Hu, Hongfang Liu

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Citations (Scopus)

Abstract

Due to the heightened concern about bioterrorism and emerging/reemerging infectious diseases, a flood of molecular data about human pathogens has been generated and maintained in disparate databases. However, scientific findings regarding these pathogens and their host responses are buried in the growing volume of biomedical literature and there is an urgent need to mine information pertaining to pathogenesis-related proteins especially host-pathogen protein-protein interactions from literature. In this paper, we report our exploration of developing an automated system to identify MEDLINE abstracts referring to host-pathogen protein-protein interactions. An annotated corpus consisting of 1,360 MEDLINE abstracts was generated. With this corpus, we developed and evaluated document classification systems using support vector machines (SVMs). We also investigated the effects of feature selection using the information gain (IG) measure. Document classification systems were designed at two levels, abstract-level and sentence-level. We observed that feature selection was effective not only in reducing the dimensionality of features to build a compact system, but also in improving document classification performance. We also observed abstract-level systems and sentence-level systems yielded different classification of MEDLINE abstracts, and the combination of these systems could improve the overall document classification.
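The abstract names information gain (IG) as the feature-selection measure but does not give its formula or implementation. As a rough illustration only, the sketch below computes IG for binary "term present" features and keeps the top-scoring terms. The bag-of-words representation, the function names (`entropy`, `information_gain`, `top_k_terms`), and the four toy documents are all assumptions made for this example, not details from the paper. SVM training on the selected features is omitted.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in the document' w.r.t. the labels."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    h_class = entropy(list(Counter(labels).values()))
    # Conditional entropy: weighted entropy of each non-empty partition.
    h_cond = sum(
        (len(part) / n) * entropy(list(Counter(part).values()))
        for part in (with_term, without_term)
        if part
    )
    return h_class - h_cond


def top_k_terms(docs, labels, k):
    """Return the k terms with the highest IG (ties broken alphabetically)."""
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))[:k]


# Hypothetical toy corpus: each document is a set of tokens;
# label 1 = mentions a host-pathogen interaction, 0 = does not.
docs = [
    {"virus", "binds", "host", "receptor"},
    {"bacterial", "toxin", "binds", "host", "protein"},
    {"genome", "sequence", "virus"},
    {"patient", "cohort", "study"},
]
labels = [1, 1, 0, 0]

selected = top_k_terms(docs, labels, 2)
print(selected)
```

In this toy corpus, "binds" and "host" separate the two classes perfectly, so they score highest, while "virus" appears equally in both classes and scores zero. A reduced vocabulary of this kind is what would then be passed to the classifier.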

Original language	English
Title of host publication	Proceedings - IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008
Pages	461-466
Number of pages	6
DOIs	https://doi.org/10.1109/BIBM.2008.66
Publication status	Published - 2008
Event	2008 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008 - Philadelphia, PA, United States
Duration	3 Nov 2008 → 5 Nov 2008

Publication series

Name	Proceedings - IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008

Conference

Conference	2008 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008
Country/Territory	United States
City	Philadelphia, PA
Period	3 Nov 2008 → 5 Nov 2008

Cite this

Xu, G., Yin, L., Torii, M., Niu, Z., Wu, C., Hu, Z., & Liu, H. (2008). Document classification for mining host pathogen protein-protein interactions. In Proceedings - IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008 (pp. 461-466). Article 4684940 (Proceedings - IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008). https://doi.org/10.1109/BIBM.2008.66

@inproceedings{d12e19ca5ef542df9124f1b1e79452d1,

title = "Document classification for mining host pathogen protein-protein interactions",

abstract = "Due to the heightened concern about bioterrorism and emerging/reemerging infectious diseases, a flood of molecular data about human pathogens has been generated and maintained in disparate databases. However, scientific findings regarding these pathogens and their host responses are buried in the growing volume of biomedical literature and there is an urgent need to mine information pertaining to pathogenesis-related proteins especially host-pathogen protein-protein interactions from literature. In this paper, we report our exploration of developing an automated system to identify MEDLINE abstracts referring to host-pathogen protein-protein interactions. An annotated corpus consisting of 1,360 MEDLINE abstracts was generated. With this corpus, we developed and evaluated document classification systems using support vector machines (SVMs). We also investigated the effects of feature selection using the information gain (IG) measure. Document classification systems were designed at two levels, abstract-level and sentence-level. We observed that feature selection was effective not only in reducing the dimensionality of features to build a compact system, but also in improving document classification performance. We also observed abstract-level systems and sentence-level systems yielded different classification of MEDLINE abstracts, and the combination of these systems could improve the overall document classification.",

author = "Guixian Xu and Lanlan Yin and Manabu Torii and Zhendong Niu and Cathy Wu and Zhangzhi Hu and Hongfang Liu",

year = "2008",

doi = "10.1109/BIBM.2008.66",

language = "English",

isbn = "9780769534527",

series = "Proceedings - IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008",

pages = "461--466",

booktitle = "Proceedings - IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008",

note = "2008 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008 ; Conference date: 03-11-2008 Through 05-11-2008",

}

Xu, G, Yin, L, Torii, M, Niu, Z, Wu, C, Hu, Z & Liu, H 2008, Document classification for mining host pathogen protein-protein interactions. in Proceedings - IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008, 4684940, Proceedings - IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008, pp. 461-466, 2008 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008, Philadelphia, PA, United States, 3 Nov 2008. https://doi.org/10.1109/BIBM.2008.66

Document classification for mining host pathogen protein-protein interactions. / Xu, Guixian; Yin, Lanlan; Torii, Manabu et al.
Proceedings - IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008. 2008. p. 461-466 4684940 (Proceedings - IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008).

TY - GEN

T1 - Document classification for mining host pathogen protein-protein interactions

AU - Xu, Guixian

AU - Yin, Lanlan

AU - Torii, Manabu

AU - Niu, Zhendong

AU - Wu, Cathy

AU - Hu, Zhangzhi

AU - Liu, Hongfang

PY - 2008

Y1 - 2008

N2 - Due to the heightened concern about bioterrorism and emerging/reemerging infectious diseases, a flood of molecular data about human pathogens has been generated and maintained in disparate databases. However, scientific findings regarding these pathogens and their host responses are buried in the growing volume of biomedical literature and there is an urgent need to mine information pertaining to pathogenesis-related proteins especially host-pathogen protein-protein interactions from literature. In this paper, we report our exploration of developing an automated system to identify MEDLINE abstracts referring to host-pathogen protein-protein interactions. An annotated corpus consisting of 1,360 MEDLINE abstracts was generated. With this corpus, we developed and evaluated document classification systems using support vector machines (SVMs). We also investigated the effects of feature selection using the information gain (IG) measure. Document classification systems were designed at two levels, abstract-level and sentence-level. We observed that feature selection was effective not only in reducing the dimensionality of features to build a compact system, but also in improving document classification performance. We also observed abstract-level systems and sentence-level systems yielded different classification of MEDLINE abstracts, and the combination of these systems could improve the overall document classification.

AB - Due to the heightened concern about bioterrorism and emerging/reemerging infectious diseases, a flood of molecular data about human pathogens has been generated and maintained in disparate databases. However, scientific findings regarding these pathogens and their host responses are buried in the growing volume of biomedical literature and there is an urgent need to mine information pertaining to pathogenesis-related proteins especially host-pathogen protein-protein interactions from literature. In this paper, we report our exploration of developing an automated system to identify MEDLINE abstracts referring to host-pathogen protein-protein interactions. An annotated corpus consisting of 1,360 MEDLINE abstracts was generated. With this corpus, we developed and evaluated document classification systems using support vector machines (SVMs). We also investigated the effects of feature selection using the information gain (IG) measure. Document classification systems were designed at two levels, abstract-level and sentence-level. We observed that feature selection was effective not only in reducing the dimensionality of features to build a compact system, but also in improving document classification performance. We also observed abstract-level systems and sentence-level systems yielded different classification of MEDLINE abstracts, and the combination of these systems could improve the overall document classification.

UR - http://www.scopus.com/inward/record.url?scp=58049158462&partnerID=8YFLogxK

U2 - 10.1109/BIBM.2008.66

DO - 10.1109/BIBM.2008.66

M3 - Conference contribution

AN - SCOPUS:58049158462

SN - 9780769534527

T3 - Proceedings - IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008

SP - 461

EP - 466

BT - Proceedings - IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008

T2 - 2008 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008

Y2 - 3 November 2008 through 5 November 2008

ER -
