TY - JOUR
T1 - Protein remote homology detection and fold recognition based on features extracted from frequency profiles
AU - Lin, Lei
AU - Liu, Bin
AU - Wang, Xiaolong
AU - Wang, Xuan
AU - Tang, Buzhou
PY - 2011
Y1 - 2011
N2 - Protein remote homology detection and fold recognition are central problems in bioinformatics. Currently, discriminative methods based on support vector machines (SVMs) are the most effective and accurate methods for solving these problems. The performance of an SVM depends on the method of protein vectorization, so a suitable representation of the protein sequence is a key step for SVM-based methods. In this paper, two kinds of profile-level building blocks of proteins, binary profiles and N-nary profiles, are presented; both contain the evolutionary information of the protein sequence frequency profile. The protein sequence frequency profiles, calculated from the multiple sequence alignments output by PSI-BLAST, are converted into binary profiles or N-nary profiles. The protein sequences are transformed into fixed-dimension feature vectors by counting the occurrences of each binary profile or N-nary profile, and the corresponding vectors are then input to support vector machines. The latent semantic analysis (LSA) model, an efficient feature extraction algorithm, is adopted to further improve the performance of our methods. Experiments with protein remote homology detection and fold recognition show that the methods based on profile-level building blocks give better results than related methods.
AB - Protein remote homology detection and fold recognition are central problems in bioinformatics. Currently, discriminative methods based on support vector machines (SVMs) are the most effective and accurate methods for solving these problems. The performance of an SVM depends on the method of protein vectorization, so a suitable representation of the protein sequence is a key step for SVM-based methods. In this paper, two kinds of profile-level building blocks of proteins, binary profiles and N-nary profiles, are presented; both contain the evolutionary information of the protein sequence frequency profile. The protein sequence frequency profiles, calculated from the multiple sequence alignments output by PSI-BLAST, are converted into binary profiles or N-nary profiles. The protein sequences are transformed into fixed-dimension feature vectors by counting the occurrences of each binary profile or N-nary profile, and the corresponding vectors are then input to support vector machines. The latent semantic analysis (LSA) model, an efficient feature extraction algorithm, is adopted to further improve the performance of our methods. Experiments with protein remote homology detection and fold recognition show that the methods based on profile-level building blocks give better results than related methods.
KW - Fold recognition
KW - Frequency profiles
KW - Latent semantic analysis
KW - Remote homology detection
KW - Support vector machine
UR - https://www.scopus.com/pages/publications/79951741386
U2 - 10.4304/jcp.6.2.321-328
DO - 10.4304/jcp.6.2.321-328
M3 - Article
AN - SCOPUS:79951741386
SN - 1796-203X
VL - 6
SP - 321
EP - 328
JO - Journal of Computers (Finland)
JF - Journal of Computers (Finland)
IS - 2
ER -