TY - JOUR
T1 - Identify essential genes based on clustering based synthetic minority oversampling technique
AU - Shi, Hua
AU - Wu, Chenjin
AU - Bai, Tao
AU - Chen, Jiahai
AU - Li, Yan
AU - Wu, Hao
N1 - Publisher Copyright: © 2023 Elsevier Ltd
PY - 2023/2
Y1 - 2023/2
N2 - Prediction of essential genes in a living organism is one of the central tasks in synthetic biology. Computational predictors are desirable because experimental data are often unavailable. Several sequence-based predictors have recently been constructed to identify essential genes, but their predictive performance leaves room for improvement. One key problem is how to effectively extract sequence-based features that can discriminate essential genes. Another is the imbalanced training set: in human cell lines, essential genes are fewer than non-essential genes, so predictors trained on such an imbalanced training set tend to classify an unseen sequence as a non-essential gene. Here, a new over-sampling strategy called Clustering-based Synthetic Minority Oversampling Technique (CSMOTE) was proposed to overcome the imbalanced data issue. Combining CSMOTE with the Z curve, global features, and Support Vector Machines, a new protocol called iEsGene-CSMOTE was proposed to identify essential genes. Rigorous jackknife cross-validation results indicated that iEsGene-CSMOTE performs better than the competing methods. The proposed method outperformed the λ-interval Z curve by 35.48% and 11.25% in terms of Sn (sensitivity) and BACC (balanced accuracy), respectively.
KW - Cluster-SMOTE
KW - Essential gene
KW - Human cell lines
KW - Support vector machine
UR - http://www.scopus.com/inward/record.url?scp=85146436545&partnerID=8YFLogxK
U2 - 10.1016/j.compbiomed.2022.106523
DO - 10.1016/j.compbiomed.2022.106523
M3 - Article
C2 - 36652869
AN - SCOPUS:85146436545
SN - 0010-4825
VL - 153
JO - Computers in Biology and Medicine
JF - Computers in Biology and Medicine
M1 - 106523
ER -