Identify essential genes based on clustering based synthetic minority oversampling technique

Hua Shi, Chenjin Wu, Tao Bai*, Jiahai Chen, Yan Li, Hao Wu*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

8 Citations (Scopus)

Abstract

Prediction of essential genes in a living organism is one of the central tasks in synthetic biology. Computational predictors are desirable because experimental data are often unavailable. Recently, several sequence-based predictors have been constructed to identify essential genes, but their predictive performance leaves room for improvement. One key problem is how to effectively extract sequence-based features that can discriminate essential genes from non-essential ones. Another problem is the imbalanced training set: in human cell lines, the number of essential genes is lower than that of non-essential genes. As a result, predictors trained on such an imbalanced training set tend to classify an unseen sequence as a non-essential gene. Here, a new over-sampling strategy called Clustering based Synthetic Minority Oversampling Technique (CSMOTE) is proposed to overcome the imbalanced data issue. Combining CSMOTE with the Z curve, the global features, and Support Vector Machines, a new protocol called iEsGene-CSMOTE is proposed to identify essential genes. Rigorous jackknife cross-validation results indicate that iEsGene-CSMOTE performs better than the competing methods. The proposed method outperforms the λ-interval Z curve by 35.48% and 11.25% in terms of sensitivity (Sn) and balanced accuracy (BACC), respectively.
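The abstract does not give the details of CSMOTE, so the following is only a minimal sketch of the general idea its name suggests: cluster the minority (essential-gene) samples, then create synthetic samples by interpolating between members of the same cluster. The function names (`kmeans`, `cluster_smote`, `sn_bacc`), the use of plain k-means, and the random within-cluster pairing are all assumptions made for illustration, not the authors' implementation. Sn and BACC are computed with their usual definitions (sensitivity, and the mean of sensitivity and specificity).

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (illustrative); returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def cluster_smote(X_min, n_new, k=2, seed=0):
    """Hypothetical cluster-then-interpolate oversampler for the minority class.

    Assumption: pairs are drawn at random inside a cluster, a simplification
    of the nearest-neighbour pairing used by standard SMOTE.
    """
    rng = np.random.default_rng(seed)
    labels = kmeans(X_min, k, seed=seed)
    # Only clusters with at least two members can be interpolated.
    usable = [np.flatnonzero(labels == j) for j in range(k)]
    usable = [idx for idx in usable if len(idx) >= 2]
    if not usable:
        raise ValueError("no cluster has two or more minority samples")
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        idx = usable[rng.integers(len(usable))]
        a, b = rng.choice(idx, size=2, replace=False)
        gap = rng.random()  # position along the segment from a to b
        synthetic[i] = X_min[a] + gap * (X_min[b] - X_min[a])
    return synthetic

def sn_bacc(y_true, y_pred):
    """Sensitivity (Sn) and balanced accuracy (BACC) for labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    return float(sn), float((sn + sp) / 2)
```

In a pipeline of the kind the abstract outlines, the synthetic rows would be appended to the training set only, leaving the held-out sample in each jackknife round untouched, so that Sn and BACC are measured on real sequences.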

Original language: English
Article number: 106523
Journal: Computers in Biology and Medicine
Volume: 153
Publication status: Published - Feb 2023

Keywords

  • Cluster-SMOTE
  • Essential gene
  • Human cell lines
  • Support vector machine
