IDP-Seq2Seq: Identification of intrinsically disordered regions based on sequence to sequence learning

Yi Jun Tang; Yi He Pang; Bin Liu

doi:10.1093/bioinformatics/btaa667


Yi Jun Tang, Yi He Pang, Bin Liu^*

^*Corresponding author for this work

School of Computer Science

Beijing Institute of Technology

Research output: Contribution to journal › Article › peer-review

117 Citations (Scopus)

Abstract

Motivation: Related to many important biological functions, intrinsically disordered regions (IDRs) are widely distributed in proteins. Accurate prediction of IDRs is critical for protein structure and function analysis. However, the existing computational methods construct the predictive models solely in the sequence space, failing to convert the sequence space into the 'semantic space' to reflect the structure characteristics of proteins. Furthermore, although the length-dependent predictors showed promising results, new fusion strategies should be explored to improve their predictive performance and generalization.

Results: In this study, we applied the Sequence to Sequence Learning (Seq2Seq) derived from natural language processing (NLP) to map protein sequences to 'semantic space' to reflect the structure patterns with the help of predicted residue-residue contacts (CCMs) and other sequence-based features. Furthermore, the Attention mechanism was used to capture the global associations between all residue pairs in the proteins. Three length-dependent predictors were constructed: IDP-Seq2Seq-L for long disordered region prediction, IDP-Seq2Seq-S for short disordered region prediction and IDP-Seq2Seq-G for both long and short disordered region predictions. Finally, these three predictors were fused into one predictor called IDP-Seq2Seq to improve the discriminative power and generalization. Experimental results on four independent test datasets and the CASP test dataset showed that IDP-Seq2Seq is insensitive to the ratios of long and short disordered regions and outperforms other competing methods.
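The abstract states that the three length-dependent predictors are fused into one, but it does not say how. As a rough illustration only, the sketch below assumes the simplest possible rule: an unweighted average of per-residue disorder scores, followed by a fixed threshold. The function names, the 0.5 threshold, and the averaging rule are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def fuse_scores(long_scores, short_scores, generic_scores):
    """Average per-residue scores from three predictors.

    ASSUMPTION: unweighted averaging stands in for the paper's
    fusion strategy, which the abstract does not describe.
    """
    if not (len(long_scores) == len(short_scores) == len(generic_scores)):
        raise ValueError("all predictors must score the same number of residues")
    return [mean(triple) for triple in zip(long_scores, short_scores, generic_scores)]


def call_disorder(fused_scores, threshold=0.5):
    """Label a residue as disordered when its fused score reaches the threshold.

    ASSUMPTION: the 0.5 cutoff is illustrative, not a published value.
    """
    return [score >= threshold for score in fused_scores]


def disordered_regions(labels):
    """Group consecutive disordered residues into (start, end) pairs, end exclusive."""
    regions = []
    start = None
    for index, is_disordered in enumerate(labels):
        if is_disordered and start is None:
            start = index  # a new region opens here
        elif not is_disordered and start is not None:
            regions.append((start, index))  # the open region closes
            start = None
    if start is not None:
        regions.append((start, len(labels)))  # region runs to the sequence end
    return regions
```

Given one score list per predictor for the same protein, `disordered_regions(call_disorder(fuse_scores(l, s, g)))` would return the predicted region boundaries. A learned or weighted fusion could replace `fuse_scores` without changing the other two helpers.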

Original language	English
Pages (from-to)	5177-5186
Number of pages	10
Journal	Bioinformatics
Volume	36
Issue number	21
DOI	https://doi.org/10.1093/bioinformatics/btaa667
Publication status	Published - 1 Nov 2020


Cite this

@article{dd4804c8dbf94a859f7c4adb07fe1ecd,

title = "IDP-Seq2Seq: Identification of intrinsically disordered regions based on sequence to sequence learning",

abstract = "Motivation: Related to many important biological functions, intrinsically disordered regions (IDRs) are widely distributed in proteins. Accurate prediction of IDRs is critical for the protein structure and function analysis. However, the existing computational methods construct the predictive models solely in the sequence space, failing to convert the sequence space into the 'semantic space' to reflect the structure characteristics of proteins. Furthermore, although the length-dependent predictors showed promising results, new fusion strategies should be explored to improve their predictive performance and the generalization. Results: In this study, we applied the Sequence to Sequence Learning (Seq2Seq) derived from natural language processing (NLP) to map protein sequences to 'semantic space' to reflect the structure patterns with the help of predicted residue-residue contacts (CCMs) and other sequence-based features. Furthermore, the Attention mechanism was used to capture the global associations between all residue pairs in the proteins. Three length-dependent predictors were constructed: IDP-Seq2Seq-L for long disordered region prediction, IDP-Seq2Seq-S for short disordered region prediction and IDP-Seq2Seq-G for both long and short disordered region predictions. Finally, these three predictors were fused into one predictor called IDP-Seq2Seq to improve the discriminative power and generalization. Experimental results on four independent test datasets and the CASP test dataset showed that IDP-Seq2Seq is insensitive with the ratios of long and short disordered regions and outperforms other competing methods.",

author = "Tang, {Yi Jun} and Pang, {Yi He} and Liu, Bin",

year = "2020",

month = nov,

day = "1",

doi = "10.1093/bioinformatics/btaa667",

language = "English",

volume = "36",

pages = "5177--5186",

journal = "Bioinformatics",

issn = "1367-4803",

publisher = "Oxford University Press",

number = "21",

}

TY - JOUR

T1 - IDP-Seq2Seq

T2 - Identification of intrinsically disordered regions based on sequence to sequence learning

AU - Tang, Yi Jun

AU - Pang, Yi He

AU - Liu, Bin

PY - 2020/11/1

Y1 - 2020/11/1

N2 - Motivation: Related to many important biological functions, intrinsically disordered regions (IDRs) are widely distributed in proteins. Accurate prediction of IDRs is critical for the protein structure and function analysis. However, the existing computational methods construct the predictive models solely in the sequence space, failing to convert the sequence space into the 'semantic space' to reflect the structure characteristics of proteins. Furthermore, although the length-dependent predictors showed promising results, new fusion strategies should be explored to improve their predictive performance and the generalization. Results: In this study, we applied the Sequence to Sequence Learning (Seq2Seq) derived from natural language processing (NLP) to map protein sequences to 'semantic space' to reflect the structure patterns with the help of predicted residue-residue contacts (CCMs) and other sequence-based features. Furthermore, the Attention mechanism was used to capture the global associations between all residue pairs in the proteins. Three length-dependent predictors were constructed: IDP-Seq2Seq-L for long disordered region prediction, IDP-Seq2Seq-S for short disordered region prediction and IDP-Seq2Seq-G for both long and short disordered region predictions. Finally, these three predictors were fused into one predictor called IDP-Seq2Seq to improve the discriminative power and generalization. Experimental results on four independent test datasets and the CASP test dataset showed that IDP-Seq2Seq is insensitive with the ratios of long and short disordered regions and outperforms other competing methods.

AB - Motivation: Related to many important biological functions, intrinsically disordered regions (IDRs) are widely distributed in proteins. Accurate prediction of IDRs is critical for the protein structure and function analysis. However, the existing computational methods construct the predictive models solely in the sequence space, failing to convert the sequence space into the 'semantic space' to reflect the structure characteristics of proteins. Furthermore, although the length-dependent predictors showed promising results, new fusion strategies should be explored to improve their predictive performance and the generalization. Results: In this study, we applied the Sequence to Sequence Learning (Seq2Seq) derived from natural language processing (NLP) to map protein sequences to 'semantic space' to reflect the structure patterns with the help of predicted residue-residue contacts (CCMs) and other sequence-based features. Furthermore, the Attention mechanism was used to capture the global associations between all residue pairs in the proteins. Three length-dependent predictors were constructed: IDP-Seq2Seq-L for long disordered region prediction, IDP-Seq2Seq-S for short disordered region prediction and IDP-Seq2Seq-G for both long and short disordered region predictions. Finally, these three predictors were fused into one predictor called IDP-Seq2Seq to improve the discriminative power and generalization. Experimental results on four independent test datasets and the CASP test dataset showed that IDP-Seq2Seq is insensitive with the ratios of long and short disordered regions and outperforms other competing methods.

UR - http://www.scopus.com/inward/record.url?scp=85095981910&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btaa667

DO - 10.1093/bioinformatics/btaa667

M3 - Article

C2 - 32702119

AN - SCOPUS:85095981910

SN - 1367-4803

VL - 36

SP - 5177

EP - 5186

JO - Bioinformatics

JF - Bioinformatics

IS - 21

ER -
