TY - JOUR
T1 - Species-specific design of artificial promoters by transfer-learning based generative deep-learning model
AU - Xia, Yan
AU - Du, Xiaowen
AU - Liu, Bin
AU - Guo, Shuyuan
AU - Huo, Yi Xin
N1 - Publisher Copyright:
© The Author(s) 2024.
PY - 2024/6/24
Y1 - 2024/6/24
N2 - Native prokaryotic promoters share common sequence patterns, but are species dependent. For understudied species with limited data, it is challenging to predict the strength of existing promoters and generate novel promoters. Here, we developed PromoGen, a collection of nucleotide language models to generate species-specific functional promoters, across dozens of species in a data and parameter efficient way. Twenty-seven species-specific models in this collection were finetuned from the pretrained model which was trained on multi-species promoters. When systematically compared with native promoters, the Escherichia coli- and Bacillus subtilis-specific artificial PromoGen-generated promoters (PGPs) were demonstrated to hold all distribution patterns of native promoters. A regression model was developed to score generated either by PromoGen or by another competitive neural network, and the overall score of PGPs is higher. Encouraged by in silico analysis, we further experimentally characterized twenty-two B. subtilis PGPs, results showed that four of tested PGPs reached the strong promoter level while all were active. Furthermore, we developed a user-friendly website to generate species-specific promoters for 27 different species by PromoGen. This work presented an efficient deep-learning strategy for de novo species-specific promoter generation even with limited datasets, providing valuable promoter toolboxes especially for the metabolic engineering of understudied microorganisms.
AB - Native prokaryotic promoters share common sequence patterns, but are species dependent. For understudied species with limited data, it is challenging to predict the strength of existing promoters and generate novel promoters. Here, we developed PromoGen, a collection of nucleotide language models to generate species-specific functional promoters, across dozens of species in a data and parameter efficient way. Twenty-seven species-specific models in this collection were finetuned from the pretrained model which was trained on multi-species promoters. When systematically compared with native promoters, the Escherichia coli- and Bacillus subtilis-specific artificial PromoGen-generated promoters (PGPs) were demonstrated to hold all distribution patterns of native promoters. A regression model was developed to score generated either by PromoGen or by another competitive neural network, and the overall score of PGPs is higher. Encouraged by in silico analysis, we further experimentally characterized twenty-two B. subtilis PGPs, results showed that four of tested PGPs reached the strong promoter level while all were active. Furthermore, we developed a user-friendly website to generate species-specific promoters for 27 different species by PromoGen. This work presented an efficient deep-learning strategy for de novo species-specific promoter generation even with limited datasets, providing valuable promoter toolboxes especially for the metabolic engineering of understudied microorganisms.
UR - http://www.scopus.com/inward/record.url?scp=85197915832&partnerID=8YFLogxK
U2 - 10.1093/nar/gkae429
DO - 10.1093/nar/gkae429
M3 - Article
C2 - 38783063
AN - SCOPUS:85197915832
SN - 0305-1048
VL - 52
SP - 6145
EP - 6157
JO - Nucleic Acids Research
JF - Nucleic Acids Research
IS - 11
ER -