Funding: This research was funded by the National Key R&D Program of China (2022YFC2106000), the National Natural Science Foundation of China (32300529, 32201242, 12326611), the Tianjin Synthetic Biotechnology Innovation Capacity Improvement Projects (TSBICIP-PTJS-001, TSBICIP-PTJJ-007), the Major Program of Haihe Laboratory of Synthetic Biology (22HHSWSS00021), and the Strategic Priority Research Program of the Chinese Academy of Sciences (XDC0120201).
Abstract: Proteins play a pivotal role in coordinating the functions of organisms and essentially govern their traits, as the arrangement of diverse amino acids gives rise to a multitude of folded configurations within peptide chains. Despite dynamic changes in the amino acid composition of individual proteins (referred to as AAP) and great variance in protein expression levels under different conditions, our study, using transcriptomics data from four model organisms, uncovers surprising stability in the overall amino acid composition of the total cellular proteins (referred to as AACell). Although this value may vary between species, we observed no significant differences among distinct strains of the same species. This indicates that organisms enforce system-level constraints to maintain a consistent AACell, even amid fluctuations in AAP and protein expression. Further exploration of this phenomenon promises insights into the mechanisms that orchestrate cellular protein expression and adaptation to varying environmental challenges.
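The distinction between AAP and AACell drawn in this abstract can be illustrated with a small sketch. It computes the composition of a single protein and an abundance-weighted composition over all proteins. The sequences, abundance values, function names, and the choice to weight each protein by abundance times residue count are illustrative assumptions, not the study's actual pipeline.

```python
from collections import Counter


def aa_composition(seq):
    """Fraction of each amino acid in one protein sequence (AAP)."""
    counts = Counter(seq)
    total = len(seq)
    return {aa: n / total for aa, n in counts.items()}


def cell_composition(sequences, abundance):
    """Abundance-weighted amino acid composition over all proteins (AACell).

    Each protein contributes its residue counts multiplied by its
    abundance (for example, a transcript level used as a proxy).
    """
    totals = Counter()
    for protein_id, seq in sequences.items():
        weight = abundance.get(protein_id, 0.0)
        for aa, n in Counter(seq).items():
            totals[aa] += weight * n
    grand_total = sum(totals.values())
    return {aa: value / grand_total for aa, value in totals.items()}


# Hypothetical toy data: two short "proteins" and their relative abundance.
sequences = {"protA": "MKKA", "protB": "MAAL"}
abundance = {"protA": 1.0, "protB": 3.0}

aap_a = aa_composition(sequences["protA"])
aacell = cell_composition(sequences, abundance)
```

In this toy case lysine makes up half of protA but only one eighth of the weighted total, which shows how individual proteins can differ strongly from the cell-level composition.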
Funding: National Key Research and Development Program of China (2021YFC2100700), National Natural Science Foundation of China (32300529, 32201242, 12326611), Tianjin Synthetic Biotechnology Innovation Capacity Improvement Projects (TSBICIP-PTJS-001, TSBICIP-PTJS-002, TSBICIP-PTJJ-007), Major Program of Haihe Laboratory of Synthetic Biology (22HHSWSS00021), and Strategic Priority Research Program of the Chinese Academy of Sciences (XDB0480000).
Abstract: Genome-scale metabolic models (GEMs) have been widely employed to predict microorganism behaviors. However, GEMs only consider stoichiometric constraints, leading to a linear increase in simulated growth and product yields as substrate uptake rates rise. This divergence from experimental measurements prompted the creation of enzyme-constrained models (ecModels) for various species, successfully enhancing chemical production. Building upon studies that allocate macromolecule resources, we developed a Python-based workflow (ECMpy) that constructs an enzyme-constrained model. This involves directly imposing an enzyme amount constraint in the GEM and accounting for protein subunit composition in reactions. However, this procedure demands manual collection of enzyme kinetic parameters and subunit composition details, making it rather user-unfriendly. In this work, we have enhanced the ECMpy toolbox to version 2.0, broadening its scope to automatically generate ecGEMs for a wider array of organisms. ECMpy 2.0 automates the retrieval of enzyme kinetic parameters and employs machine learning to predict these parameters, which significantly enhances parameter coverage. Additionally, ECMpy 2.0 introduces common analytical and visualization features for ecModels, making computational results more accessible to users. Furthermore, ECMpy 2.0 integrates three published algorithms that exploit ecModels to uncover potential targets for metabolic engineering. ECMpy 2.0 is available at https://github.com/tibbdc/ECMpy or as a pip package (https://pypi.org/project/ECMpy/).
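The enzyme amount constraint described in this abstract can be written as a sketch of the optimization problem. The form below is a common way to state an enzyme budget on top of flux balance analysis; whether ECMpy 2.0 uses exactly this form is an assumption, and all symbols are introduced here for illustration. S is the stoichiometric matrix, v_i the flux of reaction i, MW_i the molecular weight of its enzyme (summed over subunits according to their composition), k_cat,i its turnover number, and E_total the enzyme mass available per unit of biomass.

```latex
\begin{aligned}
\max_{v}\quad & v_{\mathrm{biomass}} \\
\text{s.t.}\quad & S\,v = 0, \\
& v_i^{\mathrm{lb}} \le v_i \le v_i^{\mathrm{ub}}, \\
& \sum_{i} \frac{v_i \, MW_i}{k_{\mathrm{cat},i}} \le E_{\mathrm{total}}
\end{aligned}
```

With only the first two constraints, the optimal biomass flux scales linearly with the substrate uptake bound. The last inequality caps the total enzyme mass, so the predicted growth levels off once the enzyme budget, rather than substrate uptake, becomes limiting.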
Funding: the National Key Research and Development Program of China (2020YFA0908300), the National Natural Science Foundation of China (32201242), the Youth Innovation Promotion Association CAS, the Innovation Fund of Haihe Laboratory of Synthetic Biology (22HHSWSS00021), the Tianjin Synthetic Biotechnology Innovation Capacity Improvement Project (TSBICIP-PTJS-001, TSBICIP-CXRC-018, and TSBICIP-PTJJ-007), and the China Postdoctoral Science Foundation (2022M713328).
Abstract: Enzyme commission (EC) numbers, which associate a protein sequence with the biochemical reactions it catalyzes, are essential for the accurate understanding of enzyme functions and cellular metabolism. Many ab initio computational approaches have been proposed to predict EC numbers for given input protein sequences. However, the prediction performance (accuracy, recall, and precision), usability, and efficiency of existing methods degrade severely when dealing with recently discovered proteins, leaving much room for improvement. Here, we report HDMLF, a hierarchical dual-core multitask learning framework for accurately predicting EC numbers based on novel deep learning techniques. HDMLF is composed of an embedding core and a learning core; the embedding core adopts the latest protein language model for protein sequence embedding, and the learning core conducts the EC number prediction. Specifically, HDMLF is designed on the basis of a gated recurrent unit framework to perform EC number prediction in a multi-objective, hierarchical, multitask manner. Additionally, we introduced an attention layer to optimize the EC prediction and employed a greedy strategy to integrate and fine-tune the final model. Comparative analyses against 4 representative methods demonstrate that HDMLF stably delivers the highest performance, improving accuracy and F1 score by 60% and 40% over the state of the art, respectively. An additional case study, in which tyrB is predicted to compensate for the loss of aspartate aminotransferase aspC, as reported in a previous experimental study, shows that our model can also be used to uncover enzyme promiscuity. Finally, we established a web platform, ECRECer (https://ecrecer.biodesign.ac.cn), using an entirely cloud-based serverless architecture and provided an offline bundle to improve usability.
Funding: funded by the National Key Research and Development Program of China (2018YFA0900300, 2020YFA0908301), the National Natural Science Foundation of China (32201188), the Tianjin Synthetic Biotechnology Innovation Capacity Improvement Project (TSBICIP-CXRC-060, TSBICIP-PTJS-001, and TSBICIP-PTJS-013), and the China Postdoctoral Science Foundation (2022M723341).
Abstract: Metabolic network models have become increasingly precise and accurate as the most widespread and practical digital representations of living cells. Their predictive capabilities have been significantly expanded in recent years by integrating cellular resource and abiotic constraints. However, if unreasonable modeling methods are adopted because biological knowledge is not taken into account, conflicts between stoichiometric and other constraints, such as thermodynamic feasibility and enzyme resource availability, lead to distorted predictions. In this work, we investigated a prediction anomaly of EcoETM, a constraint-based metabolic network model, and introduced the idea of enzyme compartmentalization into the analysis process. Through rational combination of reactions, we avoid the false prediction of pathway feasibility caused by the unrealistic assumption of free intermediate metabolites. This allowed us to correct the pathway structures of L-serine and L-tryptophan. A specific analysis explains how EcoETM-like models can be applied and demonstrates their potential and value in correcting predicted pathway structures by resolving the conflict between different constraints and incorporating the evolved roles of enzymes as reaction compartments. Notably, this work also reveals the trade-off between product yield and thermodynamic feasibility. Our work is of great value for the structural improvement of constraint-based models.
Funding: funded by the National Key Research and Development Program of China (2018YFA0901400), the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB0480000), the Tianjin Synthetic Biotechnology Innovation Capacity Improvement Projects (TSBICIP-PTJS-001), and the Ministry of Science of China and Youth Innovation Promotion Association CAS (292023000018).
Abstract: Pseudomonas stutzeri A1501 is a non-fluorescent, gram-negative denitrifying bacterium. Although it is a prominent strain in agriculture and bioengineering, its metabolic capabilities, specifically central metabolism and substrate utilization, are still not comprehensively understood, and further studies are required to gain detailed insight into these aspects. This study reconstructed a genome-scale metabolic network model for P. stutzeri A1501 and conducted extensive curation, including correcting energy generation cycles, respiratory chains, and biomass composition. The final model, iQY1018, covers more genes and reactions and has higher prediction accuracy than the previously published model iPB890. The ability to utilize 71 carbon sources was investigated by BIOLOG experiments and used to validate the model quality. The model's prediction accuracy for substrate utilization by P. stutzeri A1501 reached 90%. Model analysis revealed new capabilities in central metabolism and predicted that the strain is a suitable chassis for the production of acetyl-CoA-derived products. This work provides an updated, high-quality model of P. stutzeri A1501 for further research and will enhance our understanding of its metabolic capabilities.
Funding: Publication costs are funded by the National Key Research and Development Program of China (2018YFA0900300, 2018YFA0901400), the International Partnership Program of Chinese Academy of Sciences (153D31KYSB20170121), and the Tianjin Synthetic Biotechnology Innovation Capacity Improvement Project (TSBICIP-PTJS-001, TSBICIP-KJGG-005).
Abstract: Escherichia coli is a model organism with a clear genetic background that is widely used in metabolic engineering and synthetic biology research. To gain a complete picture of the complex metabolic and regulatory interactions in E. coli, researchers often need to retrieve information from various databases that cover different types of interactions. A central one-stop service integrating various molecular interactions in E. coli would be helpful for the community. We constructed a database called E. coli integrated network (EcoIN) by integrating known molecular interaction information from databases and literature. EcoIN contains nearly 160,000 pairs of interactions, and users can easily search the different types of interacting partners for a metabolite, gene, or protein, and thus gain access to a more comprehensive interaction map of E. coli. To illustrate the application of EcoIN, we used the full path algorithm to identify metabolic feedback/feedforward regulatory loops having at least two different types of regulatory interactions. Applying this algorithm to analyze the regulatory loops for the amino acid biosynthetic pathways, we found some multi-step regulation loops that may affect the metabolic flux and are potential new engineering targets. The EcoIN database is freely accessible at http://ecoin.ibiodesign.net/ and analysis codes are available at GitHub: https://github.com/maozhitao/EcoIN.
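The loop search described in this abstract can be sketched on a toy interaction network. The abstract does not specify the full path algorithm, so the code below substitutes a plain depth-first enumeration of simple cycles and keeps only cycles that contain at least two distinct interaction types. The node names, interaction labels, function name, and length limit are all hypothetical.

```python
def find_loops(edges, start, max_len=6):
    """Enumerate simple cycles through `start` with >= 2 interaction types.

    `edges` is a list of (source, target, interaction_type) tuples.
    Returns a list of (node_path, interaction_types) pairs.
    """
    graph = {}
    for src, dst, kind in edges:
        graph.setdefault(src, []).append((dst, kind))

    loops = []

    def dfs(node, path, kinds):
        for nxt, kind in graph.get(node, []):
            if nxt == start:
                # The cycle closes; keep it only if it mixes interaction types.
                all_kinds = kinds + [kind]
                if len(set(all_kinds)) >= 2:
                    loops.append((path + [start], all_kinds))
            elif nxt not in path and len(path) < max_len:
                dfs(nxt, path + [nxt], kinds + [kind])

    dfs(start, [start], [])
    return loops


# Hypothetical toy network: a gene, its enzyme, a product metabolite,
# and a regulator that senses the metabolite and acts on the gene.
edges = [
    ("geneA", "enzymeA", "expression"),
    ("enzymeA", "metX", "catalysis"),
    ("metX", "tfR", "allosteric"),
    ("tfR", "geneA", "transcriptional"),
    ("metX", "metY", "catalysis"),
    ("metY", "metX", "catalysis"),
]

loops_from_gene = find_loops(edges, "geneA")
loops_from_met = find_loops(edges, "metX")
```

Starting from metX, the purely catalytic cycle metX to metY and back is discarded because it contains a single interaction type, while the four-step loop through the regulator is kept.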
Funding: the National Key Research and Development Program of China (grant number 2018YFA0900300), the International Partnership Program of Chinese Academy of Sciences (grant number 153D31KYSB20170121), the Youth Innovation Promotion Association CAS, and the Tianjin Synthetic Biotechnology Innovation Capacity Improvement Project (grant numbers TSBICIP-PTJS-001 and TSBICIP-CXRC-018).
Abstract: Revolutionary breakthroughs in artificial intelligence (AI) and machine learning (ML) have had a profound impact on a wide range of scientific disciplines, including the development of artificial cell factories for biomanufacturing. In this paper, we review the latest studies on the application of data-driven methods for the design of new proteins, pathways, and strains. We first briefly introduce the various types of data and databases relevant to industrial biomanufacturing, which are the basis for data-driven research. Different types of algorithms, including traditional ML and more recent deep learning methods, are also presented. We then demonstrate how these data-based approaches can be applied to address various issues in cell factory development using examples from recent studies, including the prediction of protein function, improvement of metabolic models, estimation of missing kinetic parameters, design of non-natural biosynthesis pathways, and pathway optimization. In the last section, we discuss the current limitations of these data-driven approaches and propose that data-driven methods should be integrated with mechanistic models to complement each other and facilitate the development of synthetic strains for industrial biomanufacturing.