Funding: supported in part by the National Natural Science Foundation of China (22033001), the National Key R&D Program of China (2022YFA1303700), and the Chinese Academy of Medical Sciences (2021-I2M-5-014).
Abstract: Proteins are integral actors in essential life processes, which makes protein research a fundamental domain with the potential to advance pharmaceuticals and disease investigation. Within protein research, there is a pressing need to uncover protein functions and untangle their intricate mechanistic underpinnings. Because experimental investigations are costly and limited in throughput, computational models offer a promising alternative for accelerating protein function annotation. In recent years, protein pre-training models have advanced notably across multiple prediction tasks, which highlights their promise for tackling the intricate downstream task of protein function prediction. In this review, we elucidate the historical evolution and research paradigms of computational methods for predicting protein function. Subsequently, we summarize the progress in protein and molecule representation as well as feature extraction techniques. Furthermore, we assess the performance of machine learning-based algorithms across various objectives in protein function prediction, thereby offering a comprehensive perspective on the progress within this field.
Funding: supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) (Grant Nos. URF/1/1976-04, URF/1/1976-06).
Abstract: The number of available protein sequences in public databases is increasing exponentially. However, a significant percentage of these sequences lack functional annotation, which is essential for understanding how biological systems operate. Here, we propose a novel method, Quantitative Annotation of Unknown STructure (QAUST), to infer protein functions, specifically Gene Ontology (GO) terms and Enzyme Commission (EC) numbers. QAUST uses three sources of information: structure information encoded by global and local structure similarity search, biological network information inferred from protein–protein interaction data, and sequence information extracted from functionally discriminative sequence motifs. These three pieces of information are combined by consensus averaging to make the final prediction. Our approach has been tested on 500 protein targets from the Critical Assessment of Functional Annotation (CAFA) benchmark set. The results show that our method provides accurate functional annotation and outperforms other prediction methods based on sequence similarity search or threading. We further demonstrate that a previously unknown function of the human tripartite motif-containing 22 (TRIM22) protein predicted by QAUST can be experimentally validated.
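The consensus-averaging step described in this abstract can be illustrated with a minimal sketch. The abstract does not specify weights or how terms missing from one source are handled, so the sketch assumes equal weights and treats a missing term as a score of 0. The function name `consensus_average`, the example GO terms, and all scores are hypothetical and are not taken from QAUST.

```python
from collections import defaultdict


def consensus_average(source_scores):
    """Average per-term confidence scores over all sources.

    Assumption: equal weights, and a term absent from a source
    contributes 0 for that source.
    """
    totals = defaultdict(float)
    for scores in source_scores:
        for term, score in scores.items():
            totals[term] += score
    n_sources = len(source_scores)
    return {term: total / n_sources for term, total in totals.items()}


# Hypothetical confidence scores from the three information sources.
structure = {"GO:0003824": 0.9, "GO:0005515": 0.4}
network = {"GO:0005515": 0.8}
motif = {"GO:0003824": 0.6, "GO:0016787": 0.3}

consensus = consensus_average([structure, network, motif])

# Rank candidate terms by their consensus score, highest first.
ranked = sorted(consensus.items(), key=lambda kv: kv[1], reverse=True)
for term, score in ranked:
    print(f"{term}\t{score:.2f}")
```

Dividing by the number of sources rather than by the number of sources that report a term penalizes terms supported by only one line of evidence; that is a design choice of this sketch, not a documented property of the method.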
Funding: supported by the National Natural Science Foundation of China (Grant Nos. 61872094 and 62272105), the Shanghai Municipal Science and Technology Major Project (Grant No. 2018SHZDZX01), the ZJ Lab, and the Shanghai Research Center for Brain Science and Brain-Inspired Intelligence Technology. Shaojun Wang and Ronghui You have been supported by the 111 Project (Grant No. B18015), the Shanghai Municipal Science and Technology Major Project (Grant No. 2017SHZDZX01), and the Information Technology Facility, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute for Biological Sciences, Chinese Academy of Sciences. Yi Xiong has been supported by the National Natural Science Foundation of China (Grant Nos. 61832019 and 62172274).
Abstract: As one of the state-of-the-art automated function prediction (AFP) methods, NetGO 2.0 integrates multi-source information to improve performance. However, it mainly utilizes proteins with experimentally supported functional annotations, without leveraging valuable information from the vast number of unannotated proteins. Recently, protein language models have been proposed to learn informative representations [e.g., Evolutionary Scale Modeling (ESM)-1b embeddings] from protein sequences based on self-supervision. Here, we represented each protein by its ESM-1b embedding and used logistic regression (LR) to train a new model, LR-ESM, for AFP. The experimental results showed that LR-ESM achieved performance comparable to that of the best-performing component of NetGO 2.0. Therefore, by incorporating LR-ESM into NetGO 2.0, we developed NetGO 3.0 to improve the performance of AFP extensively.
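The idea of training logistic regression on fixed protein embeddings can be sketched as follows. This is a toy illustration under stated assumptions, not the LR-ESM implementation: the two-dimensional vectors stand in for real ESM-1b embeddings (which are much higher-dimensional), the labels are invented, only a single GO term is modeled as a binary classifier, and the helper names `train_logistic_regression` and `predict_proba` are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, w, b):
    """Probability that the protein with embedding x carries the GO term."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, label in zip(X, y):
            err = predict_proba(x, w, b) - label
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


# Toy stand-ins for per-protein embeddings (assumption: 2 dimensions).
embeddings = [[1.0, 0.2], [0.9, 0.1], [0.1, 0.8], [0.2, 1.0]]
# Invented labels: 1 if the protein is annotated with the GO term, else 0.
has_term = [1, 1, 0, 0]

w, b = train_logistic_regression(embeddings, has_term)

p_similar_to_positive = predict_proba([0.95, 0.15], w, b)
p_similar_to_negative = predict_proba([0.15, 0.9], w, b)
print(round(p_similar_to_positive, 3), round(p_similar_to_negative, 3))
```

In a multi-label setting such as GO annotation, one such classifier per term (one-vs-rest) is a common arrangement; whether LR-ESM is organized exactly this way is not stated in the abstract.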
Abstract: It is well accepted that, according to the theory of protein folding, the folding energy landscape may resemble a funnel. This "folding funnel" theory has been extensively studied and is thought to play an important role in guiding the sampling process of protein folding and refinement in protein structure prediction. Here, we have investigated the relationship between the "funnel likeness" of protein folding and the size/structure of the proteins, based on a set of non-homologous proteins we recently evaluated using a statistical mechanics-based scoring function, ITScorePro. It was found that larger proteins that consist of more helix/sheet structures tend to have a higher score-Root Mean Square Deviation (RMSD) correlation (or a more funnel-like energy landscape). Another measurement in protein folding, the Z-score, has also shown some correlation with the size of the proteins. As expected, proteins with a better "folding funnel likeness" (or score-RMSD correlation) tend to have a better-predicted conformation with a lower RMSD from their native structures. These findings can be extremely valuable for the development and improvement of sampling and scoring algorithms for protein structure prediction.
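The two measurements mentioned in this abstract can be made concrete with a small sketch. The abstract does not define either quantity, so the sketch assumes the Pearson correlation between decoy scores and decoy RMSDs for the score-RMSD correlation, and the commonly used decoy-discrimination Z-score (native score minus the mean decoy score, divided by the decoy standard deviation). The decoy scores and RMSD values are invented, and lower scores are assumed to mean more favorable conformations.

```python
import math
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def native_z_score(native_score, decoy_scores):
    """Distance of the native score from the decoy mean, in decoy std units."""
    return (native_score - mean(decoy_scores)) / pstdev(decoy_scores)


# Invented decoy set: RMSD from the native structure and its score.
decoy_rmsd = [1.0, 2.0, 3.0, 4.0, 5.0]
decoy_score = [-90.0, -80.0, -75.0, -60.0, -50.0]

# A high positive correlation means scores worsen as decoys move away
# from the native structure, i.e. a more funnel-like landscape.
r = pearson(decoy_rmsd, decoy_score)

# A strongly negative Z-score means the native structure scores far
# better than the typical decoy.
z = native_z_score(-110.0, decoy_score)

print(f"score-RMSD correlation: {r:.2f}")
print(f"native Z-score: {z:.2f}")
```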