Abstract: Here, a new method integrating machine learning with Chou’s pseudo amino acid composition is proposed for in silico epitope mapping of severe acute respiratory syndrome-like coronavirus antigens. For this, a training dataset comprising 266 linear B-cell epitopes, 1,267 T-cell epitopes, and 1,280 non-epitopes was prepared. The epitope sequences were converted to numerical vectors using Chou’s pseudo amino acid composition method. The vectors were then passed to support vector machine, random forest, artificial neural network, and K-nearest neighbor algorithms for classification. The algorithm with the highest performance was selected for the epitope mapping procedure. The random forest algorithm was the most accurate classifier, with an accuracy of 0.934, followed by K-nearest neighbor, artificial neural network, and support vector machine, respectively. Furthermore, the efficacy of the epitopes predicted by the trained random forest algorithm was assessed through their antigenicity potential and their affinity to the human B-cell receptor and MHC-I/II alleles, using the VaxiJen score and molecular docking, respectively. The predicted epitopes, especially the B-cell epitopes, had high antigenicity potential and good affinity to the protein targets. According to the results, the suggested method can be considered for developing specific epitope predictor software as well as an accelerated pipeline for designing a serotype-independent vaccine against the virus.
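The conversion of a sequence into a numerical vector can be illustrated with a minimal sketch of the classic (type 1) pseudo amino acid composition. It is not the authors' implementation. The function names, the sequence-order depth `lam`, the weight of 0.05, and the use of Kyte-Doolittle hydropathy as the only physicochemical property are assumptions made for this example; the original formulation combines hydrophobicity, hydrophilicity, and side-chain mass. The sketch also assumes sequences contain only the 20 standard residues.

```python
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index, used here only as a stand-in property
# (an assumption of this sketch, not the property set of the original method).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standardize(prop):
    """Rescale a property table to zero mean and unit variance over the 20 residues."""
    mu = mean(prop.values())
    sd = pstdev(prop.values())
    return {aa: (value - mu) / sd for aa, value in prop.items()}


def pseaac(seq, props, lam=3, weight=0.05):
    """Return a (20 + lam)-dimensional pseudo amino acid composition vector."""
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")
    norm = [standardize(p) for p in props]
    # First 20 components: plain amino acid composition.
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    # Next lam components: sequence-order correlation factors theta_1..theta_lam.
    thetas = []
    for k in range(1, lam + 1):
        theta = mean(
            mean((p[a] - p[b]) ** 2 for p in norm)
            for a, b in zip(seq, seq[k:])
        )
        thetas.append(theta)
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]
```

Calling `pseaac("MKTAYIAKQRQISFVKSHFSRQ", [KYTE_DOOLITTLE])` yields a 23-component vector whose entries sum to 1, which could then be fed to any of the classifiers named above.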
Abstract: The basic unit of life is the cell. It contains many protein molecules located in its different organelles. The growth and reproduction of a cell, as well as most of its other biological functions, are performed via these proteins, but proteins in different organelles or subcellular locations have different functions. Facing the avalanche of protein sequences generated in the postgenomic age, we are challenged to develop high-throughput tools for identifying the subcellular localization of proteins based on their sequence information alone. Although considerable efforts have been made in this regard, the problem is far from solved. Most existing methods can deal with single-location proteins only, yet proteins with multiple locations may have special biological functions that are particularly important for drug targets. Using the ML-GKR (Multi-Label Gaussian Kernel Regression) method, we developed a new predictor called “pLoc-mGpos” by extracting the key information from GO (Gene Ontology) into Chou’s general PseAAC (Pseudo Amino Acid Composition) for predicting the subcellular localization of Gram-positive bacterial proteins with both single and multiple location sites. Rigorous cross-validation on the same stringent benchmark dataset indicated that the proposed pLoc-mGpos predictor is remarkably superior to “iLoc-Gpos”, the state-of-the-art predictor for the same purpose. To maximize convenience for most experimental scientists, a user-friendly web server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc-mGpos/, by which users can easily get their desired results without needing to go through the complicated mathematics involved.
Abstract: Detecting remote homology proteins is a challenging problem for both basic research and drug development. Although there are a couple of methods that deal with this problem, the benchmark datasets on which the existing methods were trained and tested contain many highly homologous samples, as reflected by the fact that the cutoff threshold was set at 95%. In this study, we reconstructed the benchmark dataset by setting the threshold at 40%, meaning that none of the proteins included in the benchmark dataset has more than 40% pairwise sequence identity with any other in the same subset. Using the new benchmark dataset, we proposed a new predictor called “dRHP-GreyFun” based on grey modeling and the functional domain approach. Rigorous cross-validations have indicated that the new predictor is superior to its counterparts in both enhancing success rates and reducing computational cost. The predictor can be downloaded from https://github.com/jcilwz/dRHP-GreyFun.
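The effect of the 40% cutoff can be sketched as a greedy redundancy filter: a sequence is kept only if its identity to every sequence already kept does not exceed the threshold. This is an illustration only, not the procedure used to build the benchmark. The function names are hypothetical, and the identity measure (longest common subsequence length divided by the longer sequence length) is a crude stand-in for an alignment-based pairwise identity. Sequences are assumed to be non-empty.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def identity(a, b):
    """Crude identity proxy: LCS length over the longer sequence length."""
    return lcs_length(a, b) / max(len(a), len(b))


def filter_redundant(seqs, threshold=0.40, identity_fn=identity):
    """Greedily keep sequences whose identity to all kept ones is <= threshold."""
    kept = []
    for seq in seqs:
        if all(identity_fn(seq, other) <= threshold for other in kept):
            kept.append(seq)
    return kept
```

Because each new sequence is compared with everything already kept, no pair in the returned list exceeds the threshold under the chosen identity measure. The result depends on input order, which is a known property of greedy filtering.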
Abstract: It has long been a dream that theoretical biology could be extensively applied in experimental biology to accelerate the understanding of the sophisticated movements in living organisms. A brave attempt and an excellent example is enzymology, in which well-established physical chemistry is used to describe, fit, predict, and improve enzyme reactions. Before modern bioinformatics, the combination of theoretical and experimental biology was mainly limited to various classic formulations. The systematic use of graphic rules by Prof. Kuo-Chen Chou and his co-workers has significantly facilitated the treatment of complicated enzyme systems. With the recent fast progress of bioinformatics, the prediction of protein structures and various protein attributes has been well established by Chou and co-workers, stimulating experimental biology. For example, their recent method for predicting protein subcellular localization (one of the important attributes of proteins) has been extensively applied by scientific colleagues, yielding many new results with thousands of citations. The research by Prof. Chou is characterized by introducing novel physical concepts as well as powerful and elegant mathematical methods into important biomedical problems, a focus throughout his career, even when facing enormous difficulties. His efforts over 50 years have greatly helped us to realize the dream of making “theoretical and experimental biology in one”.
Prof. Richard Giege is well known for his multidisciplinary research combining physics, chemistry, enzymology, and molecular biology. His major focus of study is the identity of tRNAs and their interactions with aminoacyl-tRNA synthetases (aaRS), which are of critical importance to the fidelity of protein biosynthesis. He and his colleagues carried out the first crystallization of a tRNA/aaRS complex, that between tRNAAsp and AspRS from yeast. The determination of the complex structure contributed significantly to understanding the interaction of protein and RNA. Through this research, they have also found other biological functions of these small RNAs. In parallel, he has developed appropriate methods for his research, of which protein crystallogenesis, a name he coined, is an excellent example. Macromolecular crystallogenesis has now become a developed science. This contribution has accelerated the development of protein crystallography, stimulating the study of macromolecular structure and function.
Funding: the Research Management Center, Xiamen University Malaysia, under XMUM Research Program Cycle 4 (Grant No. XMUMRF/2019-C4/IECE/0012).
Abstract: Glycation is a non-enzymatic post-translational modification that attaches sugar molecules or residues to a peptide. It is clinically important in numerous age-related, metabolic, and chronic diseases such as diabetes, Alzheimer’s, and renal failure. Identification of a non-enzymatic reaction is quite challenging in research, and manual identification in the laboratory is a very costly and time-consuming process. In this research, we developed an accurate, valid, and robust model named Gly-LysPred to differentiate glycated sites from non-glycated sites. Comprehensive techniques using position-relative features are used for feature extraction. A random forest algorithm, combined with preprocessing and feature engineering techniques, was used to train the computational model. Several testing techniques, namely self-consistency testing, jackknife testing, and cross-validation testing, are used to evaluate the model. The overall accuracy of the model under self-consistency, jackknife, and cross-validation testing was 100%, 99.92%, and 99.88%, with MCC values of 1.00, 0.99, and 0.997, respectively. A user-friendly web server has also been developed to implement the whole procedure. The results suggest that these feature vectorization methods can play a critical role in other web servers developed to classify lysine glycation.
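The two metrics reported above, accuracy and the Matthews correlation coefficient (MCC), can both be computed from the four confusion-matrix counts of a binary classifier. The following is a minimal sketch. The function names are chosen for this example, and returning 0.0 when the MCC denominator is zero is a common convention assumed here, not something stated in the abstract.

```python
from math import sqrt


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to +1."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: undefined MCC is reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

For instance, 45 true positives, 45 true negatives, 5 false positives, and 5 false negatives give an accuracy of 0.9 but an MCC of 0.8, which shows why MCC is usually reported alongside accuracy.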
Abstract: A systematic introduction is presented to recent advances in predicting protein subcellular localization in multi-label systems, where the constituent proteins may simultaneously occur at, or move between, two or more location sites and hence have exceptional biological functions worthy of special notice. Each predictor included in this review has a user-friendly web server, by which the majority of experimental scientists can easily acquire their desired data without needing to go through the complicated mathematics involved.
Abstract: The biological principle, or detailed mechanism, of the pandemic coronavirus disease 2019 (COVID-19) has been investigated and analyzed using the topological entropy approach. The findings thus obtained have provided very useful clues and information for developing vaccines against COVID-19 that are both powerful and safe.