Abstract: The advantages and disadvantages of the genetic algorithm (GA) and the back-propagation (BP) algorithm are introduced. A neural network based on the GA-BP algorithm, which combines the advantages of BP and GA, is proposed and applied to the prediction of protein secondary structure. Training and prediction are carried out separately for each of four structural classes of proteins in order to obtain a higher prediction rate: the highest prediction rate is 75.65%, and the average prediction rate is 65.04%.
Abstract: Algorithms based on combination learning are usually superior to a single classification algorithm on the task of protein secondary structure prediction. However, the assignment of weights to the base classifiers usually lacks decision-making evidence. In this paper, we propose a protein secondary structure prediction method with a dynamic self-adaptive combination strategy based on entropy, where the weights are assigned according to the entropy of the posterior probabilities output by the base classifiers. A higher entropy value means a lower weight for the base classifier. The final structure prediction is decided by the weighted combination of posterior probabilities. Extensive experiments on the CB513 dataset demonstrate that the proposed method outperforms existing methods and effectively improves prediction performance.
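The entropy-based weighting described in this abstract can be illustrated with a small sketch. The abstract only states that higher entropy means lower weight; the specific rule used here (weight proportional to the maximum possible entropy minus the classifier's entropy) and all function names are assumptions made for illustration, not the authors' formula.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a discrete probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weighted_combination(posteriors):
    """Combine the posteriors of several base classifiers for one residue.

    posteriors: list of probability vectors, one per base classifier.
    Assumed weighting rule: w_i is proportional to (log K - H_i), so a
    more uncertain (higher-entropy) classifier receives a lower weight.
    """
    n_classes = len(posteriors[0])
    max_entropy = math.log(n_classes)
    # Clamp at zero to guard against tiny negative values from rounding.
    raw = [max(0.0, max_entropy - entropy(p)) for p in posteriors]
    total = sum(raw)
    if total == 0.0:
        # Every classifier is maximally uncertain: fall back to equal weights.
        weights = [1.0 / len(posteriors)] * len(posteriors)
    else:
        weights = [r / total for r in raw]
    combined = [
        sum(w * p[k] for w, p in zip(weights, posteriors))
        for k in range(n_classes)
    ]
    return weights, combined


# Two hypothetical base classifiers scoring one residue over (H, E, C).
confident = [0.90, 0.05, 0.05]
uncertain = [0.40, 0.35, 0.25]
weights, combined = entropy_weighted_combination([confident, uncertain])
predicted = max(range(3), key=lambda k: combined[k])
```

Because the weights are recomputed for every residue, a classifier that is confident on one residue and unsure on the next contributes differently at each position, which is one way to read the "dynamic self-adaptive" strategy.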
Abstract: The secondary structure of a protein is critical for establishing a link between the protein primary and tertiary structures. For this reason, it is important to design methods for accurate protein secondary structure prediction. Most of the existing computational techniques for protein structural and functional prediction are based on machine learning with shallow frameworks. Different deep learning architectures have already been applied to the protein secondary structure prediction problem. In this study, deep learning based models, i.e., a convolutional neural network and a long short-term memory network, are proposed for protein secondary structure prediction. The input to the proposed models is amino acid sequences derived from the CulledPDB dataset. Hyperparameter tuning with cross-validation was employed to obtain the best parameters for the proposed models. The proposed models enable effective processing of amino acid sequences and attain approximately 87.05% and 87.47% Q3 accuracy for the convolutional neural network and long short-term memory models, respectively.
Abstract: Protein structure prediction is one of the most essential objectives in theoretical chemistry and bioinformatics, as it is of vital importance in medicine, biotechnology and other fields. Protein secondary structure prediction (PSSP) plays a significant role in the prediction of protein tertiary structure, as it bridges the gap between protein primary sequences and tertiary structure prediction. Protein secondary structures are classified in two ways: a 3-state classification and an 8-state classification. Predicting the 3 states and the 8 states of secondary structure from protein sequences are called the Q3 prediction and Q8 prediction problems, respectively. The 8 classes reveal more precise structural information for a variety of applications than the 3 classes; however, Q8 prediction has been found to be very challenging, which is why all previous work in PSSP has focused on Q3 prediction. In this paper, we develop an ensemble Machine Learning (ML) approach for Q8 PSSP to explore the performance of ensemble learning algorithms compared with that of individual ML algorithms. The ensemble members considered are well-known classifiers, namely SVM (Support Vector Machine), KNN (K-Nearest Neighbor), DT (Decision Tree), RF (Random Forest), and NB (Naïve Bayes), with two feature extraction techniques, namely LDA (Linear Discriminant Analysis) and PCA (Principal Component Analysis). Experiments were conducted to evaluate the performance of single models and ensemble models, with PCA and LDA, in Q8 PSSP. The novelty of this paper lies in the introduction of ensemble learning to the Q8 PSSP problem. The experimental results confirm that ensemble ML models are more accurate than individual ML models. They also indicate that features extracted by LDA are more effective than those extracted by PCA.
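To make the relationship between the Q8 and Q3 problems concrete, the sketch below scores one hypothetical prediction both ways. The 8-to-3 reduction table is one commonly used convention for DSSP state letters (H, G, I to helix; E, B to strand; everything else to coil) and is an assumption here; the abstract does not say which reduction, if any, the authors use. The example strings are invented.

```python
# One common DSSP 8-state to 3-state reduction (an assumption; other
# conventions exist, e.g. treating G or B as coil).
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands and bridges
    "T": "C", "S": "C", "-": "C",   # turn, bend, other -> coil
}


def reduce_to_three(states8):
    """Map a string of 8-state labels to 3-state labels."""
    return "".join(EIGHT_TO_THREE[s] for s in states8)


def per_residue_accuracy(predicted, observed):
    """Fraction of residues whose predicted state equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical 16-residue example.
observed8 = "HHHHGGTT-EEEEBS-"
predicted8 = "HHHHHHTT-EEEE-S-"

q8 = per_residue_accuracy(predicted8, observed8)
q3 = per_residue_accuracy(reduce_to_three(predicted8),
                          reduce_to_three(observed8))
print(q8, q3)  # 0.8125 0.9375
```

Confusing G with H costs accuracy under Q8 but not under Q3, which illustrates why the 8-state problem is harder: errors between closely related states are no longer forgiven.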
Abstract: Protein Secondary Structure Prediction (PSSP) is considered one of the major challenging tasks in bioinformatics, and many solutions have been proposed in an attempt to achieve more accurate prediction results. The goal of this paper is to develop and implement an intelligent system that predicts the secondary structure of a protein from its primary amino acid sequence by using five Neural Network (NN) models. These models are Feed Forward Neural Network (FNN), Learning Vector Quantization (LVQ), Probabilistic Neural Network (PNN), Convolutional Neural Network (CNN), and CNN Fine Tuning for PSSP. Two datasets are used to evaluate our approaches: the first contains 114 protein samples, and the second contains 1845 protein samples.
Funding: Project supported by the National Natural Science Foundation of China (Grant Nos 90103031, 10474041, 90403120 and 10021001), and the Nonlinear Project (973) of the NSM.
Abstract: The folding dynamics and structural characteristics of the peptides RTKAWNRQLYPEW (P1) and RTKQLYPEW (P2) are investigated in this work using all-atom simulations with CHARMM. The results show that P1, a segment of an antigen, has an α-helical folding motif, whereas P2, which is derived by deleting the four residues AWNR from P1, does not form a helix and instead adopts a β-strand. Peptide P1 also experiences a more rugged energy landscape than peptide P2. From our results, it is inferred that the antibody CD8 cytolytic T lymphocyte prefers an antigen with a β-folding structure to one with an α-helical structure.
Funding: This work was supported by the National Natural Science Foundation of China (Grant No. 60373100) and the High Technology Research and Development Programme of China (Grant No. 2002AA117010-09).
Abstract: A novel method for predicting the secondary structures of proteins from amino acid sequences is presented. Protein secondary structure seqlets, which are analogous to words in natural language, are extracted. These seqlets capture the relationship between amino acid sequence and protein secondary structure, and together form the protein secondary structure dictionary. More specifically, the dictionary is organism-specific. Protein secondary structure prediction is formulated as an integrated word segmentation and part-of-speech tagging problem. A word lattice is used to represent the results of word segmentation, and a maximum entropy model is used to calculate the probability of a seqlet being tagged as a certain secondary structure type. The method is Markovian in the seqlets, permitting efficient exact calculation of the posterior probability distribution over all possible word segmentations and their tags by the Viterbi algorithm. The optimal segmentation and its tags are taken as the result of protein secondary structure prediction. The method is applied to predict the secondary structures of proteins from four organisms and compared with the PHD method. The results show that this method outperforms PHD by about 3.9% in Q3 accuracy and 4.6% in SOV accuracy. Combining the method with locally similar protein sequences obtained by BLAST gives better predictions. The method is also tested on the 50 CASP5 target proteins, with a Q3 accuracy of 78.9% and an SOV accuracy of 77.1%. A web server for protein secondary structure prediction has been constructed and is available at http://www.insun.hit.edu.cn:81/demos/bi-ology/index.html.
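The joint segmentation-and-tagging idea in this abstract can be sketched as a Viterbi search over a lattice of candidate seqlets. This is a toy illustration under stated assumptions: the dictionary entries, probabilities, uniform tag transitions, and the names `viterbi_segment`, `seqlet_logp` and `transition_logp` are all hypothetical, and plain lookup tables stand in for the maximum entropy model the authors describe.

```python
import math


def viterbi_segment(sequence, seqlet_logp, transition_logp, max_len=4):
    """Find the best segmentation of `sequence` into tagged seqlets.

    seqlet_logp: {(seqlet, tag): log-probability of that seqlet/tag pair}
    transition_logp: {(previous_tag, tag): log-probability}
    Returns a list of (seqlet, tag) pairs covering the whole sequence.
    """
    tags = sorted({tag for _, tag in seqlet_logp})
    n = len(sequence)
    # best[i][tag] = (score, start, prev_tag) for the best path whose last
    # seqlet ends at position i and carries `tag`.
    best = [dict() for _ in range(n + 1)]
    best[0][None] = (0.0, None, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            seqlet = sequence[start:end]
            for tag in tags:
                emit = seqlet_logp.get((seqlet, tag))
                if emit is None:
                    continue  # seqlet/tag pair is not in the dictionary
                for prev_tag, (prev_score, _, _) in best[start].items():
                    if prev_tag is None:
                        trans = 0.0  # first seqlet: no transition cost
                    else:
                        trans = transition_logp.get((prev_tag, tag))
                        if trans is None:
                            continue  # transition not allowed
                    score = prev_score + trans + emit
                    if tag not in best[end] or score > best[end][tag][0]:
                        best[end][tag] = (score, start, prev_tag)
    if not best[n]:
        raise ValueError("no segmentation covers the sequence")
    # Backtrack from the best-scoring final tag.
    tag = max(best[n], key=lambda t: best[n][t][0])
    pos, path = n, []
    while pos > 0:
        _, start, prev_tag = best[pos][tag]
        path.append((sequence[start:pos], tag))
        pos, tag = start, prev_tag
    path.reverse()
    return path


# Hypothetical toy dictionary of seqlets tagged H (helix), E (strand), C (coil).
seqlet_logp = {
    ("AEL", "H"): math.log(0.6), ("AE", "H"): math.log(0.3),
    ("L", "C"): math.log(0.2), ("G", "C"): math.log(0.4),
    ("VIV", "E"): math.log(0.5), ("IV", "E"): math.log(0.2),
    ("V", "E"): math.log(0.1),
}
transition_logp = {(a, b): math.log(1 / 3) for a in "HEC" for b in "HEC"}

path = viterbi_segment("AELGVIV", seqlet_logp, transition_logp)
per_residue = "".join(tag * len(seqlet) for seqlet, tag in path)
print(path)         # [('AEL', 'H'), ('G', 'C'), ('VIV', 'E')]
print(per_residue)  # HHHCEEE
```

Keeping only the best score per (end position, tag) pair is what makes the search exact for a first-order Markov model over tags; a longer dependency would require a larger state.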
Funding: The National Natural Science Foundation of China (Grant No. 31400709 to X. C.); the National Key Technology Support Program of China (Grant No. 2013BAK06B08); the Scientific Research Fund of Zhejiang Provincial Education Department (China) (Grant No. Y201432207 to X. C.); and the Natural Science Fund of Jiangsu Province (China) (Grant No. BK20130187).
Abstract: Prediction of the protein secondary structure of Homo sapiens is one of the more important domains. Many methods have used feed-forward neural networks or SVMs combined with a sliding window. The mechanisms of these methods are too complex to yield clear and straightforward physical meanings. This paper explores population-based incremental learning (PBIL), a method that combines the mechanisms of a generational genetic algorithm with simple competitive learning. The results show that its accuracies are particularly associated with the Homo species. This new perspective reveals a number of different possibilities for performance improvements.
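For readers unfamiliar with PBIL, the following is a minimal sketch of the basic algorithm: a probability vector is sampled to produce a population, and the vector is then nudged toward the best individual of each generation. The OneMax fitness function, the parameter values, and the function names are assumptions chosen for illustration; the abstract does not describe how the authors encode secondary structure prediction as a bit string.

```python
import random


def pbil(fitness, n_bits, pop_size=50, generations=100,
         learning_rate=0.1, seed=0):
    """Basic population-based incremental learning over bit strings.

    Returns the final probability vector, the best bit string seen,
    and its fitness.
    """
    rng = random.Random(seed)
    prob = [0.5] * n_bits  # start with no preference for 0 or 1
    best_bits, best_fit = None, float("-inf")
    for _ in range(generations):
        # Sample a population from the current probability vector.
        population = [
            [1 if rng.random() < p else 0 for p in prob]
            for _ in range(pop_size)
        ]
        gen_best = max(population, key=fitness)
        gen_fit = fitness(gen_best)
        if gen_fit > best_fit:
            best_bits, best_fit = gen_best, gen_fit
        # Competitive-learning update: move each probability toward the
        # corresponding bit of this generation's best individual.
        prob = [
            (1.0 - learning_rate) * p + learning_rate * b
            for p, b in zip(prob, gen_best)
        ]
    return prob, best_bits, best_fit


def one_max(bits):
    """Toy fitness (an assumption for the demo): count of 1 bits."""
    return sum(bits)


prob, best_bits, best_fit = pbil(one_max, n_bits=20)
```

Unlike a conventional genetic algorithm, no population is carried between generations; the probability vector is the only state, which is part of what makes the learned model easy to inspect.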
Funding: This research was partially funded by Tripos Inc., the United States Department of Energy through the Sandia National Laboratories LDRD program and ASCI VIEWS Data Discovery Program contract number DE-AC04-76D000789 and the National Science Foundati
Abstract: Protein secondary structure prediction and high-throughput drug screen data mining are two important applications in bioinformatics. The data are represented in sparse feature spaces and can be unrepresentative of future data. There is certainly some noise in the data, and the noise may be significant. Supervised learners in this context will display their inherent bias toward certain solutions, generally solutions that fit the training set well. In this paper, we first describe an ensemble approach using subsampling that scales well with dataset size. A sufficient number of ensemble members trained on subsamples of the data can yield a more accurate classifier than a single classifier trained on the entire dataset. Experiments on several datasets demonstrate the effectiveness of the approach. We report results on the KDD Cup 2001 drug discovery dataset, on which our approach yields a higher weighted accuracy than the winning entry. We then extend our ensemble approach to create an over-generalized classifier for prediction by reducing the individual subsample size. The ensemble strategy using small subsamples has the effect of averaging over a wider range of hypotheses. We show that both protein secondary structure prediction and drug discovery prediction can be improved by over-generalization, specifically through the use of ensembles of small subsamples.
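The idea of an ensemble built from small subsamples can be sketched in a few lines. Everything specific here is an assumption for illustration: the base learner is a one-dimensional nearest-centroid classifier rather than whatever the authors used, subsamples are drawn per class so that every member sees each label, and the data are synthetic.

```python
import random
from collections import Counter


def train_centroids(samples):
    """Base learner: mean feature value per label. samples: [(x, label)]."""
    sums, counts = {}, Counter()
    for x, label in samples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def predict_centroid(model, x):
    """Predict the label whose centroid is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))


def train_subsample_ensemble(data, n_members, per_class, seed=0):
    """Train each member on a small random subsample of the data.

    per_class: number of examples drawn (without replacement) from each
    class for every member; small values give more varied hypotheses.
    """
    rng = random.Random(seed)
    by_label = {}
    for x, label in data:
        by_label.setdefault(label, []).append((x, label))
    members = []
    for _ in range(n_members):
        subsample = []
        for rows in by_label.values():
            subsample.extend(rng.sample(rows, min(per_class, len(rows))))
        members.append(train_centroids(subsample))
    return members


def predict_ensemble(members, x):
    """Majority vote over the ensemble members."""
    votes = Counter(predict_centroid(m, x) for m in members)
    return votes.most_common(1)[0][0]


# Synthetic, well-separated data: class 0 near 0, class 1 near 10.
data = [(i * 0.1, 0) for i in range(10)] + [(10 + i * 0.1, 1) for i in range(10)]
members = train_subsample_ensemble(data, n_members=25, per_class=2)
print(predict_ensemble(members, 1.0), predict_ensemble(members, 9.0))  # 0 1
```

Each member is cheap to train because it sees only a handful of examples, so the cost of the ensemble grows with the number of members rather than with the size of the full dataset.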
Funding: PGB was supported by a scholarship from the State Scholarships Foundation of Greece (SSF) for postdoctoral research in the Department of Cell Biology and Biophysics of the University of Athens (Machine Learning Algorithms for Bioinformatics).
Abstract: It has been shown that the number of determined membrane protein structures grows exponentially, at approximately the same rate as that of water-soluble proteins. In order to investigate the effect of this growth on the performance of prediction algorithms for both α-helical and β-barrel membrane proteins, we conducted a prospective study based on historical records. We trained separate hidden Markov models with training sets of different sizes and evaluated their performance on topology prediction for the two classes of transmembrane proteins. We show that the existing top-scoring algorithms for predicting the transmembrane segments of α-helical membrane proteins perform slightly better than those for β-barrel outer membrane proteins in all measures of accuracy. With the same rationale, a meta-analysis of the performance of secondary structure prediction algorithms indicates that existing algorithmic techniques cannot be further improved just by adding more non-homologous sequences to the training sets. The upper limit for secondary structure prediction is estimated to be no more than 70% and 80% of correctly predicted residues for single-sequence-based and multiple-sequence-based methods, respectively. Therefore, we should concentrate our efforts on utilizing new techniques for the development of even better-scoring predictors.