Abstract: The secondary structure of a protein is critical for establishing a link between the protein's primary and tertiary structures. For this reason, it is important to design methods for accurate protein secondary structure prediction. Most existing computational techniques for protein structural and functional prediction are based on machine learning with shallow frameworks. Different deep learning architectures have already been applied to the protein secondary structure prediction problem. In this study, deep learning based models, i.e., a convolutional neural network and a long short-term memory network, were proposed for protein secondary structure prediction. The input to the proposed models is amino acid sequences derived from the CulledPDB dataset. Hyperparameter tuning with cross-validation was employed to find the best parameters for the proposed models. The proposed models enable effective processing of amino acid sequences and attain approximately 87.05% and 87.47% Q3 accuracy for the convolutional neural network and long short-term memory models, respectively.
Abstract: A simple stepwise folding process has been developed to simulate RNA secondary structure formation. Modifications to the energy parameters of various loops were included in the program. Five possible types of pseudoknots, including the well-known H-type pseudoknot, were permitted to occur where reasonable. We have applied this approach to a number of RNA sequences. The prediction accuracies we obtained were higher than those in published papers.
Abstract: The advantages and disadvantages of the genetic algorithm (GA) and the BP algorithm are introduced. A neural network based on a GA-BP algorithm, which combines the advantages of BP and GA, is proposed and applied to the prediction of protein secondary structure. Training and prediction with the neural network are carried out separately for each of 4 structural classes of protein so as to obtain a higher prediction rate: the highest prediction rate is 75.65% and the average prediction rate is 65.04%.
Abstract: Protein Secondary Structure Prediction (PSSP) is considered one of the major challenging tasks in bioinformatics, and many solutions have been proposed that try to achieve more accurate prediction results. The goal of this paper is to develop and implement an intelligent system that predicts the secondary structure of a protein from its primary amino acid sequence using five Neural Network (NN) models. These models are Feed Forward Neural Network (FNN), Learning Vector Quantization (LVQ), Probabilistic Neural Network (PNN), Convolutional Neural Network (CNN), and CNN Fine Tuning for PSSP. Two datasets were used to evaluate the approaches: the first contains 114 protein samples, and the second contains 1845 protein samples.
Abstract: Algorithms based on combination learning are usually superior to a single classification algorithm on the task of protein secondary structure prediction. However, the assignment of weights to the base classifiers usually lacks decision-making evidence. In this paper, we propose a protein secondary structure prediction method with a dynamic self-adaptive combination strategy based on entropy, where the weights are assigned according to the entropy of the posterior probabilities output by the base classifiers. A higher entropy value means a lower weight for the base classifier. The final structure prediction is decided by the weighted combination of posterior probabilities. Extensive experiments on the CB513 dataset demonstrate that the proposed method outperforms existing methods and can effectively improve prediction performance.
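The entropy-based weighting described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' method: the abstract only states that higher entropy means lower weight, so the specific rule used here (weight proportional to 1 - H / log K, renormalized to sum to 1) is an assumption, as are the function names and the example posteriors.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of one posterior distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_weights(posteriors):
    """Assumed rule: raw weight = 1 - H / log(K), then renormalize.

    A confident (low-entropy) classifier gets a larger weight than an
    uncertain (high-entropy) one, as the abstract requires.
    """
    max_h = math.log(len(posteriors[0]))
    # Clamp at 0 to absorb floating-point noise for uniform posteriors.
    raw = [max(0.0, 1.0 - entropy(p) / max_h) for p in posteriors]
    total = sum(raw)
    if total <= 1e-12:  # every classifier is maximally uncertain
        return [1.0 / len(posteriors)] * len(posteriors)
    return [r / total for r in raw]

def combine(posteriors, labels=("H", "E", "C")):
    """Weighted combination of posteriors for a single residue."""
    weights = entropy_weights(posteriors)
    mixed = [
        sum(w * p[i] for w, p in zip(weights, posteriors))
        for i in range(len(labels))
    ]
    best = max(range(len(labels)), key=lambda i: mixed[i])
    return labels[best], mixed, weights

# Hypothetical posteriors over (H, E, C) from three base classifiers.
example = [
    [0.90, 0.05, 0.05],  # confident: helix
    [0.34, 0.33, 0.33],  # nearly uniform, so almost no weight
    [0.20, 0.60, 0.20],  # moderately confident: strand
]
label, mixed, weights = combine(example)
print(label, [round(w, 3) for w in weights])
```

Because the weights are recomputed for every residue, a classifier that is uncertain at one position can still dominate at another, which is one way to read the "dynamic self-adaptive" wording of the abstract.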
Abstract: Protein structure prediction is one of the most essential objectives of theoretical chemistry and bioinformatics, as it is of vital importance in medicine, biotechnology and more. Protein secondary structure prediction (PSSP) plays a significant role in the prediction of protein tertiary structure, as it bridges the gap between protein primary sequences and tertiary structure prediction. Protein secondary structures are classified under two schemes: a 3-state scheme and an 8-state scheme. Predicting the 3 states and the 8 states of secondary structure from protein sequences are called the Q3 prediction and the Q8 prediction problems, respectively. The 8 classes of secondary structure reveal more precise structural information for a variety of applications than the 3 classes. However, Q8 prediction has been found to be very challenging, which is why all previous work in PSSP has focused on Q3 prediction. In this paper, we develop an ensemble Machine Learning (ML) approach for Q8 PSSP to explore the performance of ensemble learning algorithms compared to that of individual ML algorithms in Q8 PSSP. The ensemble members considered for constructing the ensemble models are well-known classifiers, namely SVM (Support Vector Machines), KNN (K-Nearest Neighbor), DT (Decision Tree), RF (Random Forest), and NB (Naïve Bayes), with two feature extraction techniques, namely LDA (Linear Discriminant Analysis) and PCA (Principal Component Analysis). Experiments have been conducted to evaluate the performance of single models and ensemble models, with PCA and LDA, in Q8 PSSP. The novelty of this paper lies in the introduction of ensemble learning to the Q8 PSSP problem. The experimental results confirmed that ensemble ML models are more accurate than individual ML models. They also indicated that features extracted by LDA are more effective than those extracted by PCA.
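The relationship between the 8-state and 3-state problems can be made concrete with a short sketch. The reduction table below (H, G, I to helix; E, B to strand; everything else to coil) is one commonly used convention and is an assumption here, since the abstract does not say which mapping the authors adopt. The function names and the example strings are hypothetical.

```python
# Assumed reduction of the 8 DSSP states to 3 classes (one common convention).
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",  # helix types
    "E": "E", "B": "E",            # extended strand, isolated bridge
    "T": "C", "S": "C", "-": "C",  # turn, bend, coil
}

def reduce_q8_to_q3(q8):
    """Map an 8-state string to its 3-state equivalent."""
    return "".join(Q8_TO_Q3[state] for state in q8)

def per_residue_accuracy(predicted, observed):
    """Fraction of residues whose predicted state equals the observed one.

    Applied to 8-state strings this is Q8 accuracy; applied to 3-state
    strings it is Q3 accuracy.
    """
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)

# Hypothetical 16-residue example.
observed_q8 = "HHHHGGTTEEEEBS--"
predicted_q8 = "HHHHHHTSEEEEEE--"

q8 = per_residue_accuracy(predicted_q8, observed_q8)
q3 = per_residue_accuracy(reduce_q8_to_q3(predicted_q8),
                          reduce_q8_to_q3(observed_q8))
print(f"Q8 = {q8:.4f}, Q3 = {q3:.4f}")
```

In this example the same prediction scores higher on Q3 than on Q8, because confusions within a reduced class (for instance G predicted as H) stop counting as errors. That is consistent with the abstract's remark that Q8 prediction is the harder problem.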
Funding: This work was supported by the National Natural Science Foundation of China (Grant No. 60373100) and the High Technology Research and Development Programme of China (Grant No. 2002AA117010-09).
Abstract: A novel method for predicting the secondary structures of proteins from amino acid sequences is presented. Protein secondary structure seqlets, which are analogous to words in natural language, are extracted. These seqlets capture the relationship between amino acid sequence and the secondary structures of proteins and together form the protein secondary structure dictionary. Specifically, the dictionary is organism-specific. Protein secondary structure prediction is formulated as an integrated word segmentation and part-of-speech tagging problem. A word lattice is used to represent the results of the word segmentation, and a maximum entropy model is used to calculate the probability of a seqlet being tagged as a certain secondary structure type. The method is Markovian in the seqlets, permitting efficient exact calculation of the posterior probability distribution over all possible word segmentations and their tags by the Viterbi algorithm. The optimal segmentations and their tags are computed as the results of protein secondary structure prediction. The method is applied to predict the secondary structures of proteins of four organisms respectively and compared with the PHD method. The results show that the performance of this method is higher than that of PHD by about 3.9% in Q3 accuracy and 4.6% in SOV accuracy. Combining it with locally similar protein sequences obtained by BLAST gives better predictions. The method is also tested on the 50 CASP5 target proteins, with a Q3 accuracy of 78.9% and an SOV accuracy of 77.1%. A web server for protein secondary structure prediction has been constructed and is available at http://www.insun.hit.edu.cn:81/demos/bi-ology/index.html.
Funding: National Natural Science Foundation of China (Grant No. 31400709 to X. C.); National Key Technology Support Program of China (Grant No. 2013BAK06B08); Scientific Research Fund of Zhejiang Provincial Education Department (China) (Grant No. Y201432207 to X. C.); Natural Science Fund of Jiangsu Province (China) (Grant No. BK20130187).
Abstract: Prediction of the protein secondary structure of Homo sapiens is one of the more important domains. Many methods have used feed-forward neural networks or SVMs combined with a sliding window. The mechanisms of these methods are too complex to yield clear and straightforward physical meanings. This paper explores population-based incremental learning (PBIL), a method that combines the mechanisms of a generational genetic algorithm with simple competitive learning. The results show that its accuracies are particularly associated with the Homo species. This new perspective reveals a number of different possibilities for performance improvement.
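For readers unfamiliar with PBIL, the core loop can be sketched in a few lines. This is a generic, minimal version of the algorithm (a probability vector sampled into a population, then nudged toward the best sample) and is not the paper's implementation: the bit-string encoding, the toy fitness function (counting ones), the parameter values, and the omission of the optional mutation step are all assumptions made for illustration.

```python
import random

def pbil_update(probs, best, learning_rate):
    """Competitive-learning step: move each probability toward the best bit."""
    return [
        (1 - learning_rate) * p + learning_rate * b
        for p, b in zip(probs, best)
    ]

def pbil(fitness, n_bits, pop_size=30, generations=50,
         learning_rate=0.1, seed=0):
    """Minimal PBIL over bit strings; returns (best, best_fitness, probs)."""
    rng = random.Random(seed)
    probs = [0.5] * n_bits  # start with no preference for 0 or 1
    best_ever, best_fit = None, float("-inf")
    for _ in range(generations):
        # Generational step: sample a whole population from the vector.
        population = [
            [1 if rng.random() < p else 0 for p in probs]
            for _ in range(pop_size)
        ]
        leader = max(population, key=fitness)
        leader_fit = fitness(leader)
        if leader_fit > best_fit:
            best_ever, best_fit = leader, leader_fit
        probs = pbil_update(probs, leader, learning_rate)
    return best_ever, best_fit, probs

# Toy objective (an assumption): maximize the number of 1 bits.
best, score, probs = pbil(sum, n_bits=8, seed=1)
print(best, score)
```

The learned probability vector is itself the model, which may be part of why the abstract contrasts PBIL with neural networks and SVMs on interpretability: each entry can be read directly as a preference for one encoded choice.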
Funding: Supported by the Major Science and Technology Program for Water Pollution Control and Treatment of China (Nos. 2017ZX07106003-002 and 2017ZX07102004-002).
Abstract: The changes in protein secondary structures in the extracellular polymeric substances (EPS) extracted from activated sludge by four different methods were studied by analyzing the amide I region (1700–1600 cm⁻¹) of the Fourier transform infrared spectra and by a model protein test. The results showed that the molecular weight distributions of organic matter extracted by centrifugation, heating and cation exchange resin (CER) were similar, while the EPS extracted by centrifugation (Control) and by CER had similar fluorescent organic matter. The protein secondary structures of the EPS extracted by the four methods were different. The similarity of protein secondary structures between the EPS extracted by CER and the Control was the highest among the four extracted EPS. Although the EPS yield of the formaldehyde + NaOH method was the highest, its protein secondary structures had the lowest similarity to those of the Control. Additionally, the effects of centrifugation and CER extraction on the secondary structures of bovine serum albumin were lower than those of the other extraction processes. CER gives the second-highest EPS extraction yield and the greatest retention of the original secondary structure of proteins.
Funding: Project supported by the National Natural Science Foundation of China (Grant Nos. 90103031, 10474041, 90403120 and 10021001) and the Nonlinear Project (973) of the NSM.
Abstract: The folding dynamics and structural characteristics of the peptides RTKAWNRQLYPEW (P1) and RTKQLYPEW (P2) are investigated in this work using the all-atom simulation program CHARMM. The results show that P1, a segment of an antigen, has an α-helix folding motif, whereas P2, which is derived by deleting the four residues AWNR from peptide P1, does not form a helix and instead presents a β-strand. In addition, peptide P1 experiences a more rugged energy landscape than peptide P2. From our results, it is inferred that the antibody CD8 cytolytic T lymphocyte prefers an antigen with a β-folding structure to one with an α-helical structure.
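Abstract: To meet the demands of different food categories for different functional properties of soybean protein isolate (SPI), this study used infrared spectroscopy to rapidly collect data from 70 groups of SPI treated at different pH values and examined the effect of pH changes on the content of SPI structures. The infrared spectral data were preprocessed with mean centering, multiplicative scatter correction, standard normal variate transformation, and normalization algorithms; characteristic bands were extracted based on two-dimensional correlation infrared spectroscopy; and partial least squares (PLS) and arithmetic optimization algorithm-random forests (AOA-RF) were then used to build prediction models for SPI structure and content under different pH conditions. The results showed that, after combined mean centering and multiplicative scatter correction, the relative standard deviations of the α-helix, β-sheet, β-turn, and random coil models were 1.29%, 1.60%, 1.37%, and 7.28%, respectively, and that this combination gave the best preprocessing of the spectral data. The best model for predicting α-helix and β-sheet content was AOA-RF (characteristic bands), with calibration-set coefficients of determination of 0.9350 and 0.9266 and prediction-set coefficients of determination of 0.8568 and 0.8701. The best model for predicting β-turn and random coil content was PLS (characteristic bands), with calibration-set coefficients of determination of 0.9154 and 0.8817 and prediction-set coefficients of determination of 0.8913 and 0.7843. These results can provide theoretical support for rapid product quality inspection and process condition control in industrial production.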
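Abstract: RNA secondary structure prediction is an important area of computational molecular biology. This paper reviews methods for predicting RNA secondary structure, including the mathematical model of the problem, the main algorithmic ideas, and the software corresponding to each algorithm. Several groups of samples were randomly selected from tRNA and RNase P RNA databases to test the 7 main software packages currently available, and the advantages and disadvantages of each package were compared in detail. The experiments show that, when homologous sequences are available, Pfold outperforms the other software. Finally, based on a summary and analysis of existing algorithms, directions for further research in the field are discussed.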