Research on protein structure prediction mainly focuses on finding effective features of protein sequences and developing suitable machine learning algorithms. However, few studies consider the importance of feature weights in classification. We propose the GASVM algorithm, in which the classification accuracy of a support vector machine serves as the fitness value of a genetic algorithm, to optimize the coefficients of 16 features (5 of which are proposed here for the first time) and thereby build a new feature vector. Based on this feature vector, a support vector machine with 10-fold cross-validation is used to classify the protein structures of three low-similarity datasets (25PDB, 1189, FC699). Experimental results show that the overall classification accuracy of the new method is better than that of other methods.
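The idea of using classifier accuracy as the fitness of a genetic algorithm over feature weights can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' GASVM implementation: a leave-one-out nearest-centroid classifier stands in for the SVM with 10-fold cross-validation so that the example runs with the standard library only, and the function names, GA settings, and toy data are all hypothetical.

```python
import random

def weighted_accuracy(weights, X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier on weighted features.
    Stand-in (assumption) for the SVM cross-validation accuracy used as fitness."""
    correct = 0
    for i in range(len(X)):
        sums, counts = {}, {}
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue  # hold out sample i
            acc = sums.setdefault(label, [0.0] * len(row))
            for k, v in enumerate(row):
                acc[k] += v
            counts[label] = counts.get(label, 0) + 1

        def dist(label):
            # Squared distance to the class centroid, each feature scaled by its weight.
            return sum((w * (X[i][k] - sums[label][k] / counts[label])) ** 2
                       for k, w in enumerate(weights))

        correct += min(sums, key=dist) == y[i]
    return correct / len(X)

def evolve_weights(X, y, fitness, pop_size=20, generations=15,
                   mutation_rate=0.2, seed=0):
    """Simple GA: truncation selection, one-point crossover, Gaussian mutation, elitism."""
    rng = random.Random(seed)
    n = len(X[0])
    pop = [[rng.random() for _ in range(n)] for _ in range(pop_size)]
    best, best_fit, history = None, -1.0, []
    for _ in range(generations):
        scored = sorted(((fitness(ind, X, y), ind) for ind in pop),
                        key=lambda t: t[0], reverse=True)
        if scored[0][0] > best_fit:
            best_fit, best = scored[0][0], scored[0][1][:]
        history.append(best_fit)
        parents = [ind for _, ind in scored[:pop_size // 2]]
        children = [best[:]]  # elitism: carry the best individual forward
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n) if n > 1 else 0
            child = a[:cut] + b[cut:]
            child = [min(1.0, max(0.0, g + rng.gauss(0, 0.1)))
                     if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children
    return best, best_fit, history

# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
data_rng = random.Random(1)
X = [[label + data_rng.gauss(0, 0.3), data_rng.gauss(0, 5.0)]
     for label in (0, 1) for _ in range(10)]
y = [label for label in (0, 1) for _ in range(10)]

best_weights, best_fitness, fitness_history = evolve_weights(X, y, weighted_accuracy)
print(best_weights, best_fitness)
```

To get closer to the setup described above, the fitness function could be swapped for the cross-validated accuracy of an SVM trained on the weighted features. The GA loop itself would not need to change.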
Residue-residue contacts are very important in forming protein structure. In this work, we calculated the average probability of residue-residue contacts in 470 globular proteins and analyzed the distribution of contacts over different residue intervals using the Contacts of Structural Units (CSU) software and the Structural Classification of Proteins (SCOP) database. The relationship between the average probability P_L and the residue distance L for four structural classes of proteins could be expressed as lg P_L = a + b × L, where a and b are coefficients. We also discussed the connection between two aspects of proteins which have equal array residue number and found that the distribution probability was stable (or unstable) if the proteins had the same (or different) compact (for example synthase) in the same structural class.
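The log-linear relationship lg P_L = a + b × L can be fitted by ordinary least squares on the base-10 logarithm of the contact probability. The sketch below assumes that fitting procedure (the abstract does not say how a and b were obtained), and the data are synthetic values generated from a = -1.0 and b = -0.02, not results from the paper.

```python
import math

def fit_log_linear(distances, probabilities):
    """Least-squares fit of lg(P_L) = a + b * L; returns (a, b)."""
    logs = [math.log10(p) for p in probabilities]
    n = len(distances)
    mean_l = sum(distances) / n
    mean_y = sum(logs) / n
    sxx = sum((l - mean_l) ** 2 for l in distances)
    sxy = sum((l - mean_l) * (v - mean_y) for l, v in zip(distances, logs))
    b = sxy / sxx
    a = mean_y - b * mean_l
    return a, b

# Hypothetical, noise-free data (assumption: a = -1.0, b = -0.02).
distances = list(range(5, 105, 5))
probabilities = [10 ** (-1.0 - 0.02 * l) for l in distances]

a_hat, b_hat = fit_log_linear(distances, probabilities)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```

With real contact statistics, a separate fit would be run for each of the four structural classes and the resulting (a, b) pairs compared.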
The analysis of residue-residue contacts in protein structures can shed some light on the folding and stability of proteins. In this paper, we study the statistical properties of long-range and short-range residue-residue contacts in 91 globular proteins using the CSU software and analyze the importance of long-range contacts in globular protein structure. Globular proteins contain many short-range and long-range contacts: the average number of long-range contacts per residue is 5.63, and long-range contacts account for 59.4% of all residue-residue contacts. The distribution of long-range contacts over different residue intervals is also investigated, and residues that are 4-10 positions apart in the sequence are found to contribute more long-range contacts to the stability of globular proteins. The number of long-range contacts per residue, which measures the ability to form residue-residue contacts, is calculated for each of the 20 amino acid residues. Hydrophobic residues (Leu, Val, Ile, Met, Phe, Tyr, Cys, and Trp) have a large number of long-range contacts and form them easily, while the hydrophilic amino acids (Ala, Gly, Thr, His, Glu, Gln, Asp, Asn, Lys, Ser, Arg, and Pro) form long-range contacts with more difficulty. The relationship between the Fauchere-Pliska hydrophobicity scale (FPH) and the number of short-range and long-range contacts per residue is also studied for the 20 amino acid residues. An approximately linear relationship is found between FPH and the number of long-range contacts per residue, CL, which can be expressed as CL = a + b × FPH, where a = 5.04 and b = 1.23. These results help clarify the role of residue-residue contacts in globular protein structure.
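The contact statistics and the linear relation CL = a + b × FPH can be illustrated with a short sketch. Several points here are assumptions rather than details from the abstract: contacts are given as pairs of residue indices, a contact counts as long-range when the two residues are at least 4 positions apart in the sequence, and each contact is credited to both partners when averaging per residue. The contact list and the FPH input are illustrative, not tabulated data.

```python
def split_contacts(contacts, min_separation=4):
    """Split (i, j) contacts into short-range and long-range lists.
    min_separation is an assumed threshold, not taken from the source."""
    short_range, long_range = [], []
    for i, j in contacts:
        target = long_range if abs(i - j) >= min_separation else short_range
        target.append((i, j))
    return short_range, long_range

def long_range_contacts_per_residue(contacts, n_residues, min_separation=4):
    """Average long-range contacts per residue, crediting both partners (assumption)."""
    _, long_range = split_contacts(contacts, min_separation)
    return 2 * len(long_range) / n_residues

def predicted_long_range_contacts(fph, a=5.04, b=1.23):
    """Linear relation CL = a + b * FPH with the coefficients quoted above."""
    return a + b * fph

# Hypothetical contact list for a 10-residue chain.
contacts = [(1, 2), (1, 3), (1, 5), (2, 10)]
short_range, long_range = split_contacts(contacts)
long_fraction = len(long_range) / len(contacts)
per_residue = long_range_contacts_per_residue(contacts, n_residues=10)

print(short_range, long_range, long_fraction, per_residue)
print(predicted_long_range_contacts(1.0))  # FPH = 1.0 is an illustrative input
```

Keeping the threshold as a parameter makes it easy to examine how the long-range fraction changes under a different definition of sequence separation.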
Based on the concept of pseudo amino acid composition (PseAAC), protein structural classes are predicted using increment of diversity combined with a support vector machine (ID-SVM), in which the dipeptide composition of proteins is used as the source of diversity. A jackknife test shows that the total prediction accuracy is 96.6%, higher than that of other approaches. The specificity (Sp) and the Matthews correlation coefficient (MCC) are also calculated for each protein structural class: Sp is more than 88% and MCC is higher than 92%, which implies that the ID-SVM model is a credible predictor of protein structural class. The results indicate that (1) the choice of the source of diversity is reasonable, (2) the predictive performance of ID-SVM is excellent, and (3) the amino acid sequences of proteins contain information about protein structural classes.
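The increment of diversity over dipeptide counts can be sketched as follows. The abstract does not give the formula, so this assumes the commonly used definition: for a source X with counts n_i and total N, D(X) = N ln N - Σ n_i ln n_i, and ID(X, Y) = D(X + Y) - D(X) - D(Y). The sequences and class names are hypothetical, and the SVM stage that would consume the ID values as features is not shown.

```python
import math
from collections import Counter

def dipeptide_counts(sequence):
    """Count overlapping adjacent residue pairs in a sequence."""
    return Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))

def diversity(counts):
    """D(X) = N ln N - sum(n_i ln n_i) for a source with counts n_i (assumed definition)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n)
                                         for n in counts.values() if n > 0)

def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y); small values indicate similar sources."""
    return diversity(x + y) - diversity(x) - diversity(y)

# Hypothetical reference sources for two classes and one query sequence.
class_sources = {
    "class_A": dipeptide_counts("AAAAAAAA"),
    "class_B": dipeptide_counts("ACDEACDEACDE"),
}
query = dipeptide_counts("ACDEACDE")

# One ID value per class; in an ID-SVM style pipeline these would feed the SVM.
id_features = {name: increment_of_diversity(query, source)
               for name, source in class_sources.items()}
closest_class = min(id_features, key=id_features.get)
print(id_features, closest_class)
```

In this sketch the query is simply assigned to the class with the smallest ID. The combined method described above instead passes the ID values to a support vector machine.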
Funding: Project supported by the National Natural Science Foundation of China (Nos. 29874012, 20174036, 20274040), the Natural Science Foundation of Zhejiang Province (No. 10102), and the Science Technology Development Plan of Wenzhou City (No. S2002A0…
Funding: This work was supported by the National Natural Science Foundation of China (Nos. 29874012, 20174036, and 20274040), the Natural Science Foundation of Zhejiang Province (No. 10102), and the Science Technology Development Plan of Wenzhou City (No. S2002A014).
Funding: Supported by the National Natural Science Foundation of China (No. 30660044).