Abstract
Seventy-seven thermophilic and 65 thermostable lipase sequences of microbial origin were obtained from the GenBank database. For these sequences, the amino acid composition (the frequencies of the 20 amino acids), the distributions of N neighboring amino acids (N = 2, 3; that is, dipeptide and tripeptide fragments) and the non-adjacent di-residue coupling patterns were systematically analyzed. Based on this information, a statistical method using support vector machines (SVMs) was developed to discriminate thermophilic from thermostable lipases. The results show that, in statistical terms, the hydrophobic residues Leu, Pro, Met, Phe and Trp, as well as the polar residue Tyr, occur more frequently in thermophilic lipases than in thermostable ones. The dipeptides KC, EE, KE, RE, VE, YI, EK, VK, EV, YV, EY, KY, VY and YY are significantly more frequent in thermophilic proteins than in thermostable proteins. Tripeptide frequencies and non-adjacent di-residue preferences are also significantly correlated with protein thermal stability. The dipeptide, tripeptide and non-adjacent di-residue compositions contain more information than the amino acid composition alone, and this information points to a possible thermostability mechanism of microbial lipases. The classification accuracy is 99.65% on the training dataset and 98.41% on the testing (real-data) dataset.
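The sequence features named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `gap` parameter, and the toy sequence `KEEK` are hypothetical, and the abstract does not state how non-adjacent pairs were defined, so counting ordered pairs separated by a fixed number of residues is an assumption.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def aa_composition(seq):
    """Relative frequency of each of the 20 standard amino acids."""
    counts = Counter(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def pair_composition(seq, gap=0):
    """Relative frequency of ordered residue pairs separated by `gap` residues.

    gap=0 counts adjacent dipeptides; gap>=1 counts non-adjacent pairs
    (an assumed definition of "non-adjacent di-residue coupling").
    """
    seq = seq.upper()
    step = gap + 1
    counts = Counter()
    for i in range(len(seq) - step):
        a, b = seq[i], seq[i + step]
        if a in AMINO_ACIDS and b in AMINO_ACIDS:
            counts[a + b] += 1
    total = sum(counts.values())
    return {
        a + b: (counts[a + b] / total if total else 0.0)
        for a, b in product(AMINO_ACIDS, repeat=2)
    }


def feature_vector(seq, max_gap=2):
    """Concatenate amino acid composition with pair compositions for gaps 0..max_gap."""
    comp = aa_composition(seq)
    vec = [comp[aa] for aa in AMINO_ACIDS]
    for gap in range(max_gap + 1):
        pairs = pair_composition(seq, gap)
        vec.extend(pairs[a + b] for a, b in product(AMINO_ACIDS, repeat=2))
    return vec


if __name__ == "__main__":
    toy = "KEEK"  # hypothetical toy sequence, not a real lipase
    print(aa_composition(toy)["K"])
    print(pair_composition(toy, gap=0)["EE"])
    print(len(feature_vector(toy)))
```

Fixed-length vectors of this kind could then be passed to any SVM implementation (for example scikit-learn's `SVC`) to separate the two lipase classes; the kernel and parameters used in the paper are not given in the abstract.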
Source
《中南大学学报(自然科学版)》
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2011, Issue 9, pp. 2543-2550 (8 pages)
Journal of Central South University: Science and Technology
Funding
National Natural Science Foundation of China (Grant No. 31000350)
Keywords
amino acid composition
n-peptide composition
non-adjacent di-residue coupling
protein thermostability
support vector machines