Abstract
Predicting protein thermostability from primary-structure (amino acid sequence) information is valuable for the computational screening of thermostable proteins. In this work, k-nearest neighbors (kNN) classifiers were applied to discriminate thermophilic from mesophilic proteins starting from sequence alone. Three methods, namely the self-consistency test, 5-fold cross-validation and independent testing on a separate dataset, were used to evaluate the performance and robustness of the models. When only the 20 amino acid compositions were used as feature variables, the overall accuracy reached 100%, 87.7% and 89.6%, respectively. When 8 additional variables were introduced, the overall accuracy was 100%, 89.6% and 90.2%, and the prediction accuracy for small proteins improved by 2.4%. The influence of protein size on prediction accuracy is also discussed.
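The workflow summarized in the abstract (amino acid composition as features, a kNN classifier, and evaluation by self-consistency and n-fold cross-validation) can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the Euclidean distance, the values of k, the interleaved fold assignment, the function names and the toy sequences are all assumptions made for the example. The 8 additional variables are not specified in the abstract and are therefore left out.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Return the 20-dimensional amino acid composition (fractions) of seq."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors closest to query.

    Euclidean distance is an assumption; the abstract does not name a metric.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def self_consistency(data, k=1):
    """Self-consistency test: classify the same samples used for training."""
    hits = sum(knn_predict(data, x, k) == y for x, y in data)
    return hits / len(data)


def cross_validate(data, k=3, folds=5):
    """n-fold cross-validation with simple interleaved folds (an assumption)."""
    hits = 0
    for f in range(folds):
        test = data[f::folds]
        train = [s for i, s in enumerate(data) if i % folds != f]
        hits += sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(data)


# Hypothetical toy sequences, invented purely for illustration (not real data).
TOY_SEQUENCES = [
    ("EEKKRRVVIIEEKKLL", "thermophilic"),
    ("QQNNSSTTAAQQDDGG", "mesophilic"),
    ("KEEVRRIEEKVLKEIY", "thermophilic"),
    ("NQSTAQDGNQSTAQHH", "mesophilic"),
    ("RRKEEIVVKELLEEKR", "thermophilic"),
    ("SSTTQQNNAADDGGMM", "mesophilic"),
    ("EKEKIVIVRRLLEEYY", "thermophilic"),
    ("QNQNSTSTADADGGCC", "mesophilic"),
    ("KKEERRIIVVLLYYEE", "thermophilic"),
    ("TTSSNNQQAAGGDDHM", "mesophilic"),
]
toy_data = [(composition(seq), label) for seq, label in TOY_SEQUENCES]

if __name__ == "__main__":
    print("self-consistency:", self_consistency(toy_data, k=1))
    print("5-fold cross-validation:", cross_validate(toy_data, k=3, folds=5))
```

Independent testing would reuse `knn_predict` with the full training set and a separate, held-out set of sequences. With k = 1, the self-consistency score in this sketch is perfect by construction, because every sample is its own nearest neighbor, so cross-validation and independent testing are the more informative checks here.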
Source
《计算机与应用化学》
CAS
CSCD
PKU Core Journals (北大核心)
2008, Issue 1, pp. 39-41 (3 pages)
Computers and Applied Chemistry
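Funding
National High Technology Research and Development Program of China ("863" Program), Grant No. 2006AA020102
Scientific Research Fund of the Overseas Chinese Affairs Office of the State Council (No. 05Q0018)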
Keywords
k-nearest neighbors
protein thermostability
pattern recognition
computational screening