
Prediction of the Influencing Factors of Protein Thermal Stability Using Random Forest and Feature Selection Techniques
Abstract: Thermal stability is crucial for the application of an enzyme in the food industry. In this study, a random forest algorithm was used to predict enzyme thermostability from protein sequences, and the features that most influence thermal stability were analyzed. A total of 430 features were calculated for 1600 enzymes with thermal stability information obtained from the Swiss-Prot database. Class imbalance was handled by repeated under-sampling, and the 30 most important features were selected by backward recursive feature elimination (RFE). Random forest models built on different feature subsets were compared by cross-validation and independent testing. The model built on amino acid composition alone gave the best predictive performance (accuracy = 85.83%, sensitivity = 89.16%, specificity = 73.33%, precision = 77.00%, F1-measure = 74.87%), suggesting that amino acid composition has the greatest influence on enzyme thermal stability. Thermophilic enzymes contained more glutamic acid, isoleucine, and lysine, whereas mesophilic enzymes contained more glutamine, serine, and threonine. These results provide a theoretical and methodological basis for protein engineering to improve the thermostability of enzymes used in the food industry.
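The abstract names three steps that are easy to illustrate in isolation: computing amino acid composition from a sequence, drawing repeated balanced under-samples from an imbalanced dataset, and deriving the reported metrics from a confusion matrix. The following is a minimal standard-library sketch of those ideas, not the authors' implementation. The function names, the example sequence, the choice of positive class, and the textbook metric definitions are all assumptions, since the abstract does not specify them.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue in a protein sequence.

    Hypothetical helper: non-standard symbols (e.g. X, B, Z) are ignored,
    which is an assumption; the paper's handling is not stated.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def repeated_undersample(majority, minority, n_rounds, seed=0):
    """Yield n_rounds balanced datasets.

    Each round keeps every minority example and a fresh random subset of
    the majority class of the same size (sampling without replacement).
    """
    if len(minority) > len(majority):
        raise ValueError("minority class is larger than majority class")
    rng = random.Random(seed)
    for _ in range(n_rounds):
        yield rng.sample(majority, len(minority)) + list(minority)


def classification_metrics(tp, fp, tn, fn):
    """Textbook binary metrics from confusion-matrix counts.

    These are the standard definitions; the abstract does not say which
    class is treated as positive or how its F-measure is computed.
    """
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


if __name__ == "__main__":
    # Made-up toy sequence, used only to show the call pattern.
    comp = amino_acid_composition("MKEEIKKLAEEIG")
    print({aa: round(frac, 3) for aa, frac in comp.items() if frac > 0})
    for balanced in repeated_undersample(list(range(10)), ["a", "b", "c"], 2):
        print(balanced)
    print(classification_metrics(tp=8, fp=2, tn=6, fn=4))
```

In a fuller pipeline, each balanced subset would train one classifier and the 20 composition fractions would form one of the candidate feature subsets. Feature elimination and the random forest itself are left out here because they would normally come from a machine learning library.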
Source: Modern Food Science and Technology (《现代食品科技》), 2016, Issue 7, pp. 103-108 (6 pages). Indexed in EI, CAS, and the Peking University Core Journals list.
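Funding: Liaoning Provincial Department of Education Fund (L2014001); Liaoning Provincial Department of Science and Technology Fund (2014001015, 2013225086); Shenyang Science and Technology Bureau Key Science and Technology Project (F14-154-9-00); National Natural Science Foundation of China (31570160).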
Keywords: enzyme thermostability; random forest; feature selection; amino acid composition