Abstract
Feature extraction based on amino acid composition distribution (AACD) has been reported to improve classification accuracy. We used AACD features with random forest, a recently introduced ensemble classifier, to discriminate thermophilic from psychrophilic proteins using primary structure alone. Models were evaluated by 10-fold cross-validation and by testing on an independent dataset. Accuracy was highest when the number of segments was 1, reaching 92.9% (cross-validation) and 90.2% (independent test). This suggests that AACD-based feature extraction does not effectively improve recognition accuracy with random forest, which is inconsistent with earlier reports, although the same extraction method did moderately improve accuracy when a support vector machine (SVM) was used as the classifier. When six new features were introduced, the overall accuracy of random forest rose to 93.2% and 92.2%, and the areas under the receiver operating characteristic (ROC) curve were 0.9771 and 0.9696, respectively. These results were better than those of other ensemble classifiers and comparable with those of SVM.
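The segment-wise feature extraction mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that AACD means splitting a sequence into `n_segments` contiguous, near-equal parts and concatenating the 20-residue composition of each part, so that one segment reduces to the ordinary amino acid composition. The names `split_segments` and `aacd_features` are hypothetical, and the six additional features are not specified in the abstract, so they are not reproduced here.

```python
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def split_segments(sequence, n_segments):
    """Split a sequence into n_segments contiguous, near-equal parts.

    Assumption: equal-length segmentation; the paper's exact rule may differ.
    """
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    if len(sequence) < n_segments:
        raise ValueError("sequence is shorter than the number of segments")
    length = len(sequence)
    # Integer boundaries; strictly increasing because length >= n_segments.
    bounds = [i * length // n_segments for i in range(n_segments + 1)]
    return [sequence[bounds[i]:bounds[i + 1]] for i in range(n_segments)]


def aacd_features(sequence, n_segments=1):
    """Return a flat list of 20 * n_segments composition fractions."""
    features = []
    for segment in split_segments(sequence.upper(), n_segments):
        counts = Counter(segment)
        # Only standard residues contribute to the denominator.
        total = sum(counts[aa] for aa in AMINO_ACIDS)
        for aa in AMINO_ACIDS:
            features.append(counts[aa] / total if total else 0.0)
    return features


if __name__ == "__main__":
    # Toy sequence, not real protein data.
    vector = aacd_features("AAAACCCC", n_segments=2)
    print(len(vector))
```

Under these assumptions each protein yields a vector of length 20 × `n_segments`, which could then be passed to any classifier, such as a random forest or an SVM, and scored with cross-validation.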
Source
《生物工程学报》
CAS
CSCD
PKU Core (Peking University Core Journals)
2008, Issue 2, pp. 302-308 (7 pages)
Chinese Journal of Biotechnology
Funding
National Basic Research Program of China ("973 Program") (No. 2007CB707804)
Natural Science Foundation of Fujian Province (No. 2007J0360)
Keywords
random forest, amino acid composition distribution, thermophilic and psychrophilic proteins, ROC curve