Abstract
This paper explores the prediction of β-hairpin motifs in proteins, applying the random forest and support vector machine (SVM) algorithms to β-hairpin motifs in the ArchDB40 database and in a self-built dataset. On the same dataset, with identical feature parameters and the same validation method, random forest achieves higher prediction accuracy than SVM. Moreover, because random forest does not overfit when the feature dimension is high, high-dimensional feature parameters are fed into random forest to predict β-hairpins, with good results. For β-hairpins in ArchDB40, 5-fold cross-validation gives an overall accuracy of 83.3% and a Matthews correlation coefficient of 0.59; for β-hairpins in the self-built dataset, 5-fold cross-validation gives an overall accuracy of 85.2% and a Matthews correlation coefficient of 0.62.
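The two reported measures, overall accuracy and the Matthews correlation coefficient, can both be computed from the counts of a binary confusion matrix. The following is a minimal sketch in plain Python under stated assumptions: the function names are hypothetical, the counts in the usage lines are invented for illustration and are not the paper's data, and the abstract does not say how results were combined across the five folds.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = β-hairpin, 0 = non-β-hairpin)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def overall_accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Illustrative labels only (not taken from the paper).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(overall_accuracy(*counts), matthews_cc(*counts))
```

In a 5-fold setting, one possible choice (an assumption here) is to sum the four counts over the held-out folds and then apply both functions once to the pooled totals.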
Source
Journal of Wenzhou University (Natural Science Edition)
2016, Issue 3, pp. 26-33 (8 pages)
Keywords
Random Forest Algorithm
Support Vector Machine (SVM) Algorithm
β-hairpin Motif
Increment of Diversity
Predicted Secondary Structure Information