Abstract
To identify effective methods for predicting the subcellular localization of soybean proteins under different degrees of missing data, and thereby improve prediction capability, this study used 10 000 soybean protein sequences with known subcellular localization. Missingness was introduced completely at random at rates of 5%, 10%, 15%, 20% and 30%. Four machine learning algorithms, namely support vector machine (SVM), Naive Bayes, Random Forest and Decision Tree, were used to predict the subcellular location of the sequences affected by missingness. Correlation analysis was performed between the original and predicted locations, and the accuracy and performance of the algorithms were compared. The results showed that Random Forest achieved the highest prediction accuracy, while Naive Bayes ran fastest and used the least memory. When running time and memory are not a concern and high prediction accuracy is required, Random Forest outperforms the other three algorithms; under the same conditions, when memory requirements are strict, Naive Bayes may be preferred. The results indicate the suitability of different machine learning methods for prediction tasks at different levels of missingness, and the methods can be applied to localization prediction for soybean protein data.
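The workflow described in the abstract (hide a fraction of the data completely at random, predict the hidden locations, then compare accuracy, running time and memory) can be sketched as follows. This is a minimal illustration under several assumptions, not the authors' implementation: it assumes the hidden value is the localization label, it uses residue counts as features (the abstract does not state the features), it shows only a small hand-written Naive Bayes as a stand-in for the four algorithms, and it measures time and memory with `time.perf_counter` and `tracemalloc`. All function names and the toy data are hypothetical.

```python
import math
import random
import time
import tracemalloc
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mask_labels_mcar(labels, fraction, seed=0):
    """Hide `fraction` of the labels completely at random (assumed setup)."""
    rng = random.Random(seed)
    n_hidden = round(len(labels) * fraction)
    hidden = sorted(rng.sample(range(len(labels)), n_hidden))
    hidden_set = set(hidden)
    masked = [None if i in hidden_set else y for i, y in enumerate(labels)]
    return masked, hidden

def train_naive_bayes(sequences, labels):
    """Multinomial Naive Bayes on residue counts with add-one smoothing."""
    class_counts = Counter(labels)
    residue_counts = defaultdict(Counter)
    for seq, y in zip(sequences, labels):
        residue_counts[y].update(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(class_counts.values())
    model = {}
    for y, n in class_counts.items():
        denom = sum(residue_counts[y].values()) + len(AMINO_ACIDS)
        log_like = {a: math.log((residue_counts[y][a] + 1) / denom)
                    for a in AMINO_ACIDS}
        model[y] = (math.log(n / total), log_like)
    return model

def predict_naive_bayes(model, seq):
    """Return the class with the highest log-posterior score."""
    def score(y):
        log_prior, log_like = model[y]
        return log_prior + sum(log_like[ch] for ch in seq if ch in AMINO_ACIDS)
    return max(model, key=score)

def evaluate(sequences, labels, fraction, seed=0):
    """Mask labels, train on the rest, and report accuracy, time and peak memory."""
    masked, hidden = mask_labels_mcar(labels, fraction, seed)
    train = [(s, y) for s, y in zip(sequences, masked) if y is not None]
    tracemalloc.start()
    start = time.perf_counter()
    model = train_naive_bayes([s for s, _ in train], [y for _, y in train])
    preds = [predict_naive_bayes(model, sequences[i]) for i in hidden]
    seconds = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    correct = sum(p == labels[i] for p, i in zip(preds, hidden))
    return {"fraction": fraction, "accuracy": correct / len(hidden),
            "seconds": seconds, "peak_bytes": peak}

def make_toy_data(n=200, seed=1):
    """Synthetic sequences with two made-up classes; not real soybean data."""
    rng = random.Random(seed)
    profiles = {"nucleus": "KRKRKRAG", "membrane": "LIVLIVAG"}
    seqs, labels = [], []
    for i in range(n):
        y = "nucleus" if i % 2 == 0 else "membrane"
        seqs.append("".join(rng.choice(profiles[y]) for _ in range(60)))
        labels.append(y)
    return seqs, labels

if __name__ == "__main__":
    seqs, labels = make_toy_data()
    for fraction in (0.05, 0.10, 0.15, 0.20, 0.30):
        print(evaluate(seqs, labels, fraction))
```

Swapping in another classifier only requires replacing the train and predict calls inside `evaluate`, which keeps the masking and measurement steps identical across algorithms so that the comparison stays fair.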
Authors
LI Jia-nan (李佳楠)
GAO Xing-quan (高兴泉)
LI Zhuo (李卓)
TENG Xiao-hua (滕小华)
HUANG Bin (黄斌)
ZHANG Ji-cheng (张继成)
TANG You (唐友)
Affiliations: Electrical and Information Engineering College, Jilin Agricultural Science and Technology University, Jilin 132101, China; School of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin 132000, China; College of Electronic and Information, Northeast Agricultural University, Harbin 150030, China
Source
《大豆科学》
CAS
CSCD
Peking University Core Journals
2022, Issue 3, pp. 337-344 (8 pages)
Soybean Science
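Funding
Jilin Province Distinctive High-Level Discipline, Emerging Interdisciplinary Subject "Digital Agriculture" (2018)
Jilin Province Smart Agriculture Engineering Research Center Project (2016)
National Natural Science Foundation of China (31801441)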
Keywords
Support Vector Machine algorithm
Naive Bayes algorithm
Decision Tree algorithm
Random Forest algorithm
soybean protein
missing completely at random
sequence position prediction