Abstract
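Using real medical data, this paper applies logistic regression, support vector machine, and random forest classification models to analyze the raw data and make predictions. Logistic regression and random forest are used to identify the factors with a larger influence on heart disease, such as family history and cumulative smoking, and targeted recommendations are made accordingly. Cross-validation is used to select the best kernel function and penalty coefficient for the support vector machine, yielding the optimal classification model. The classification performance of the three models is then compared: the logistic regression model has a prediction accuracy of 77.38%, and its results are highly interpretable; the support vector machine and random forest models have prediction accuracies of 78.43% and 79.21%, respectively. The results show that the nonlinear models classify better than the linear model. The support vector machine and random forest models are computationally simple and efficient, learn and predict well on high-dimensional, large data sets, and train quickly. The random forest model in addition offers interpretability and overcomes the problem of overfitting, so it has great application potential in medical diagnosis such as heart disease.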
This paper takes real medical data as its object and uses logistic regression, support vector machine, and random forest classification models to classify the data. The data are analyzed and used for prediction, and logistic regression is used to find the factors that have a greater impact on heart disease, such as familial factors and smoking, on which recommendations are then based. The optimal kernel function and penalty coefficient of the support vector machine are found by cross-validation, yielding the optimal classification model. Comparing the classification results of the three models, the prediction accuracy of the logistic regression model is 77.38%, and its results are interpretable. The prediction accuracies of the support vector machine and random forest models are 78.43% and 79.21%, respectively. Owing to the insufficient amount of data, the prediction accuracy of the logistic regression model is only slightly lower than that of the support vector machine and random forest algorithms, but the support vector machine and random forest models still have the advantages of simple calculation, high operating efficiency, strong learning and forecasting ability, and short training time. Moreover, the random forest model also takes interpretability into account, so it has great application potential in the diagnosis of heart disease and in other medical diagnosis.
Author
张冰洁
Zhang Bingjie (School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan 430073, China)
Source
《中南财经政法大学研究生学报》
2017, No. 6, pp. 18-26 (9 pages)
Journal of the Postgraduate of Zhongnan University of Economics and Law