Abstract
This paper explores the application of machine learning algorithms to risk prediction models for fatty liver disease, to provide a reference for health management and risk assessment in populations prone to fatty liver. The study subjects were people who received regular physical examinations at the health examination center of the General Hospital of the Western Theater Command (formerly the General Hospital of the Chengdu Military Command) from 2006 to 2016; the center maintains dedicated software for managing this population's examination data. A longitudinal cohort was built from the examination data. To improve the model's predictive accuracy and efficiency, Logistic regression was first used to eliminate features with no significant influence, and fatty liver prediction models were then built with six machine learning algorithms: decision tree, XGBoost, Bagging, random forest, artificial neural network, and support vector machine. Over the 11 consecutive years, the hospital recorded 24,106 examination visits in total, and the detection rate of fatty liver rose year by year in recent years; count data were analyzed with the chi-square test for trend (χ² = 228.71, P < 0.001). In the simulation experiment, the XGBoost ensemble algorithm had the largest F-measure and the smallest standard mean square error. In the case analysis, the fatty liver prediction model built with XGBoost had an area under the ROC curve of 0.958, a recall of 0.790, a precision of 0.761, and an accuracy of 0.898, all higher than those of the other machine learning algorithms. The area under the ROC curve of the traditional Logistic regression model was 0.732, far smaller than that of the XGBoost model, and the difference between the two was statistically significant (P < 0.05). The XGBoost ensemble algorithm therefore yields a fatty liver prediction model with better predictive performance.
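For readers less familiar with the evaluation measures named in the abstract, the following is a minimal sketch of how recall, precision, accuracy, and the F-measure relate to confusion-matrix counts. It is not the authors' code: the function names `classification_metrics` and `f_measure` and the example counts are hypothetical, and the balanced F-measure (beta = 1) is an assumption, since the abstract does not state which weighting was used.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


def classification_metrics(tp, fp, fn, tn):
    """Compute recall, precision, accuracy, and F1 from confusion-matrix counts.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives, and true negatives for the "fatty liver" class.
    """
    total = tp + fp + fn + tn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "accuracy": accuracy,
        "f_measure": f_measure(precision, recall),
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
print(classification_metrics(tp=8, fp=2, fn=2, tn=88))

# F1 implied by the precision and recall quoted in the abstract
# (a derived value, not one reported by the authors).
print(round(f_measure(0.761, 0.790), 3))
```

Under the beta = 1 assumption, the quoted precision (0.761) and recall (0.790) imply an F-measure of about 0.775; the abstract itself does not report this figure for the case analysis.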
Authors
雷丽
李运明
LEI Li; LI Yunming (College of Mathematics, Southwest Jiaotong University, Chengdu 611756, China; Department of Medical Management, Division of Health Services, The General Hospital of Western Theater Command, Chengdu 610083, China)
Source
Journal of Gansu Sciences (《甘肃科学学报》)
2022, No. 3, pp. 16-20, 37 (6 pages)
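Funding
Military Medical Science and Technology Youth Cultivation Project (17QNP047)
Military Medical Research Project of the General Hospital of Western Theater Command (2019ZY10, 2019ZY04)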