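Abstract
Objective: To compare disease prediction models built with different machine learning algorithms and to assess the importance of risk factors for coronary atherosclerotic heart disease (coronary artery disease, CAD). Methods: People who attended Peking University Shougang Hospital for health examinations from May to July 2014 were enrolled, comprising 345 patients with CAD and 2368 controls without CAD. General characteristics and examination data were collected and analyzed with the chi-square test. Two machine learning algorithms, logistic regression (LR) and random forest (RF), were then used to build disease prediction models, and the hyperparameters were optimized by grid search with cross-validation to improve overall model performance. Results: Age, body mass index (BMI), systolic blood pressure (SBP), diastolic blood pressure (DBP), pulse wave velocity (PWV), serum creatinine (Scr), and fasting plasma glucose (FPG) were significantly higher in the CAD group than in the control group (all P<0.01). The durations of diabetes, hypertension, and hyperlipidemia (P<0.01), carotid ultrasound findings, and fatigue frequency were worse in the CAD group than in the control group, while low-density lipoprotein cholesterol (LDL-C, P<0.01), high-density lipoprotein cholesterol (HDL-C, P<0.01), and daily sleep duration (P<0.01) were significantly lower. All of these differences were statistically significant. In the CAD risk prediction models built with LR and RF, AUC and specificity were high but sensitivity was very low before hyperparameter tuning; after tuning, AUC and specificity changed little while sensitivity rose substantially, which greatly improved the clinical value of the models. The risk factors ranked in the top ten by importance in both models were age, fatigue frequency, Scr, PWV, arterial ultrasound, and duration of hyperlipidemia. Conclusion: Age, fatigue frequency, Scr, PWV, positive arterial ultrasound, and duration of hyperlipidemia strongly influence the construction of the CAD prediction model. Clinical prediction models can be built from class-imbalanced disease datasets using machine learning algorithms with hyperparameter tuning.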
Objective: To explore the importance of coronary artery disease (CAD) risk factors and construct a CAD prediction model. Methods: A total of 2713 participants were enrolled from May to July 2014 at Peking University Shougang Hospital, including 345 CAD cases and 2368 controls. Logistic regression (LR) and random forest (RF) were used to construct the prediction models, and the hyperparameters were optimized using grid search with cross-validation. Results: In the CAD group, age, BMI, blood pressure (BP), pulse wave velocity (PWV), serum creatinine (Scr), and fasting plasma glucose (FPG) levels were higher than in the control group (P<0.01). The durations of diabetes, hypertension, and hyperlipidemia (P<0.01), arterial ultrasound findings (P<0.01), and fatigue (P<0.01) were worse in the CAD group. Low-density lipoprotein cholesterol (LDL-C, P<0.01), high-density lipoprotein cholesterol (HDL-C, P<0.01), and sleep duration (P<0.01) were lower in the CAD group. Before hyperparameter tuning, the LR and RF models had a high area under the curve (AUC) and high specificity but low sensitivity. After tuning, the AUC changed little and the sensitivity increased substantially. The risk factors ranked in the top 10 by importance in both CAD risk models were age, duration of hyperlipidemia, duration of diabetes, fatigue, Scr, and PWV. Conclusion: Age, fatigue frequency, Scr, PWV, positive arterial ultrasound, and duration of hyperlipidemia strongly influence the construction of CAD prediction models. Clinical prediction models can be built from class-imbalanced disease datasets using hyperparameter-tuned machine learning algorithms.
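The workflow described in the Methods (LR and RF models, tuned by grid search with cross-validation on a class-imbalanced dataset) can be illustrated roughly as follows. This is only a minimal sketch using scikit-learn, not the authors' code: the data are synthetic and merely mimic the 345/2368 class ratio, and the parameter grids, the `class_weight` option, the `balanced_accuracy` scoring metric, and the helper name `evaluate` are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the cohort (NOT the study data):
# 2713 rows with roughly the same share of controls (2368 / 2713).
X, y = make_classification(n_samples=2713, n_features=20, n_informative=6,
                           weights=[2368 / 2713], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)


def evaluate(model):
    """Return AUC, sensitivity and specificity on the held-out split."""
    prob = model.predict_proba(X_test)[:, 1]
    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
    return {"auc": float(roc_auc_score(y_test, prob)),
            "sensitivity": float(tp / (tp + fn)),
            "specificity": float(tn / (tn + fp))}


# Hypothetical grids; the abstract does not state which hyperparameters were tuned.
candidates = {
    "LR": (LogisticRegression(max_iter=1000),
           {"C": [0.1, 1.0, 10.0], "class_weight": [None, "balanced"]}),
    "RF": (RandomForestClassifier(n_estimators=100, random_state=0),
           {"max_depth": [4, 8], "class_weight": [None, "balanced"]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # Baseline: default hyperparameters, no handling of class imbalance.
    baseline = evaluate(estimator.fit(X_train, y_train))
    # Tuned: grid search with 5-fold cross-validation on the training split.
    search = GridSearchCV(estimator, grid, scoring="balanced_accuracy", cv=5)
    search.fit(X_train, y_train)
    results[name] = {"baseline": baseline,
                     "tuned": evaluate(search.best_estimator_),
                     "best_params": search.best_params_}

for name, res in results.items():
    print(name, res)
```

Comparing the `baseline` and `tuned` entries for each model shows how sensitivity and specificity shift when the search is scored on a metric that accounts for the minority class. The numbers from synthetic data say nothing about the study's actual results.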
Authors
王成龙
何新叶
李燕奇
王舒
东黎光
王淑玉
张红叶
张宇清
周宪梁
刘力生
胡爱华
Wang Chenglong; He Xinye; Li Yanqi; Wang Shu; Dong Liguang; Wang Shuyu; Zhang Hongye; Zhang Yuqing; Zhou Xianliang; Liu Lisheng; Hu Aihua (Pediatric Chronic Disease Management Center, Beijing Children's Hospital, Capital Medical University, National Center for Children's Health, Beijing 100045, China; affiliation not specified)
Source
《中国循证心血管医学杂志》
2023, Issue 12, pp. 1334-1337 (4 pages)
Chinese Journal of Evidence-Based Cardiovascular Medicine
Keywords
Coronary artery disease (CAD)
Risk factors
Hyperparameter tuning
Prediction model