Machine Learning Models for Predicting In-hospital Death in Patients with Acute New Ischemic Stroke Based on Unbalanced Data


Abstract

Objective: To explore the value of machine learning for predicting in-hospital death from unbalanced data in patients with acute new ischemic stroke, and to compare the predictive performance of machine learning models with that of a traditional logistic model.

Methods: Patients with acute new ischemic stroke from the multi-center registry database of the Chinese Stroke Center Alliance (CSCA) were selected. Prediction models of in-hospital death were constructed with machine learning methods [XGBoost, CatBoost, random forest, and support vector machine (SVM)] and with the traditional logistic method. The data were randomly divided in a 7:3 ratio into a training set (to construct the prediction models) and a test set (to evaluate them). The unbalanced death outcome was handled with undersampling and balanced weighting. Models were evaluated by discrimination, measured as the area under the ROC curve (AUC), and by calibration, measured as the Brier score.

Results: A total of 601 466 patients with acute new ischemic stroke were included, of whom 231 235 were female (38.45%) and 2206 died in hospital (0.37%). The AUCs of the logistic, XGBoost, CatBoost, random forest, and SVM models for predicting in-hospital death were 0.913±0.000, 0.921±0.000, 0.919±0.001, 0.925±0.000, and 0.900±0.001, respectively. The XGBoost model (P = 0.0002), CatBoost model (P = 0.0094), and random forest model (P < 0.0001) had better predictive performance than the logistic model, and the logistic model performed better than the SVM model (P = 0.0029). The Brier scores of the logistic, XGBoost, CatBoost, random forest, and SVM models were 0.115±0.001, 0.096±0.001, 0.093±0.001, 0.084±0.000, and 0.045±0.001, respectively. The calibration of every machine learning model was better than that of the logistic model, and all differences were statistically significant.

Conclusions: After balancing the data, the machine learning models and the traditional logistic model all showed good and stable performance in predicting the risk of in-hospital death in patients with acute new ischemic stroke. Among them, the random forest model had the best predictive performance and the SVM model had the best calibration.
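The evaluation workflow described in the Methods (7:3 split, balancing by undersampling or class weights, then AUC and Brier score on the test set) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not name the software, the hyperparameters, or which balancing method was paired with which model, so scikit-learn, the synthetic data, the stratified split, and the helper names `undersample` and `evaluate` are all hypothetical choices made here. The Brier score is the mean squared difference between the predicted probability and the observed 0/1 outcome, so lower values are better.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split


def undersample(X, y, rng):
    """Randomly keep equal numbers of positive and negative rows (hypothetical helper)."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n = min(len(pos), len(neg))
    keep = np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]


def evaluate(model, X_test, y_test):
    """Return (AUC, Brier score) from predicted positive-class probabilities."""
    p = model.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, p), brier_score_loss(y_test, p)


# Synthetic stand-in for the registry data (assumption: roughly 2% positives).
X, y = make_classification(n_samples=20000, n_features=10, n_informative=5,
                           weights=[0.98, 0.02], random_state=0)

# 7:3 split; stratifying on the outcome is an assumption, not stated in the abstract.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

rng = np.random.default_rng(0)

# Option A: balanced class weights on the full training set.
logit = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Option B: undersample the majority class in the training set, then fit.
X_bal, y_bal = undersample(X_tr, y_tr, rng)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_bal, y_bal)

results = {
    "logistic": evaluate(logit, X_te, y_te),
    "random_forest": evaluate(forest, X_te, y_te),
}
for name, (auc, brier) in results.items():
    print(f"{name}: AUC={auc:.3f}, Brier={brier:.3f}")
```

Balancing is applied to the training set only, so the test set keeps the original event rate and both metrics are computed on unmodified data.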

Authors: CHEN Si-Ding; GU Hong-Qiu; HUANG Xin-Ying; LIU Huan; JIANG Yong; WANG Yong-Jun

Affiliations: China National Clinical Research Center for Neurological Diseases, Beijing 100070, China; Beijing Advanced Innovation Center for Big Data-Based Precision Medicine (Beihang University & Capital Medical University), Beijing 100091, China; Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing 100070, China; Stroke Research Institute, Beijing Institute for Brain Disorders, Beijing 100070, China

Source: Chinese Journal of Stroke (《中国卒中杂志》), 2021, No. 8, pp. 779-786 (8 pages)

Funding: National Key R&D Program of China during the 13th Five-Year Plan period (2016YFC0901001).

Keywords: Ischemic stroke; In-hospital death; Prediction model; Machine learning

Classification: R743.3 [Medicine and Health: Neurology and Psychiatry]; TP181 [Automation and Computer Technology: Control Theory and Control Engineering]


