Abstract
Objective To explore the use of machine learning on imbalanced data to predict the risk of in-hospital death in patients with new-onset acute ischemic stroke, and to compare the predictive performance of machine learning models with that of a traditional logistic model.
Methods Patients with new-onset acute ischemic stroke in the multicenter registry database of the Chinese Stroke Center Alliance (CSCA) were studied. Prediction models of in-hospital death were built with machine learning methods [XGBoost, CatBoost, random forest and support vector machine (SVM)] and with the traditional logistic method. The data were randomly divided in a 7:3 ratio into a training set (used to build the models) and a test set (used to evaluate them). The imbalance in the death outcome was handled with undersampling and balanced class weights. Models were evaluated by discrimination, measured as the area under the ROC curve (AUC), and by calibration, measured as the Brier score.
Results A total of 601 466 patients with new-onset acute ischemic stroke were included, of whom 231 235 (38.45%) were female and 2206 (0.37%) died in hospital. The AUCs of the logistic, XGBoost, CatBoost, random forest and SVM models for predicting in-hospital death were 0.913±0.000, 0.921±0.000, 0.919±0.001, 0.925±0.000 and 0.900±0.001, respectively. The XGBoost (P=0.0002), CatBoost (P=0.0094) and random forest (P<0.0001) models outperformed the logistic model, and the logistic model outperformed the SVM model (P=0.0029). The Brier scores of the logistic, XGBoost, CatBoost, random forest and SVM models were 0.115±0.001, 0.096±0.001, 0.093±0.001, 0.084±0.000 and 0.045±0.001, respectively. All machine learning models were better calibrated than the logistic model, and all of these differences were statistically significant.
Conclusions After the data were balanced, both the machine learning models and the traditional logistic model predicted the risk of in-hospital death in patients with new-onset acute ischemic stroke well and stably. Among them, the random forest model had the best predictive performance and the SVM model had the best calibration.
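The imbalance handling and the two evaluation measures named in the Methods can be illustrated with a minimal standard-library sketch. It is not the authors' code: the helper names (`undersample`, `balanced_weights`, `brier_score`, `auc`), the 1:1 target ratio and the `n / (k * n_c)` weighting rule are assumptions chosen for illustration, and the abstract does not state which settings were actually used.

```python
import random


def undersample(rows, labels, ratio=1.0, seed=0):
    """Randomly keep at most `ratio` negatives (label 0) per positive (label 1).

    Hypothetical helper; the ratio used in the study is not given in the abstract.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep_neg = rng.sample(neg, min(len(neg), int(ratio * len(pos))))
    keep = sorted(pos + keep_neg)
    return [rows[i] for i in keep], [labels[i] for i in keep]


def balanced_weights(labels):
    """Class weights n / (k * n_c): rarer classes get proportionally larger weights."""
    n = len(labels)
    classes = sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}


def brier_score(labels, probs):
    """Mean squared difference between predicted probability and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def auc(labels, probs):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data only; not taken from the registry.
    y = [0, 0, 1, 1]
    p = [0.1, 0.4, 0.35, 0.8]
    print("AUC:", auc(y, p))
    print("Brier score:", brier_score(y, p))
    print("Weights:", balanced_weights([0, 0, 0, 1]))
```

A higher AUC indicates better discrimination and a lower Brier score indicates predicted probabilities closer to the observed outcomes, which is how the Results above rank the models.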
Authors
陈思玎
谷鸿秋
黄馨莹
刘欢
姜勇
王拥军
CHEN Si-Ding; GU Hong-Qiu; HUANG Xin-Ying; LIU Huan; JIANG Yong; WANG Yong-Jun (China National Clinical Research Center for Neurological Diseases, Beijing 100070, China; Beijing Advanced Innovation Center for Big Data-Based Precision Medicine (Beihang University & Capital Medical University), Beijing 100091, China; Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing 100070, China; Beijing Institute for Brain Disorders, Beijing 100070, China)
Source
《中国卒中杂志》
2021, Issue 8, pp. 779-786 (8 pages)
Chinese Journal of Stroke
Funding
National Key R&D Program of China, 13th Five-Year Plan (2016YFC0901001).
Keywords
Ischemic stroke
In-hospital death
Prediction model
Machine learning