Abstract
Missing data are widespread in medical datasets and make clinical prediction models harder to build. To address the problem that features of important medical value are often discarded because of their high missing rates, a weighted K-nearest-neighbor imputation algorithm based on mutual information (MIW-KNN) is proposed. First, on a dataset of patients with heart failure complicated by Clostridioides difficile infection, MIW-KNN is compared with multiple imputation, K-nearest-neighbor (KNN) imputation, mean imputation, and other methods to verify its effectiveness. Second, the mortality-risk prediction performance of different models is compared to verify the performance advantage of the proposed method. Using 20 features selected by univariate analysis, prediction models are built with nine machine learning algorithms. The area under the receiver operating characteristic curve (AUC) and accuracy serve as the main metrics for evaluating model performance, and SHAP (Shapley Additive Explanations) is used to interpret the influence of individual clinical features on the model. The results show that MIW-KNN achieves the highest imputation accuracy, and that the random forest model built on the dataset imputed with it achieves the best prediction performance, with an AUC of 0.841 and an accuracy of 0.821. SHAP indicates that red blood cell distribution width, crystalloid infusion, and white blood cell count are the three most influential features.
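The abstract names the MIW-KNN idea but does not give its formulas, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each feature's contribution to the neighbor distance is weighted by its mutual information with the feature being imputed, that mutual information is estimated from equal-width histograms, and that features have already been scaled to comparable ranges. The function names, the `bins` and `eps` parameters, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def _bin(values, bins):
    """Assign each value to one of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y, bins=4):
    """Histogram estimate of I(X;Y) in nats (assumed estimator)."""
    bx, by = _bin(x, bins), _bin(y, bins)
    n = len(x)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(c / n * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def miw_knn_impute(rows, k=3, bins=4, eps=1e-9):
    """Fill None entries using a mutual-information-weighted KNN average.

    rows: list of equal-length lists of floats, with None marking a gap.
    Returns a filled copy; the input is left unchanged.
    """
    n_cols = len(rows[0])
    filled = [list(r) for r in rows]
    for j in range(n_cols):
        if all(r[j] is not None for r in rows):
            continue  # nothing to impute in this column
        donors = [r for r in rows if r[j] is not None]
        # Weight of feature i = MI between feature i and target feature j,
        # estimated on rows where both are observed (assumption).
        weights = {}
        for i in range(n_cols):
            if i == j:
                continue
            pairs = [(r[i], r[j]) for r in donors if r[i] is not None]
            mi = (mutual_information([p[0] for p in pairs],
                                     [p[1] for p in pairs], bins)
                  if len(pairs) > 1 else 0.0)
            weights[i] = max(mi, 0.0) + eps  # eps keeps the weight sum positive
        for idx, row in enumerate(rows):
            if row[j] is not None:
                continue
            scored = []
            for d in donors:
                shared = [i for i in weights
                          if row[i] is not None and d[i] is not None]
                if not shared:
                    continue
                w_sum = sum(weights[i] for i in shared)
                # Weighted Euclidean distance over jointly observed features.
                dist = math.sqrt(sum(weights[i] * (row[i] - d[i]) ** 2
                                     for i in shared) / w_sum)
                scored.append((dist, d[j]))
            scored.sort(key=lambda t: t[0])
            nearest = [v for _, v in scored[:k]]
            if nearest:
                filled[idx][j] = sum(nearest) / len(nearest)
    return filled


# Toy data (hypothetical): column 2 is twice column 0, column 1 is constant.
demo = [[float(x), 1.0, 2.0 * x] for x in range(10)]
demo.append([4.1, 1.0, None])
filled = miw_knn_impute(demo, k=2)
```

In this toy example the constant column carries no information about the target and receives a near-zero weight, so the gap in the last row is filled from the two rows closest in column 0 (values 8.0 and 10.0), giving 9.0. A plain KNN imputer would instead weight all features equally, which is the contrast the mutual-information weighting is meant to address.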
Authors
SHI Caiping (石彩萍)
HU Zhen (胡真)
School of Mathematics, Hohai University, Nanjing 211100, China
Source
Henan Science (《河南科学》)
2024, Issue 10, pp. 1422-1433 (12 pages)
Funding
Supported by the Fundamental Research Funds for the Central Universities (B200202002).
Keywords
KNN algorithm
mutual information
mortality prediction
machine learning
heart failure
Clostridioides difficile infection