Abstract
Missing data in medical datasets degrades classifier performance and harms downstream tasks. To address this, we use missing forest (MissForest) imputation to fill in missing values in medical datasets. The method first trains a random forest model on the observed values of the complete data in the dataset, then uses the trained model to predict the missing values, and repeats this process iteratively until the imputation is complete. In tests on two medical datasets, missing forest imputation yields lower error under the NRMSE (Normalized Root Mean Squared Error) and PFC (Proportion of Falsely Classified) metrics and imputes better than K-nearest neighbor imputation, multiple imputation, and GAIN (Generative Adversarial Imputation Nets) imputation. In addition, an analysis of the dose-response relationship between alanine aminotransferase (ALT) and diabetes on a diabetes dataset demonstrates the stability of missing forest imputation.
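The train, predict, and repeat loop described in the abstract, together with the two evaluation metrics, can be illustrated with a minimal sketch. It is not the authors' implementation, which the abstract does not describe. It assumes NumPy and scikit-learn's `RandomForestRegressor`, handles continuous variables only, and assumes every column has at least one observed value. The names `missforest_impute`, `nrmse`, and `pfc`, their parameters, and the synthetic data are hypothetical. The stopping rule and the NRMSE normalisation (variance of the true values at the missing positions) follow the convention commonly used with missing forest imputation and are assumptions here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def missforest_impute(X, max_iter=10, n_estimators=100, random_state=0):
    """Iteratively impute NaNs in a continuous 2-D array with random forests."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: fill each column with the mean of its observed values.
    X_imp = np.where(missing, np.nanmean(X, axis=0), X)
    # Visit columns with missing values, fewest missing entries first.
    order = [j for j in np.argsort(missing.sum(axis=0)) if missing[:, j].any()]
    prev_diff = np.inf
    for _ in range(max_iter):
        X_old = X_imp.copy()
        for j in order:
            obs, mis = ~missing[:, j], missing[:, j]
            others = np.delete(X_imp, j, axis=1)
            rf = RandomForestRegressor(
                n_estimators=n_estimators, random_state=random_state
            )
            # Train on rows where column j is observed ...
            rf.fit(others[obs], X_imp[obs, j])
            # ... and predict the rows where it is missing.
            X_imp[mis, j] = rf.predict(others[mis])
        # Assumed stopping rule: stop when the relative change between
        # successive imputations increases, and keep the previous result.
        diff = np.sum((X_imp - X_old) ** 2) / np.sum(X_imp ** 2)
        if diff > prev_diff:
            return X_old
        prev_diff = diff
    return X_imp


def nrmse(X_true, X_imp, missing):
    """Normalized root mean squared error over the originally missing entries."""
    err = X_true[missing] - X_imp[missing]
    return float(np.sqrt(np.mean(err ** 2) / np.var(X_true[missing])))


def pfc(X_true, X_imp, missing):
    """Proportion of falsely classified categorical entries among missing ones."""
    return float(np.mean(X_true[missing] != X_imp[missing]))


# Hypothetical synthetic data: the second column depends on the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X_true = np.column_stack(
    [x1, 2 * x1 + rng.normal(scale=0.1, size=200), rng.normal(size=200)]
)
mask = rng.random(X_true.shape) < 0.1  # remove about 10% of entries at random
X_missing = X_true.copy()
X_missing[mask] = np.nan

X_filled = missforest_impute(X_missing, n_estimators=50)
print("NRMSE:", nrmse(X_true, X_filled, mask))
```

Lower NRMSE and PFC values indicate imputations closer to the held-out true values. Extending the sketch to categorical columns would require a classifier for those columns, which is omitted here.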
Authors
白洪涛
栾雪
何丽莉
毕亚茹
张婷婷
孙成林
BAI Hongtao; LUAN Xue; HE Lili; BI Yaru; ZHANG Tingting; SUN Chenglin (College of Software, Jilin University, Changchun 130022, China; College of Computer Science and Technology, Jilin University, Changchun 130022, China; First Hospital, Jilin University, Changchun 130012, China)
Source
《吉林大学学报(信息科学版)》
CAS
2022, No. 4, pp. 616-620 (5 pages)
Journal of Jilin University (Information Science Edition)
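Funding
National Key Research and Development Program of China (2017YFC1309805)
Natural Science Foundation of the Science and Technology Department of Jilin Province (20210101181JC).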
Keywords
missing data imputation
missing forest imputation
big data
dose-response relationship between alanine aminotransferase (ALT) and diabetes