
Missing Value Interpolation for Medical Big Data Based on Missing Forest (cited by: 7)
Abstract: Missing data in medical datasets degrade classifier performance and harm downstream tasks. To address this, the missing forest imputation method is used to impute the missing values in medical datasets. The method first trains a random forest model on the observed values of the complete data in the dataset, then uses the trained model to predict the missing data, and repeats this process iteratively until the missing values are filled in. In tests on two medical datasets, according to the NRMSE (Normalized Root Mean Squared Error) and PFC (Proportion of Falsely Classified) evaluation metrics, missing forest imputation has lower error and imputes better than K-nearest neighbor imputation, multiple imputation, and GAIN (Generative Adversarial Imputation Nets) imputation. In addition, the stability of missing forest imputation is demonstrated on a diabetes dataset by analyzing the dose-response relationship between alanine aminotransferase (ALT) and diabetes.
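The two evaluation metrics named in the abstract can be sketched as follows. The abstract does not spell out their formulas, so this sketch assumes the definitions commonly used with missing forest imputation: NRMSE as the root of the mean squared error divided by the variance of the true values, and PFC as the share of categorical entries imputed with the wrong label, both computed only over the entries that were held out as missing. The function names, the use of population variance, and the sample values are illustrative assumptions, not taken from the paper.

```python
import math


def nrmse(true_vals, imputed_vals):
    """Normalized RMSE over held-out continuous entries.

    Assumed definition: sqrt(mean((true - imputed)^2) / var(true)),
    with var(true) taken as the population variance.
    """
    n = len(true_vals)
    mse = sum((t - p) ** 2 for t, p in zip(true_vals, imputed_vals)) / n
    mean_true = sum(true_vals) / n
    var_true = sum((t - mean_true) ** 2 for t in true_vals) / n
    return math.sqrt(mse / var_true)


def pfc(true_labels, imputed_labels):
    """Proportion of held-out categorical entries imputed with a wrong label."""
    n = len(true_labels)
    wrong = sum(1 for t, p in zip(true_labels, imputed_labels) if t != p)
    return wrong / n


# Hypothetical held-out values (not data from the paper).
alt_true = [20.0, 30.0, 40.0]
alt_imputed = [22.0, 30.0, 38.0]
label_true = ["yes", "no", "no", "yes"]
label_imputed = ["yes", "no", "yes", "yes"]

print(round(nrmse(alt_true, alt_imputed), 3))
print(pfc(label_true, label_imputed))
```

For both metrics, lower is better: a value near 0 means the imputed entries closely match the held-out truth.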
Authors: BAI Hongtao (白洪涛); LUAN Xue (栾雪); HE Lili (何丽莉); BI Yaru (毕亚茹); ZHANG Tingting (张婷婷); SUN Chenglin (孙成林) (College of Software, Jilin University, Changchun 130022, China; College of Computer Science and Technology, Jilin University, Changchun 130022, China; First Hospital, Jilin University, Changchun 130012, China)
Source: Journal of Jilin University (Information Science Edition) (《吉林大学学报(信息科学版)》), CAS, 2022, Issue 4, pp. 616-620 (5 pages)
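Funding: National Key Research and Development Program of China (2017YFC1309805); Natural Science Foundation of the Jilin Provincial Department of Science and Technology (20210101181JC).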
Keywords: missing data imputation; missing forest imputation; big data; alanine aminotransferase (ALT) and diabetes dose-response
