
Integrated Imputation Method of Incomplete Data in Spark
Abstract: Most existing imputation methods for incomplete data handle only a single type of missing variable and perform relatively poorly on large-scale data. To address missing mixed-type variables in real-world big data, this paper proposes a new model, SXGBI (Spark-based eXtreme Gradient Boosting Imputation), which imputes incomplete data containing both continuous and categorical missing variables and generalizes to very large datasets. By extending the ensemble learning method XGBoost, SXGBI combines multiple imputation algorithms into an integrated learner; its parallel design on the Spark distributed computing framework allows it to run efficiently on a Spark cluster. Experiments show that, as the missing rate increases, SXGBI achieves better imputation results on the RMSE, PFC, and F1 metrics than the other imputation methods compared, and that it can be applied effectively to large-scale datasets.
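The abstract describes training boosted models for both continuous and categorical incomplete columns. Below is a minimal single-machine sketch of that general idea using XGBoost's scikit-learn API: one regressor or classifier per incomplete column, fitted on the observed rows and used to predict the missing ones. The `impute_mixed` helper, the initial median/mode fill, and the single pass over columns are illustrative assumptions; the paper's actual SXGBI ensemble and its Spark parallelization are not reproduced here.

```python
# Illustrative single-machine sketch of mixed-type imputation with XGBoost.
# The column-wise scheme (one boosted model per incomplete column,
# regression for continuous targets, classification for categorical ones)
# follows the general idea in the abstract; the initial median/mode fill and
# the single pass over columns are simplifying assumptions, and the paper's
# Spark-parallel SXGBI design is not reproduced here.
import numpy as np
import pandas as pd
from xgboost import XGBClassifier, XGBRegressor

def impute_mixed(df: pd.DataFrame, categorical_cols: set) -> pd.DataFrame:
    out = df.copy()
    # Rough initial fill so that predictor columns are complete.
    for col in out.columns:
        if col in categorical_cols:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
        else:
            out[col] = out[col].fillna(out[col].median())

    for col in df.columns:
        missing = df[col].isna()
        if not missing.any():
            continue
        # One-hot encode the remaining columns so they can serve as features.
        X = pd.get_dummies(out.drop(columns=[col])).astype(float)
        if col in categorical_cols:
            codes, classes = pd.factorize(df.loc[~missing, col])
            model = XGBClassifier(n_estimators=100, max_depth=4)
            model.fit(X[~missing], codes)
            preds = model.predict(X[missing]).astype(int)
            out.loc[missing, col] = np.asarray(classes)[preds]
        else:
            model = XGBRegressor(n_estimators=100, max_depth=4)
            model.fit(X[~missing], df.loc[~missing, col])
            out.loc[missing, col] = model.predict(X[missing])
    return out
```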
Authors: ZOU Meng-ping, PENG Dun-lu (School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China)
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》, CSCD, Peking University Core Journal), 2021, No. 1, pp. 111-116 (6 pages)
Funding: Supported by the National Natural Science Foundation of China (61772342, 61703278).
Keywords: Spark; XGBoost; incomplete data imputation; mixed-type variables
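For reference, the two imputation-quality measures named in the abstract can be computed as in the sketch below, assuming they are evaluated only over the cells that were held out as missing: RMSE for continuous cells and PFC (proportion of falsely classified entries) for categorical cells. F1 for the categorical cells would come from a standard routine such as sklearn.metrics.f1_score.

```python
# Minimal sketch of the imputation-quality metrics named in the abstract,
# computed only over the held-out (masked) cells: RMSE for continuous cells
# and PFC (proportion of falsely classified entries) for categorical cells.
import numpy as np

def rmse(true_values: np.ndarray, imputed_values: np.ndarray) -> float:
    # Root mean squared error over the imputed continuous cells.
    return float(np.sqrt(np.mean((true_values - imputed_values) ** 2)))

def pfc(true_labels: np.ndarray, imputed_labels: np.ndarray) -> float:
    # Fraction of imputed categorical cells assigned the wrong class.
    return float(np.mean(true_labels != imputed_labels))
```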