Incomplete Data Multiple Precision Filling Simulation under Big Data Analysis

Cited by: 3

Abstract: Imputing incomplete data in big data analysis can effectively improve data utilization. Multiple accurate imputation of incomplete data requires computing the mean and standard deviation of the vector attributes of all data and applying them repeatedly in the imputation simulation. Traditional methods consider the relationships among the imputation variables and fill in missing values according to their correlation with the data to be imputed, but they neglect to compute the standard deviation of all data, which results in low imputation efficiency. This paper proposes a logistic-based method for multiple accurate imputation of incomplete data in big data analysis. First, the mean and standard deviation of the attribute values of all data vectors are calculated, and the mean vector and covariance function of the data are obtained by estimation. The missing values of each observed object are then simulated and imputed independently, and a logistic regression model is used to select the imputed values for each variable with missing values, yielding complete data. Next, the mean vector and covariance function of the data are re-estimated and applied again in the simulation of incomplete data imputation. This process is iterated until the iteration condition is reached, and the multiple imputation results for the incomplete data are output. Experiments indicate that the proposed method achieves high imputation efficiency and can lay a foundation for further research in this field.
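The iterative loop described in the abstract (estimate the mean vector and covariance, simulate each object's missing values independently, re-estimate, repeat) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a multivariate normal model, uses a fixed iteration count as a stand-in for the unspecified iteration condition, and leaves out the logistic regression selection step because the abstract does not detail it. The names `impute_once` and `multiple_impute` are hypothetical.

```python
import numpy as np

def impute_once(X, n_iter=20, seed=0):
    """Return one completed copy of X, where NaN marks a missing entry."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    # Start from the column means of the observed entries.
    filled = np.where(miss, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):  # assumption: fixed count replaces the stopping rule
        # Re-estimate the mean vector and covariance from the completed data.
        mu = filled.mean(axis=0)
        cov = np.cov(filled, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        for i in np.flatnonzero(miss.any(axis=1)):
            m, o = miss[i], ~miss[i]
            if not o.any():
                # Nothing observed in this row: draw from the marginal model.
                filled[i] = rng.multivariate_normal(mu, cov)
                continue
            # Conditional normal of the missing block given the observed block.
            k = cov[np.ix_(m, o)] @ np.linalg.inv(cov[np.ix_(o, o)])
            cond_mu = mu[m] + k @ (X[i, o] - mu[o])
            cond_cov = cov[np.ix_(m, m)] - k @ cov[np.ix_(o, m)]
            cond_cov = (cond_cov + cond_cov.T) / 2  # remove rounding asymmetry
            filled[i, m] = rng.multivariate_normal(cond_mu, cond_cov)
    return filled

def multiple_impute(X, m=5, n_iter=20, seed=0):
    """Produce m completed data sets, each from an independent random stream."""
    return [impute_once(X, n_iter, seed + j) for j in range(m)]

# Synthetic example: 200 objects, 3 correlated attributes, about 10% missing.
rng = np.random.default_rng(42)
true_cov = [[1.0, 0.6, 0.3], [0.6, 1.0, 0.5], [0.3, 0.5, 1.0]]
data = rng.multivariate_normal([0.0, 1.0, 2.0], true_cov, size=200)
data[rng.random(data.shape) < 0.1] = np.nan

completed = multiple_impute(data, m=5)
# Pool a statistic across the m completed data sets by averaging.
pooled_mean = np.mean([c.mean(axis=0) for c in completed], axis=0)
print(pooled_mean)
```

Because the sketch plugs in point estimates of the mean and covariance rather than drawing them from a posterior, the spread across the completed data sets understates the true imputation uncertainty. A fuller treatment would also sample the parameters at each iteration.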

Authors: WANG Li-wen; HUANG Xu (Xi'an University of Science and Technology, Xi'an, Shaanxi 710054, China)

Affiliation: Xi'an University of Science and Technology

Source: Computer Simulation (《计算机仿真》), PKU Core Journal, 2019, No. 7, pp. 367-370 (4 pages)

Keywords: Big data analysis; Incomplete data; Multiple imputations

Classification: TP274 [Automation and Computer Technology: Detection Technology and Automatic Equipment]
