Abstract
Imputing incomplete data in big data analysis can effectively improve data utilization. Accurate multiple imputation of incomplete data requires computing the mean and standard deviation of the vector attribute values of all data and applying them repeatedly in the imputation simulation. Traditional methods consider the relationships among the imputation variables and fill missing values according to their correlation with the data to be imputed, but they neglect to compute the standard deviation of all data, which leads to low imputation efficiency. This paper proposes a logistic-based method for accurate multiple imputation of incomplete data in big data analysis. First, the mean and standard deviation of the attribute values of all data vectors are calculated, and the mean vector and covariance function of the data are obtained by estimation. The missing values of each observed object are then imputed by independent simulation, and a logistic regression model selects the imputation values for the variables with missing values, yielding complete data. Next, the mean vector and covariance function are re-estimated and applied again in the imputation simulation. This process is iterated until the iteration condition is met, and the multiple imputation results for the incomplete data are output. Experiments show that the proposed method achieves high imputation efficiency and can lay a foundation for further research in this field.
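The estimate, simulate, and re-estimate loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a multivariate normal model, draws each missing entry from its conditional distribution given the observed entries of the same row, and leaves out the logistic regression selection step because the abstract does not specify it. The function names `impute_once` and `multiple_impute`, the use of NaN to mark missing values, the small ridge term, and the stopping rule (a maximum iteration count or a small change in the column means) are all assumptions made for illustration.

```python
import numpy as np


def impute_once(X, rng, max_iter=50, tol=1e-4):
    """Return one stochastically imputed copy of X (NaN marks missing entries)."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    # Initial fill: column means of the observed values.
    filled = np.where(miss, np.nanmean(X, axis=0), X)
    for _ in range(max_iter):
        # Estimate the mean vector and covariance from the current completed data.
        mu = filled.mean(axis=0)
        # Small ridge keeps the covariance invertible (an assumption of this sketch).
        cov = np.cov(filled, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        new = filled.copy()
        # Simulate the missing values of each observation independently.
        for i in np.flatnonzero(miss.any(axis=1)):
            m, o = miss[i], ~miss[i]
            if not o.any():
                # Nothing observed in this row: draw from the marginal distribution.
                new[i] = rng.multivariate_normal(mu, cov)
                continue
            c_mo = cov[np.ix_(m, o)]
            # gain = C_mo @ inv(C_oo), computed with a linear solve.
            gain = np.linalg.solve(cov[np.ix_(o, o)], c_mo.T).T
            cond_mu = mu[m] + gain @ (X[i, o] - mu[o])
            cond_cov = cov[np.ix_(m, m)] - gain @ c_mo.T
            cond_cov = (cond_cov + cond_cov.T) / 2  # enforce symmetry
            new[i, m] = rng.multivariate_normal(cond_mu, cond_cov)
        converged = np.abs(new.mean(axis=0) - mu).max() < tol
        filled = new
        if converged:
            break
    return filled


def multiple_impute(X, m=5, seed=0):
    """Produce m completed data sets from independent imputation runs."""
    rng = np.random.default_rng(seed)
    return [impute_once(X, rng) for _ in range(m)]
```

Because the draws are random, the tolerance test is rarely met and the loop usually stops at `max_iter`. Each completed data set keeps the observed entries unchanged, and downstream estimates would normally be pooled across the `m` copies.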
Authors
WANG Li-wen (王丽雯); HUANG Xu (黄旭)
Xi'an University of Science and Technology, Xi'an, Shaanxi 710054, China
Source
Computer Simulation (《计算机仿真》)
Indexed in the Peking University Core Journals list (北大核心)
2019, No. 7, pp. 367-370 (4 pages)
Keywords
Big data analysis
Incomplete data
Multiple imputations