Abstract
Big data is characterized by large volume, diverse sources and formats, rapid growth, low value density, and difficulty of processing. Even when a point estimate for a segment of data, computed with reasonably designed parameters, is quite satisfactory, the precision estimate obtained by applying standard statistical procedures to the entire data body may still be unsatisfactory and therefore misleading. The purpose of this paper is to analyze the main factors that affect big data cleaning. The paper first reviews the dependence of data acquisition on time series and constructs a big data model. After listing some properties used in data estimation, it presents the regression analysis used in data cleaning and discusses possible influences on the estimation of the regression coefficients. Finally, it gives a general representation of error accumulation in big data processing and raises a question roughly analogous to the distinction between short-range and long-range dependence in time-series theory.
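As a minimal illustration of the regression-based cleaning the abstract refers to (this is a generic sketch, not the paper's actual procedure; the function name, the threshold parameter `k`, and the injected spike are all assumptions for the example):

```python
import numpy as np

def clean_by_regression(t, y, k=3.0):
    """Fit y ~ a*t + b by least squares and drop points whose
    residual exceeds k standard deviations of the residuals."""
    a, b = np.polyfit(t, y, 1)           # slope and intercept
    resid = y - (a * t + b)              # regression residuals
    keep = np.abs(resid) <= k * resid.std()
    return t[keep], y[keep]

# Example: a clean linear trend with one injected gross error.
t = np.arange(100.0)
y = 2.0 * t + 1.0
y[50] += 500.0                           # the spike to be removed
t2, y2 = clean_by_regression(t, y)
print(len(y2))                           # the spike is flagged and dropped
```

A single pass like this flags only gross errors; as the abstract notes, accumulated errors and long-range dependence across the whole data body require more careful treatment than such pointwise rules.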
Author
KANG Kunpeng (康鲲鹏), School of Information Technology, Shangqiu Normal University, Shangqiu, Henan 476000, PR China
Source
Jiangxi Science (《江西科学》)
2018, No. 4, pp. 654-657 (4 pages)
Funding
Science and Technology Research Project of Henan Province (No. 182102210486)
Key Scientific Research Project of Higher Education Institutions of Henan Province (No. 18A520008)
Keywords
data cleaning
variance components
big data
long-range dependence
multilevel model
time series