Research on Data Cleaning Based on Big Data (基于大数据的数据清洗研究)
Cited by: 2
Abstract: Big data is characterized by large volume, diverse sources and formats, rapid growth, low value density, and difficulty of processing. Even when a point estimate computed on a segment of the data with well-chosen design parameters is quite satisfactory, precision estimates obtained by applying standard statistical procedures to the entire data body can be unsatisfactory and therefore misleading. This paper analyzes the main factors affecting big data cleaning. It first reviews the dependence of data acquisition on time series and constructs a big data model. After listing some properties used in data estimation, it presents a regression analysis for data cleaning and discusses the possible effects on the estimation of regression coefficients. Finally, it gives a general representation of error accumulation in big data processing and raises a question roughly analogous to the distinction between short-range and long-range dependence in time series theory.
Author: KANG Kunpeng (康鲲鹏), School of Information Technology, Shangqiu Normal University, Shangqiu 476000, Henan, PR China
Source: Jiangxi Science (《江西科学》), 2018, No. 4, pp. 654-657 (4 pages)
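Funding: Henan Province Science and Technology Research Project (No. 182102210486); Key Scientific Research Project of Higher Education Institutions of Henan Province (No. 18A520008)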
Keywords: data cleaning; variance components; big data; long-range dependence; multilevel model; time series
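
The abstract's claim that standard precision estimates can mislead when observations depend on one another can be illustrated with a small calculation. The sketch below is not the paper's model: it assumes a stationary AR(1) series with lag-1 autocorrelation `rho` (a short-range dependent case chosen only for illustration), and the function name `variance_inflation` is hypothetical. It computes how many times larger the true variance of the sample mean is than the textbook i.i.d. value sigma^2 / n.

```python
def variance_inflation(n: int, rho: float) -> float:
    """Ratio of the true variance of the sample mean to sigma^2 / n.

    Assumes a stationary series whose lag-k autocorrelation is rho**k
    (AR(1)); this is an illustrative assumption, not the paper's model.
    Uses Var(mean) = (sigma^2 / n) * (1 + 2 * sum_{k=1}^{n-1} (1 - k/n) * rho**k).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    weighted = sum((1.0 - k / n) * rho ** k for k in range(1, n))
    return 1.0 + 2.0 * weighted


if __name__ == "__main__":
    # Compare independent data (rho = 0) with increasingly dependent data.
    for rho in (0.0, 0.5, 0.9):
        factor = variance_inflation(1000, rho)
        print(f"rho={rho:.1f}: true variance is {factor:.2f} x the i.i.d. formula")
```

With `rho = 0` the factor is exactly 1, so the standard formula is correct. With positive autocorrelation the factor exceeds 1, which means a naive standard error understates the real uncertainty. For AR(1) data the factor levels off as n grows; under long-range dependence the autocorrelations are not summable and the factor keeps growing with n, which is why the distinction drawn in the abstract matters.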
