
An Improvement of Data Cleaning Method for Grain Big Data Processing Using Task Merging (Cited by: 1)

Abstract: Data quality strongly influences the application of grain big data, so data cleaning is a necessary and important task. In the MapReduce framework, parallel techniques are often used to run data cleaning in a highly scalable manner, but because of the lack of effective design, the cleaning process contains a large amount of redundant computation, which lowers performance. In this research, we found that some tasks are often carried out multiple times on the same input files, or require the same operation results during data cleaning. To address this problem, we propose a new optimization technique based on task merging. By merging simple or redundant computations on the same input files, the number of loop computations in MapReduce can be greatly reduced. Experiments show that this approach significantly reduces the overall system runtime, which indicates that the data cleaning process is optimized. In this paper, we optimize several data cleaning modules, including entity identification, inconsistent data restoration, and missing value filling. Experimental results show that the proposed method increases the efficiency of grain big data cleaning.
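The task-merging idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation and does not use Hadoop. It is plain Python that assumes two hypothetical cleaning steps (`fill_missing`, `normalize_variety`) and a hypothetical `CountingSource` wrapper that counts full scans of the input. It only shows that applying several tasks in one pass over the same input yields the same results as running them separately, with fewer scans.

```python
from typing import Callable, Dict, Iterator, List

Record = Dict[str, str]
Task = Callable[[Record], Record]


def fill_missing(rec: Record) -> Record:
    """Hypothetical step: fill an empty 'moisture' field with a placeholder."""
    out = dict(rec)
    if not out.get("moisture"):
        out["moisture"] = "unknown"
    return out


def normalize_variety(rec: Record) -> Record:
    """Hypothetical step: make the 'variety' field consistent in case and spacing."""
    out = dict(rec)
    out["variety"] = out.get("variety", "").strip().lower()
    return out


class CountingSource:
    """Hypothetical input wrapper that counts how many full scans are made."""

    def __init__(self, records: List[Record]) -> None:
        self.records = list(records)
        self.scans = 0

    def scan(self) -> Iterator[Record]:
        self.scans += 1
        return iter(self.records)


def run_separately(source: CountingSource, tasks: Dict[str, Task]) -> Dict[str, List[Record]]:
    """Unmerged: every task reads the same input in its own pass."""
    return {name: [task(rec) for rec in source.scan()] for name, task in tasks.items()}


def run_merged(source: CountingSource, tasks: Dict[str, Task]) -> Dict[str, List[Record]]:
    """Merged: a single pass over the input applies every task to each record."""
    results: Dict[str, List[Record]] = {name: [] for name in tasks}
    for rec in source.scan():
        for name, task in tasks.items():
            results[name].append(task(rec))
    return results


if __name__ == "__main__":
    data = [{"variety": " Wheat ", "moisture": ""}, {"variety": "RICE", "moisture": "13.5"}]
    tasks = {"fill": fill_missing, "norm": normalize_variety}
    separate_src, merged_src = CountingSource(data), CountingSource(data)
    same = run_separately(separate_src, tasks) == run_merged(merged_src, tasks)
    print(same, separate_src.scans, merged_src.scans)
```

In a real MapReduce setting the saving would come from fewer job launches and fewer reads of the input files, which this in-memory sketch only approximates through the scan counter.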
Source: Journal of Computer and Communications (电脑和通信, English edition), 2020, No. 3, pp. 1-19 (19 pages)
Keywords: Grain Big Data; Data Cleaning; Task Merging; Hadoop; MapReduce
