An Improvement of Data Cleaning Method for Grain Big Data Processing Using Task Merging (Cited by: 1)

Abstract: Data quality strongly affects the application of grain big data, so data cleaning is a necessary and important task. In the MapReduce framework, parallel techniques are often used to run data cleaning in a highly scalable manner, but for lack of effective design the cleaning process contains a large amount of redundant computation, which lowers performance. In this research, we found that during data cleaning some tasks are often carried out multiple times on the same input files, or require the same operation results. To address this problem, we propose a new optimization technique based on task merging. By merging simple or redundant computations on the same input files, the number of MapReduce computation rounds can be greatly reduced. Experiments show that this significantly reduces the overall system runtime, which demonstrates that the data cleaning process is optimized. In this paper, we optimize several data cleaning modules, including entity identification, inconsistent data restoration, and missing value filling. Experimental results show that the proposed method can increase the efficiency of grain big data cleaning.
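The task-merging idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it replaces Hadoop with a toy in-process MapReduce loop, and the record fields (`depot`, `moisture`, `temp`), the two mappers, and every function name are hypothetical. It only shows the assumed core mechanism, namely that two cleaning tasks reading the same input can share one scan when their mappers are combined and their output keys are tagged by task.

```python
from collections import defaultdict

# Hypothetical grain records; None marks a missing value.
RECORDS = [
    {"depot": "Depot-A", "moisture": 12.5, "temp": 18.0},
    {"depot": "depot-a ", "moisture": None, "temp": 19.5},
    {"depot": "Depot-B", "moisture": 13.1, "temp": None},
]


class CountingSource:
    """Wraps the input so that each full scan over it is counted."""

    def __init__(self, records):
        self.records = list(records)
        self.scans = 0

    def __iter__(self):
        self.scans += 1
        return iter(self.records)


def run_job(source, mapper, reducer):
    """Toy single-process MapReduce job: one scan of `source` per call."""
    groups = defaultdict(list)
    for record in source:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def map_missing(record):
    """Missing-value task: emit a count for every empty field."""
    for field, value in record.items():
        if value is None:
            yield ("missing", field), 1


def map_entity(record):
    """Entity-identification stand-in: group records by a normalized name."""
    yield ("entity", record["depot"].strip().lower()), 1


def sum_reducer(key, values):
    """Both tasks only need counts, so they can share one reducer."""
    return sum(values)


def merge_mappers(*mappers):
    """Task merging: run several mappers over each record in a single scan.

    Keys are already tagged with the task name, so the merged output can be
    split back into per-task results afterwards.
    """

    def merged(record):
        for mapper in mappers:
            yield from mapper(record)

    return merged
```

Running `run_job` once per task scans the input twice, whereas `run_job(source, merge_mappers(map_missing, map_entity), sum_reducer)` scans it once and yields the same tagged results. On a real cluster the saving would come from avoiding repeated job start-up and input reads, which this sketch does not model.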

Authors: Feiyu Lian, Maixia Fu, Xingang Ju

Affiliation: Key Laboratory of Grain Information Processing and Control (Henan University of Technology)

Source: Journal of Computer and Communications, 2020, Issue 3, pp. 1-19 (19 pages)

Keywords: Grain Big Data; Data Cleaning; Task Merging; Hadoop; MapReduce

Classification: TP3 [Automation and Computer Technology: Computer Science and Technology]
