Abstract
The exponential growth of data across all industries has made data problems such as duplicate records, missing values, erroneous records, and meaningless outliers increasingly difficult to handle in data warehouse construction and management, knowledge discovery in databases, and overall data quality management. These three areas are also the main domains of data cleaning. Based on the current situation, and drawing on the data processing platforms used by major enterprises, this paper studies the use of relevant components of the Hadoop platform to clean both exact-duplicate and near-duplicate records in enterprise data.
Authors
FAN Hui-li;PENG Ning;REN Wei(College of Information Engineering,North China University of Science and Technology,Tangshan 063210,China)
Source
《电脑知识与技术》
2020, No. 5, pp. 27-28 (2 pages)
Computer Knowledge and Technology