
Research on Data Cleaning Based on the Hadoop Platform (cited by: 3)

Data Cleaning Based on Hadoop Architecture
Abstract: The exponential growth of data across industries has made data problems such as duplicate records, missing values, erroneous records, and meaningless outliers increasingly difficult to handle in data warehouse construction and management, knowledge discovery in databases, and total data quality management. These three areas are also the main application areas of data cleaning. Given this situation, and building on the data processing platforms currently used by major enterprises, this paper uses components of the Hadoop platform to study the cleaning of completely duplicate data and similar duplicate data in enterprises.
Authors: FAN Hui-li (范会丽), PENG Ning (彭宁), REN Wei (任薇) (College of Information Engineering, North China University of Science and Technology, Tangshan 063210, China)
Source: Computer Knowledge and Technology (《电脑知识与技术》), 2020, No. 5, pp. 27-28 (2 pages)
Keywords: Hadoop platform; data cleaning; completely duplicate data; similar duplicate data
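
The abstract names two cleaning targets, completely duplicate data and similar duplicate data, but does not describe the procedure the paper uses. The following is only a minimal single-machine sketch of the general idea in a map/reduce style, not the authors' implementation. The function names, the blocking key (first character of the first field), the similarity measure (Python's `difflib.SequenceMatcher`), and the 0.85 threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(record):
    """Canonical form of a record: trim and lowercase every field."""
    return tuple(field.strip().lower() for field in record)


def remove_exact_duplicates(records):
    """'Map' each record to its normalized form as the key;
    'reduce' by keeping the first record seen for each key."""
    groups = {}
    for record in records:
        groups.setdefault(normalize(record), record)
    return list(groups.values())


def find_similar_duplicates(records, threshold=0.85):
    """Group records by a blocking key, then compare pairs inside each block.

    The blocking key and threshold are hypothetical choices for this sketch.
    """
    blocks = defaultdict(list)
    for record in records:
        key = normalize(record)[0][:1]  # first character of the first field
        blocks[key].append(record)

    pairs = []
    for block in blocks.values():
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                a = " ".join(normalize(block[i]))
                b = " ".join(normalize(block[j]))
                # ratio() is a similarity score in [0, 1]
                if SequenceMatcher(None, a, b).ratio() >= threshold:
                    pairs.append((block[i], block[j]))
    return pairs


if __name__ == "__main__":
    records = [
        ("Alice Smith", "Beijing"),
        ("alice smith ", "beijing"),   # exact duplicate after normalization
        ("Alice Smyth", "Beijing"),    # similar duplicate (one letter differs)
        ("Bob Lee", "Tangshan"),
    ]
    unique = remove_exact_duplicates(records)
    print(unique)
    print(find_similar_duplicates(unique))
```

On a real Hadoop cluster the normalized record (or the blocking key) would typically serve as the shuffle key, so that candidate duplicates arrive at the same reducer; the pairwise comparison inside each block then corresponds to the reduce step.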
