
Survey of structured data cleaning methods (结构化数据清洗技术综述)

Cited by: 65
Abstract: Data cleaning is the process of detecting and repairing dirty data, and it is the foundation of data analysis and management. This paper classifies and summarizes classical and emerging data cleaning techniques and identifies directions for further research. It first formally defines the cleaning problem for structured data, then describes in detail the detection techniques for four kinds of data noise: missing data, redundant data, conflicting data, and erroneous data. Noise elimination techniques are then surveyed by cleaning approach: integrity-constraint-based, rule-based, statistical, and human-in-the-loop data cleaning algorithms. Commonly used evaluation datasets and noise injection tools are introduced, and open problems and key future research directions are discussed.
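To make the four noise types named in the abstract concrete, here is a minimal Python sketch of one simple detector per type. It is not taken from the paper. The toy rows, the attribute names, the functional dependency `zip -> city`, and the 3.5 cutoff on a median-based score are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import median

# Toy relation; every value here is invented for illustration.
ROWS = [
    {"id": 1, "name": "Ann", "zip": "10001", "city": "New York", "age": 34},
    {"id": 2, "name": "Bob", "zip": "10001", "city": "Newark", "age": 29},
    {"id": 3, "name": "Ann", "zip": "10001", "city": "New York", "age": 34},
    {"id": 4, "name": "Cid", "zip": "60601", "city": None, "age": 31},
    {"id": 5, "name": "Dee", "zip": "60601", "city": "Chicago", "age": 310},
    {"id": 6, "name": "Eve", "zip": "94105", "city": "San Francisco", "age": 27},
]


def find_missing(rows):
    """Missing data: return (id, attribute) pairs whose value is None."""
    return [(r["id"], k) for r in rows for k, v in r.items() if v is None]


def find_duplicates(rows, key=("name", "zip", "age")):
    """Redundant data: group ids of rows that agree exactly on the key attributes."""
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[k] for k in key)].append(r["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


def find_fd_violations(rows, lhs="zip", rhs="city"):
    """Conflicting data: lhs values that map to more than one non-null rhs value."""
    seen = defaultdict(set)
    for r in rows:
        if r[rhs] is not None:
            seen[r[lhs]].add(r[rhs])
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}


def find_outliers(rows, attr="age", threshold=3.5):
    """Erroneous data: flag values whose median-absolute-deviation score exceeds threshold."""
    values = [r[attr] for r in rows if r[attr] is not None]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [
        r["id"]
        for r in rows
        if r[attr] is not None and 0.6745 * abs(r[attr] - med) / mad > threshold
    ]


if __name__ == "__main__":
    print(find_missing(ROWS))
    print(find_duplicates(ROWS))
    print(find_fd_violations(ROWS))
    print(find_outliers(ROWS))
```

Each detector only reports suspect cells or rows. Choosing a repair, for example which city to keep for a conflicting zip code, is the separate elimination step that the constraint-based, rule-based, statistical, and human-in-the-loop approaches address.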
Authors: HAO Shuang; LI Guoliang; FENG Jianhua; WANG Ning (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China; Database Group, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China)
Source: Journal of Tsinghua University (Science and Technology) (《清华大学学报(自然科学版)》), 2018, No. 12, pp. 1037-1050 (14 pages). Indexed in: EI, CAS, CSCD, PKU Core.
Funding: National Key Research and Development Program of China (2018YFC0809800); National Natural Science Foundation of China (61373024, 61632016, 61422205, 61521002)
Keywords: data cleaning; dirty data; error detection; error elimination
