Abstract
Data cleaning is the process of detecting and repairing dirty data and is a foundation for data analysis and management. This paper classifies and summarizes traditional and emerging data cleaning techniques and identifies potential directions for further work. It first formally defines the cleaning problem for structured data and then describes detection methods for four types of data noise: missing data, redundant data, conflicting data, and erroneous data. Noise elimination techniques are then summarized by cleaning approach, including integrity-constraint-based, rule-based, statistical, and human-in-the-loop data cleaning algorithms. Commonly used evaluation datasets and noise injection tools are introduced, and open research problems and future research directions are discussed.
Authors
郝爽
李国良
冯建华
王宁
HAO Shuang; LI Guoliang; FENG Jianhua; WANG Ning (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China; Database Group, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China)
Source
《清华大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2018, Issue 12, pp. 1037-1050 (14 pages)
Journal of Tsinghua University(Science and Technology)
Funding
National Key Research and Development Program of China (2018YFC0809800)
National Natural Science Foundation of China (61373024, 61632016, 61422205, 61521002)
Keywords
data cleaning (数据清洗)
dirty data (数据噪声)
error detection (噪声检测)
error elimination (噪声消除)