Abstract
Data cleaning algorithms designed for a single data quality issue often do not transfer well to settings where several cleaning requirements coexist, so multiple cleaning methods can be combined to cover them. This paper formulates data cleaning as a problem of evidence set generation and selection. Using an incremental quality assessment scheme based on aggregate queries and an operator result selection scheme based on intermediate operator evidence sets, it achieves efficient data cleaning that combines multiple cleaning methods across a variety of cleaning tasks. In the proposed cleaning model, the upstream operator repository produces data cleaning results and converts them into intermediate operators. The midstream sampler partitions and prunes the set of intermediate operators to provide the searcher with high-quality candidate evidence sets. The downstream searcher, guided by the quality evaluator, selects evidence sets; once the search completes, it passes updated data and the necessary parameters back to the upstream operator repository, which then generates intermediate operators again in the next iteration. Finally, extensive experiments are conducted on three real-world datasets of different scales. Performance under different data cleaning tasks is used to verify the feasibility of operator orchestration for arbitrary types of cleaning requirements, and the proposed method is compared with existing intelligent data cleaning systems. The results show that the proposed method maintains stable precision and recall under multiple data quality constraints, dynamic data, and large-scale data cleaning, and that, for the same cleaning time, its performance on outlier, rule-violation, and mixed-error cleaning tasks exceeds that of other intelligent data cleaning systems by more than 15%.
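The generate, prune, select, and feed-back loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Evidence` type, the two operators, the validity range, and the greedy search are all hypothetical stand-ins chosen for brevity, and the quality score is recomputed in full on each check rather than maintained incrementally as the paper proposes.

```python
from dataclasses import dataclass
from statistics import median
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[float]]
LOW, HIGH = -50.0, 60.0  # assumed validity range for the toy "temp" column


@dataclass(frozen=True)
class Evidence:
    """A proposed cell repair emitted by one cleaning operator (hypothetical)."""
    operator: str
    row: int
    new_value: float


def is_valid(value: Optional[float]) -> bool:
    return value is not None and LOW <= value <= HIGH


def valid_median(data: List[Row]) -> Optional[float]:
    valid = [r["temp"] for r in data if is_valid(r["temp"])]
    return median(valid) if valid else None


def impute_missing(data: List[Row]) -> List[Evidence]:
    """Toy operator: propose the median of valid values for missing cells."""
    fill = valid_median(data)
    if fill is None:
        return []
    return [Evidence("impute", i, fill)
            for i, r in enumerate(data) if r["temp"] is None]


def repair_outliers(data: List[Row]) -> List[Evidence]:
    """Toy operator: propose the median of valid values for out-of-range cells."""
    fill = valid_median(data)
    if fill is None:
        return []
    return [Evidence("outlier", i, fill)
            for i, r in enumerate(data)
            if r["temp"] is not None and not is_valid(r["temp"])]


def quality(data: List[Row]) -> float:
    """Aggregate-style score: fraction of cells that are valid."""
    return sum(is_valid(r["temp"]) for r in data) / len(data)


def sample(candidates: List[Evidence], limit: int) -> List[Evidence]:
    """Prune: keep the first proposal per row, then cap the candidate count."""
    first_per_row: Dict[int, Evidence] = {}
    for ev in candidates:
        first_per_row.setdefault(ev.row, ev)
    return list(first_per_row.values())[:limit]


def search(data: List[Row], candidates: List[Evidence]) -> List[Evidence]:
    """Greedy selection: keep a repair only if it raises the quality score."""
    accepted = []
    for ev in candidates:
        before = quality(data)
        old = data[ev.row]["temp"]
        data[ev.row]["temp"] = ev.new_value
        if quality(data) > before:
            accepted.append(ev)
        else:
            data[ev.row]["temp"] = old  # roll back an unhelpful repair
    return accepted


def clean(data: List[Row],
          operators: List[Callable[[List[Row]], List[Evidence]]],
          limit: int = 10, max_rounds: int = 5) -> List[Row]:
    data = [dict(r) for r in data]  # work on a copy
    for _ in range(max_rounds):
        # Operators see the data updated by the previous round.
        candidates = [ev for op in operators for ev in op(data)]
        if not search(data, sample(candidates, limit)):
            break  # nothing accepted, so another round would not help
    return data


dirty = [{"temp": 21.0}, {"temp": 22.0}, {"temp": None},
         {"temp": 999.0}, {"temp": 20.0}]
cleaned = clean(dirty, [impute_missing, repair_outliers])
```

Here two single-purpose operators cooperate on one column: each proposes repairs, the sampler removes competing proposals for the same row, and the search step accepts only repairs that improve the aggregate score before the operators are run again on the updated data.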
Authors
QIAN Zekai (钱泽凯)
DING Xiaoou (丁小欧)
SUN Zhe (孙哲)
WANG Hongzhi (王宏志)
ZHANG Yan (张岩)
College of Computer Science and Technology, Harbin Institute of Technology, Harbin 150006, China
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journals (北大核心)
2024, No. 8, pp. 124-132 (9 pages)
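Funding
National Key Research and Development Program of China (2021YFB3300502)
National Natural Science Foundation of China (62232005, 62202126)
China Postdoctoral Science Foundation (2022M720957)
Heilongjiang Provincial Postdoctoral Funding Program (LBH-Z21137)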
Keywords
Data cleaning
Data quality assessment
Pipeline system design
Operator selection
Evidence set