Intelligent Evidence Set Selection Method for Diverse Data Cleaning Tasks


Abstract: Data cleaning algorithms designed for a single, specific data quality problem do not always work well when multiple cleaning requirements coexist, so several cleaning methods can be combined to cover the full range of requirements. This paper reformulates data cleaning as a problem of generating and selecting evidence sets. Using an incremental quality assessment scheme based on aggregate queries and an operator result selection scheme based on the evidence sets of intermediate operators, it achieves efficient data cleaning in which multiple cleaning methods cooperate across multiple cleaning tasks. In the proposed cleaning model, the upstream operator library produces data cleaning results and converts them into intermediate operators. The midstream sampler partitions and prunes the set of intermediate operators to give the searcher high-quality candidate evidence sets. The downstream searcher selects evidence sets under the guidance of the quality evaluator; once the search completes, the data and necessary parameters are updated back to the upstream operator library, which then generates intermediate operators again in the next iteration. Finally, extensive experiments are conducted on three real-world datasets of different scales. Performance under different data cleaning tasks verifies the feasibility of operator orchestration for arbitrary kinds of data cleaning requirements, and the proposed method is compared with existing intelligent data cleaning systems. The results show that, across multiple cleaning tasks, the proposed method maintains stable precision and recall under multiple data quality constraints, dynamic data, and large-scale data cleaning, and that, for the same cleaning time, it outperforms other intelligent data cleaning systems by more than 15% on outlier, rule violation, and mixed-error cleaning tasks.
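The loop described in the abstract (operator library, sampler, searcher, quality evaluator, then feedback to the operator library) can be illustrated with a toy sketch. This is not the authors' implementation: every name below (`row_violations`, `fix_outliers`, `fix_rule`, `sample`, `search`, `clean`), the two constraints, and the greedy acceptance rule are assumptions made for illustration only. An "evidence" here is simply a proposed cell repair, and the quality check is incremental only in the narrow sense that it re-scores the single row a repair touches.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Evidence = Tuple[int, str, object]          # (row index, column, proposed value)
Operator = Callable[[List[Row]], List[Evidence]]


def row_violations(r: Row) -> int:
    """Toy quality evaluator: count violations of two made-up constraints."""
    bad = 0
    if not (0 <= r["age"] <= 120):                        # range constraint (outlier)
        bad += 1
    if r["zip"] == "150006" and r["city"] != "Harbin":    # hypothetical rule: zip -> city
        bad += 1
    return bad


def fix_outliers(rows: List[Row]) -> List[Evidence]:
    """Hypothetical operator: propose a middle valid age for out-of-range ages."""
    ages = sorted(r["age"] for r in rows if 0 <= r["age"] <= 120)
    if not ages:
        return []
    median = ages[len(ages) // 2]
    return [(i, "age", median) for i, r in enumerate(rows)
            if not (0 <= r["age"] <= 120)]


def fix_rule(rows: List[Row]) -> List[Evidence]:
    """Hypothetical operator: repair the city of rows that break the zip rule."""
    return [(i, "city", "Harbin") for i, r in enumerate(rows)
            if r["zip"] == "150006" and r["city"] != "Harbin"]


def sample(candidates: List[Evidence], limit: int) -> List[Evidence]:
    """Toy sampler: drop duplicate proposals and keep at most `limit` of them."""
    seen, kept = set(), []
    for ev in candidates:
        if ev not in seen:
            seen.add(ev)
            kept.append(ev)
    return kept[:limit]


def search(rows: List[Row], candidates: List[Evidence]) -> List[Evidence]:
    """Greedy searcher: accept a repair only if it lowers that row's violation count."""
    accepted = []
    for i, col, val in candidates:
        trial = dict(rows[i])
        trial[col] = val
        if row_violations(trial) < row_violations(rows[i]):
            rows[i] = trial
            accepted.append((i, col, val))
    return accepted


def clean(rows: List[Row], operators: List[Operator],
          max_rounds: int = 5, limit: int = 100) -> List[Row]:
    """Iterate: operators propose, sampler prunes, searcher selects, data feeds back."""
    rows = [dict(r) for r in rows]          # work on a copy of the input
    for _ in range(max_rounds):
        candidates = [ev for op in operators for ev in op(rows)]
        if not search(rows, sample(candidates, limit)):
            break                           # no repair improved quality; stop
    return rows


dirty = [
    {"age": 34, "zip": "150006", "city": "Harbin"},
    {"age": 250, "zip": "150006", "city": "Harbn"},
    {"age": 41, "zip": "100084", "city": "Beijing"},
]
cleaned = clean(dirty, [fix_outliers, fix_rule])
```

In this sketch the feedback step is just the next loop iteration running the operators on the repaired rows; the parameter updates mentioned in the abstract are not modeled.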

Authors: QIAN Zekai; DING Xiaoou; SUN Zhe; WANG Hongzhi; ZHANG Yan (College of Computer Science and Technology, Harbin Institute of Technology, Harbin 150006, China)

Affiliation: College of Computer Science and Technology, Harbin Institute of Technology

Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2024, No. 8, pp. 124-132 (9 pages)

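Funding: National Key Research and Development Program of China (2021YFB3300502); National Natural Science Foundation of China (62232005, 62202126); China Postdoctoral Science Foundation (2022M720957); Heilongjiang Postdoctoral Funding Program (LBH-Z21137).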

Keywords: Data cleaning; Data quality assessment; Pipeline system design; Operator selection; Evidence set

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]


