
About the Detection of Approximately Duplicated Records for Big Data Based on K-modes Clustering Grouping (cited by: 1)
Abstract: Approximately duplicate record detection is an important step in data cleaning, and big data environments place higher demands on the efficiency and accuracy of detection methods. This paper proposes KCG, a clustering-and-grouping algorithm for detecting approximately duplicate records in big data. The method first uses an improved K-modes clustering to partition the data into groups, then applies a Pair-wise comparison algorithm within each group to detect all approximately duplicate records. Experimental results show that the method clearly improves both the efficiency and the accuracy of approximately duplicate record detection in big data environments.
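The two-stage idea in the abstract (group first, then compare pairs only within each group) can be illustrated with a minimal Python sketch. The abstract does not describe the paper's improved K-modes or its similarity measure, so everything here is an assumption for illustration: plain K-modes with Hamming distance and farthest-first seeding, a character-level similarity from `difflib.SequenceMatcher`, a threshold of 0.9, and the hypothetical names `k_modes`, `detect_duplicates`, and `records`. It is not the authors' KCG implementation.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def hamming(a, b):
    """Number of attribute positions where two records differ."""
    return sum(x != y for x, y in zip(a, b))


def init_modes(records, k):
    """Farthest-first seeding (an assumption, not the paper's method)."""
    modes = [records[0]]
    while len(modes) < k:
        modes.append(max(records,
                         key=lambda r: min(hamming(r, m) for m in modes)))
    return modes


def k_modes(records, k, max_iter=20):
    """Plain K-modes: assign to the nearest mode, then recompute the modes."""
    modes = init_modes(records, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: hamming(r, modes[j]))
                      for r in records]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:
                # New mode: the most frequent value in each attribute column.
                modes[j] = tuple(Counter(col).most_common(1)[0][0]
                                 for col in zip(*members))
    return labels


def similarity(a, b):
    """Character-level similarity of two records in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, " ".join(a), " ".join(b)).ratio()


def detect_duplicates(records, k, threshold=0.9):
    """Pair-wise comparison restricted to records in the same cluster."""
    groups = {}
    for idx, lab in enumerate(k_modes(records, k)):
        groups.setdefault(lab, []).append(idx)
    pairs = []
    for members in groups.values():
        for i, j in combinations(members, 2):
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return sorted(pairs)


# Toy data (invented for illustration): two typo variants plus one distinct record.
records = [
    ("John Smith", "Hefei", "teacher"),
    ("Jon Smith", "Hefei", "teacher"),
    ("Li Wei", "Wuhu", "engineer"),
    ("Li Wie", "Wuhu", "engineer"),
    ("Wang Fang", "Wuhu", "engineer"),
]

if __name__ == "__main__":
    print(detect_duplicates(records, k=2))
```

On this toy data the grouping step separates the Hefei records from the Wuhu records, so only 4 of the 10 possible pairs are compared. That reduction in comparisons is the efficiency argument the abstract makes. The accuracy of such a scheme depends on near-duplicates landing in the same group, which is presumably what the paper's improvements to K-modes address.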
Authors: ZHANG Ping; YU Shun
Source: Journal of Anhui Vocational & Technical College, 2022, No. 1, pp. 24-29 (6 pages)
Funding: 2018 Anhui Provincial Natural Science Research Project, Key Project "Research on Cleaning Approximately Duplicated Data in the Web Big Data Environment" (Project No. KJ2018A0710).
Keywords: approximately duplicated record detection; grid density; Pair-wise; KCG
