Abstract
Approximately duplicate record detection is an important step in data cleaning, and big data environments place higher demands on the efficiency and accuracy of detection methods. This paper proposes KCG, a clustering-and-grouping algorithm for detecting approximately duplicate records in big data. The method first uses an improved K-modes clustering to partition the data into groups, and then applies a pair-wise comparison algorithm within each group to detect all approximately duplicate records. Experimental results show that the method markedly improves both the efficiency and the accuracy of approximately duplicate record detection in big data environments.
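The two-stage idea in the abstract (cluster the records into groups, then compare pairs only within each group) can be sketched as follows. This is a minimal illustration, not the paper's KCG implementation: the abstract does not specify the improved K-modes variant, so plain K-modes with Hamming distance stands in for it, and the names `hamming`, `k_modes`, `detect_duplicates` and the `max_diff` threshold are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of attribute positions in which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(records, k, iters=10, seed=0):
    """Plain K-modes (assumed stand-in for the paper's improved variant).

    Returns a dict mapping group id -> list of record indices.
    Requires k <= number of distinct records.
    """
    rng = random.Random(seed)
    modes = rng.sample(sorted(set(records)), k)
    groups = {}
    for _ in range(iters):
        # Assignment step: each record joins the group of its nearest mode.
        groups = {i: [] for i in range(k)}
        for idx, rec in enumerate(records):
            nearest = min(range(k), key=lambda i: hamming(rec, modes[i]))
            groups[nearest].append(idx)
        # Update step: the new mode takes the most frequent value per attribute.
        new_modes = []
        for i in range(k):
            if not groups[i]:
                new_modes.append(modes[i])  # keep the old mode for an empty group
                continue
            cols = zip(*(records[j] for j in groups[i]))
            new_modes.append(tuple(Counter(c).most_common(1)[0][0] for c in cols))
        if new_modes == modes:
            break
        modes = new_modes
    return groups


def detect_duplicates(records, k, max_diff=1):
    """Pair-wise comparison restricted to records in the same group."""
    pairs = set()
    for members in k_modes(records, k).values():
        for i, j in combinations(members, 2):
            if hamming(records[i], records[j]) <= max_diff:
                pairs.add((i, j))
    return pairs


records = [
    ("zhang", "hefei", "math"),
    ("zhang", "hefei", "maths"),
    ("li", "wuhu", "physics"),
    ("li", "wuhu", "physic"),
]
print(sorted(detect_duplicates(records, k=2)))  # [(0, 1), (2, 3)]
```

Grouping first means the quadratic pair-wise step runs over each group rather than over the whole data set, which is where the efficiency gain claimed in the abstract would come from; the trade-off is that duplicates placed in different groups are never compared.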
Authors
ZHANG Ping (张平); YU Shun (余顺)
Source
Journal of Anhui Vocational & Technical College (《安徽职业技术学院学报》), 2022, No. 1, pp. 24-29 (6 pages)
Funding
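Key Project of the 2018 Anhui Provincial Natural Science Research Program, "Research on Cleaning Approximately Duplicate Data in the Web Big Data Environment" (Project No. KJ2018A0710).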