
Algorithm for detecting approximate duplicate records in massive data (cited by: 11)
Abstract: To address the low precision and low time efficiency of approximate duplicate record detection in massive data, an integrated weighting method and a filtering method based on string length are applied to detect approximate duplicates in a dataset. The integrated weighting method combines user experience with mathematical statistics to compute the weight of each attribute. The length-based filtering method uses the length difference between strings to terminate the edit distance computation early, which reduces the number of records to be matched during detection. The experimental results show that the weight vector computed by the integrated weighting method reflects the importance of each attribute more comprehensively and accurately, and that the length-based filtering method reduces the comparison time between records, so the approach effectively addresses approximate duplicate record detection in massive data.
Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal, 2013, No. 8, pp. 2208-2211 (4 pages)
Funding: Jiangsu Province Science and Technology Support Program (BE2011156)
Keywords: massive data; approximate duplicate record; integrated weighted method; edit distance
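The abstract describes the length filter only in words, so the following is a minimal sketch of the idea rather than the paper's own implementation. It relies on one fact that holds for any Levenshtein distance: the distance between two strings is at least the difference of their lengths, so a similarity upper bound can be checked before the dynamic program runs. The similarity formula `1 - distance / max_length`, the function names, the `threshold` parameter, the per-field application of the filter, and the choice to score a filtered-out field as 0 are all assumptions made for illustration. The attribute weights are taken as given input, since the abstract does not state how the integrated weights are computed.

```python
from typing import Optional, Sequence


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance using a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def field_similarity(a: str, b: str, threshold: float) -> Optional[float]:
    """Return a similarity in [0, 1], or None if the length filter rejects the pair."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    # The edit distance is at least the length difference, so this value
    # is an upper bound on the similarity of the two strings.
    upper_bound = 1.0 - abs(len(a) - len(b)) / longest
    if upper_bound < threshold:
        return None  # rejected without running the dynamic program
    return 1.0 - edit_distance(a, b) / longest


def record_similarity(r1: Sequence[str], r2: Sequence[str],
                      weights: Sequence[float], threshold: float) -> float:
    """Weighted sum of field similarities; filtered fields count as 0 (assumption)."""
    total = 0.0
    for x, y, w in zip(r1, r2, weights):
        sim = field_similarity(x, y, threshold)
        total += w * (sim if sim is not None else 0.0)
    return total
```

With a threshold of 0.8, for example, `field_similarity("ab", "abcdefgh", 0.8)` is rejected by the filter because the length difference alone caps the similarity at 0.25, so the edit distance is never computed for that pair.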

