Approximately Duplicate Record Detection Algorithm Based on Grid Grouping and Attribute Weights (cited by: 1)

An Improved Approximately Duplicate Records Detection Algorithm
Abstract: Traditional methods for identifying approximately duplicate records suffer from low detection efficiency and low detection precision when processing massive data. To address this, an approximately duplicate record detection algorithm based on grid grouping and attribute weights is proposed. The algorithm follows a divide-and-conquer approach: it uses a grid method to partition the massive data into groups, assigns a corresponding weight to each attribute, and then identifies approximately duplicate records. Theoretical analysis and experiments show that grid grouping effectively reduces the number of comparisons between records, and that the method based on composite attribute weights reflects each attribute's contribution to a record more accurately. Combining the two effectively addresses approximately duplicate record detection on big data.
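The abstract names two ingredients, grid grouping to limit comparisons and attribute weights to score similarity, but gives no implementation details. The following is a minimal Python sketch of that general idea, not the authors' algorithm. Everything in it is an assumption made for illustration: the function names (`grid_cell`, `weighted_similarity`, `find_duplicates`), bucketing on a numeric `age` field, the hand-picked weights and threshold, and the use of `difflib.SequenceMatcher` as the per-attribute string similarity.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def grid_cell(record, key_fields, cell_size):
    """Map a record to a grid cell by bucketing its numeric key fields.

    Assumption: the grid is built over numeric attributes with a fixed
    cell width; the paper may construct its grid differently.
    """
    return tuple(int(record[f] // cell_size) for f in key_fields)


def weighted_similarity(a, b, weights):
    """Weighted average of per-attribute string similarities, in [0, 1]."""
    total = sum(weights.values())
    score = 0.0
    for field, w in weights.items():
        ratio = SequenceMatcher(None, str(a[field]), str(b[field])).ratio()
        score += w * ratio
    return score / total


def find_duplicates(records, key_fields, cell_size, weights, threshold):
    """Compare records only within the same grid cell (divide and conquer)."""
    cells = defaultdict(list)
    for idx, rec in enumerate(records):
        cells[grid_cell(rec, key_fields, cell_size)].append(idx)

    pairs = []
    for members in cells.values():
        # Pairwise comparison is restricted to one cell, so the number of
        # comparisons depends on cell sizes rather than on the whole data set.
        for i, j in combinations(members, 2):
            if weighted_similarity(records[i], records[j], weights) >= threshold:
                pairs.append((i, j))
    return sorted(pairs)


# Hypothetical sample data: records 0 and 1 differ by a typo in the name.
records = [
    {"name": "Zhang Wei", "city": "Qingdao", "age": 31},
    {"name": "Zhang Wie", "city": "Qingdao", "age": 31},
    {"name": "Li Na", "city": "Beijing", "age": 52},
]
weights = {"name": 0.6, "city": 0.4}  # illustrative weights, not from the paper

pairs = find_duplicates(records, key_fields=("age",), cell_size=10,
                        weights=weights, threshold=0.85)
print(pairs)
```

In this sketch the third record falls into a different cell and is never compared with the first two, which is where the reduction in comparisons comes from. One limitation of such a simple grid is that near-duplicates whose key values straddle a cell boundary land in different cells and are missed; how the paper handles that case is not stated in the abstract.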
Source: Journal of Qingdao University (Natural Science Edition), CAS, 2017, No. 2, pp. 69-73 (5 pages)
Keywords: grid-based grouping; attribute weights; approximately duplicate record detection