
Improvement for the Algorithm of Detecting Approximately Duplicate Database Records Based on MPN (Cited by: 6)
Abstract: Detecting approximately duplicate records is a key problem in data cleaning. This paper proposes two improvements to the commonly used multi-pass neighborhood sorting method (MPN). First, overlapping sets of similar records are merged directly during the multi-pass sort-and-detect process, which eliminates the final step of computing the transitive closure. Second, because the keys are sorted in lexicographic order, the common leading substring of two keys is filtered out before their edit distance is computed, which reduces the cost of comparing similar records. The paper closes with experimental results comparing the improved algorithm with the original MPN algorithm.
Authors: Liu Wei (刘伟), Cao Xianbin (曹先彬)
Source: 《微计算机信息》 (Control & Automation), Peking University Core Journal, 2005, No. 08X, pp. 147-149, 3 (4 pages in total)
Keywords: data cleaning; approximately duplicate database records; string matching; MPN; transitive closure
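The two improvements named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `window` and `max_dist` parameters, the use of a union-find structure to merge overlapping record sets, and the choice to compare only the sort keys are all assumptions made for illustration. Stripping a shared prefix before running the dynamic program does not change the Levenshtein distance; it only shortens the strings being compared.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, keeping only one previous row."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def prefix_filtered_distance(a: str, b: str) -> int:
    """Skip the common leading substring, then compute the edit distance.

    Neighbours in a lexicographically sorted list tend to share a long
    prefix, so the remaining strings are usually much shorter.
    """
    k = 0
    limit = min(len(a), len(b))
    while k < limit and a[k] == b[k]:
        k += 1
    return edit_distance(a[k:], b[k:])


class DisjointSets:
    """Union-find (an assumed data structure) for merging overlapping sets."""

    def __init__(self, n: int):
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x: int, y: int) -> None:
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[ry] = rx


def detect_duplicates(records, key_funcs, window=3, max_dist=1):
    """Multi-pass sliding-window detection (hypothetical parameters).

    Each pass sorts by one key and compares every record with the
    preceding window - 1 records. Matches are merged immediately, so no
    separate transitive-closure step is needed after the last pass.
    """
    sets = DisjointSets(len(records))
    for key in key_funcs:
        order = sorted(range(len(records)), key=lambda i: key(records[i]))
        for pos, i in enumerate(order):
            for j in order[max(0, pos - window + 1):pos]:
                if sets.find(i) == sets.find(j):
                    continue  # already in the same set, skip the comparison
                if prefix_filtered_distance(key(records[i]),
                                            key(records[j])) <= max_dist:
                    sets.union(i, j)
    groups = {}
    for i in range(len(records)):
        groups.setdefault(sets.find(i), []).append(i)
    return [g for g in groups.values() if len(g) > 1]
```

Calling `detect_duplicates(["abcd", "abce", "xbce"], [lambda s: s], window=2)` places all three records in one group: "abcd" and "xbce" are never compared directly, yet both match "abce", and the overlapping sets are merged as soon as each match is found.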
