Approximately duplicate record detection method based on neural network and genetic algorithm
Abstract: To address the detection of approximately duplicate records in data cleaning, a method based on a neural network and a genetic algorithm is proposed. First, the method measures the similarity of each pair of corresponding fields in two records. Then a neural-network detection model is constructed, and a genetic algorithm is used to optimize the weights of the network. Finally, the resulting genetic neural network, trained on sample record pairs, combines the similarities across multiple fields to classify each record pair as duplicate or non-duplicate. Experimental results on datasets from different domains show that the method improves the accuracy and precision of duplicate detection over traditional methods.
Source: Computer Engineering and Design (《计算机工程与设计》), CSCD, Peking University Core Journal, 2010, No. 7, pp. 1550-1553 (4 pages)
Funding: National High Technology Research and Development Program of China (863 Program), grant 2009AAJ127
Keywords: approximately duplicate record detection; neural network; genetic algorithm; data cleaning; data integration
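
The pipeline summarized in the abstract (per-field similarity, a neural network that combines the similarities, and a genetic algorithm that tunes the network weights) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: `difflib.SequenceMatcher` stands in for the field similarity measure, the network has one hidden layer of three units, the genetic algorithm uses truncation selection with one-point crossover and Gaussian mutation, and the labeled record pairs are hypothetical toy data.

```python
import math
import random
from difflib import SequenceMatcher

HIDDEN = 3  # hidden-layer size; an arbitrary choice for this sketch


def field_similarities(rec_a, rec_b):
    """Return one similarity in [0, 1] per pair of corresponding fields."""
    return [SequenceMatcher(None, a.lower(), b.lower()).ratio()
            for a, b in zip(rec_a, rec_b)]


def sigmoid(x):
    x = max(-60.0, min(60.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, sims):
    """Forward pass of a one-hidden-layer network; output near 1 means duplicate."""
    n, pos, hidden_out = len(sims), 0, []
    for _ in range(HIDDEN):
        w, b = weights[pos:pos + n], weights[pos + n]
        hidden_out.append(sigmoid(sum(wi * s for wi, s in zip(w, sims)) + b))
        pos += n + 1
    w, b = weights[pos:pos + HIDDEN], weights[pos + HIDDEN]
    return sigmoid(sum(wi * h for wi, h in zip(w, hidden_out)) + b)


def fitness(weights, samples):
    """Negative mean squared error on the labeled samples (larger is better)."""
    return -sum((predict(weights, x) - y) ** 2 for x, y in samples) / len(samples)


def evolve(samples, n_inputs, pop_size=30, generations=80, seed=0):
    """Search the flat weight vector with a simple elitist genetic algorithm."""
    rng = random.Random(seed)
    n_weights = HIDDEN * (n_inputs + 1) + HIDDEN + 1
    pop = [[rng.uniform(-1, 1) for _ in range(n_weights)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda w: fitness(w, samples), reverse=True)
        elite = pop[:pop_size // 3]            # truncation selection, elite survives
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            cut = rng.randrange(1, n_weights)  # one-point crossover
            child = [w + rng.gauss(0, 0.3) if rng.random() < 0.2 else w
                     for w in p1[:cut] + p2[cut:]]  # Gaussian mutation
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda w: fitness(w, samples))


# Hypothetical labeled record pairs (name, address, city); 1 = duplicate.
labeled_pairs = [
    (("John Smith", "12 Main Street", "Boston"), ("Jon Smith", "12 Main St.", "Boston"), 1),
    (("Mary Jones", "5 Oak Avenue", "Denver"), ("Mary Jones", "5 Oak Ave", "Denver"), 1),
    (("Li Wei", "88 Lake Road", "Austin"), ("Li Wei", "88 Lake Rd", "Austin"), 1),
    (("John Smith", "12 Main Street", "Boston"), ("Mary Jones", "5 Oak Avenue", "Denver"), 0),
    (("Li Wei", "88 Lake Road", "Austin"), ("Jon Smith", "12 Main St.", "Boston"), 0),
    (("Mary Jones", "5 Oak Ave", "Denver"), ("Li Wei", "88 Lake Rd", "Austin"), 0),
]
samples = [(field_similarities(a, b), y) for a, b, y in labeled_pairs]
best = evolve(samples, n_inputs=3)
```

With the trained weight vector, a new pair could be scored as `predict(best, field_similarities(rec_a, rec_b))` and flagged as a probable duplicate when the score exceeds a threshold such as 0.5; that threshold is also an assumption of this sketch, not a value reported in the paper.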