
Detection of Approximately Duplicate Records Based on an Improved Genetic Neural Network (cited by: 3)

Genetic Neural Network for Detection of Approximately Duplicate Records
Abstract: This paper proposes a method for detecting approximately duplicate records based on a genetic neural network. The method exploits the non-linear mapping ability of neural networks and the global optimization capability of genetic algorithms, combining learning-based and evolutionary ideas and applying them to duplicate record detection. This avoids the attribute-weight computation required by traditional methods. The genetic neural network itself is also improved. Experimental results show that the method can effectively handle approximately duplicate record detection on large data volumes, with both good detection accuracy and good time efficiency.
Source: Computer Measurement & Control (《计算机测量与控制》), CSCD, Peking University Core Journal, 2011, Issue 5, pp. 1021-1023 (3 pages)
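Funding: Key Project of the Henan Province Science and Technology Program (102102210191); Natural Science Research Funding Program of the Henan Provincial Department of Education (2009A520013)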
Keywords: approximately duplicate records; genetic algorithms; neural network; data cleaning
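
The abstract describes the approach only at a high level, so the following is a minimal illustrative sketch rather than the authors' implementation. It assumes each record pair is reduced to a vector of per-attribute string similarities, that a small one-hidden-layer network maps this vector to a duplicate score (so no attribute weights are set by hand), and that a genetic algorithm searches the network weights. The similarity measure (`difflib.SequenceMatcher`), the network size, the GA parameters, the function names, and the sample records are all assumptions introduced for illustration; the paper's specific improvements to the genetic neural network are not reproduced here.

```python
import math
import random
from difflib import SequenceMatcher


def similarity_vector(rec_a, rec_b):
    """Per-attribute string similarity in [0, 1] for two records (tuples of strings)."""
    return [SequenceMatcher(None, a.lower(), b.lower()).ratio()
            for a, b in zip(rec_a, rec_b)]


def predict(weights, x, hidden):
    """One-hidden-layer network: tanh hidden units, sigmoid output (duplicate score)."""
    n = len(x)
    out_base = hidden * (n + 1)          # start of the output-layer weights
    z = weights[out_base + hidden]       # output bias
    for h in range(hidden):
        base = h * (n + 1)
        s = weights[base + n] + sum(weights[base + i] * x[i] for i in range(n))
        z += weights[out_base + h] * math.tanh(s)
    z = max(-60.0, min(60.0, z))         # clamp to keep exp() in range
    return 1.0 / (1.0 + math.exp(-z))


def mse(weights, samples, hidden):
    """Mean squared error over labeled similarity vectors (lower is fitter)."""
    return sum((predict(weights, x, hidden) - y) ** 2 for x, y in samples) / len(samples)


def evolve(samples, hidden=3, pop_size=30, generations=60, seed=0):
    """Search network weights with a simple GA: elitism, tournament selection,
    uniform crossover, Gaussian mutation. All parameter values are assumptions."""
    rng = random.Random(seed)
    n = len(samples[0][0])
    size = hidden * (n + 1) + hidden + 1
    pop = [[rng.uniform(-1, 1) for _ in range(size)] for _ in range(pop_size)]
    history = []                          # best error at the start of each generation

    def error(w):
        return mse(w, samples, hidden)

    for _ in range(generations):
        pop.sort(key=error)
        history.append(error(pop[0]))
        nxt = pop[:2]                     # elitism: carry over the two best unchanged
        while len(nxt) < pop_size:
            p1 = min(rng.sample(pop, 3), key=error)
            p2 = min(rng.sample(pop, 3), key=error)
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            child = [g + rng.gauss(0, 0.3) if rng.random() < 0.1 else g for g in child]
            nxt.append(child)
        pop = nxt
    best = min(pop, key=error)
    history.append(error(best))
    return best, history


# Hypothetical labeled record pairs (name, address, city); 1 = duplicate, 0 = distinct.
pairs = [
    (("John Smith", "12 Oak Street", "Boston"), ("Jon Smith", "12 Oak St.", "Boston"), 1),
    (("Mary Jones", "45 Pine Avenue", "Denver"), ("Mary Jones", "45 Pine Ave", "Denver"), 1),
    (("Wei Zhang", "8 Lake Road", "Austin"), ("Wei Zhang", "8 Lake Rd", "Austin"), 1),
    (("John Smith", "12 Oak Street", "Boston"), ("Mary Jones", "45 Pine Avenue", "Denver"), 0),
    (("Wei Zhang", "8 Lake Road", "Austin"), ("Alan Brown", "77 Hill Lane", "Seattle"), 0),
    (("Mary Jones", "45 Pine Avenue", "Denver"), ("Alan Brown", "77 Hill Lane", "Seattle"), 0),
]
samples = [(similarity_vector(a, b), y) for a, b, y in pairs]
best, history = evolve(samples)
print("training error: first generation %.4f, final %.4f" % (history[0], history[-1]))
```

Because the two fittest individuals are copied unchanged into each new generation, the best training error in this sketch can never get worse from one generation to the next. A real system would also need a blocking or sorting step to limit which record pairs are compared, which this sketch omits.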


