Amelioration method of SNM based on flexible window and ranking adjusting (基于伸缩窗口和等级调整的SNM改进方法)

Cited by: 14
Abstract: This paper analyzes the basic sorted-neighborhood method (SNM), identifies its shortcomings, and proposes an improved version of the algorithm. A variable-step flexible window changes the size of the detection window dynamically, which avoids missed matches and reduces unnecessary comparisons. A dynamic rank-adjustment method adjusts field ranks according to record similarity and converts those ranks into weights with a rank-based weighting method, which addresses the subjectivity and inaccuracy of manually assigned fixed weights. Tests on data from a real system verify the effectiveness and advantages of the method. Both techniques also apply to most sort-merge-based methods for detecting approximately duplicate records, and improve the efficiency and accuracy of those methods.
Source: Application Research of Computers (《计算机应用研究》), CSCD, Peking University Core Journal, 2013, No. 9, pp. 2736-2739 (4 pages)
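Funding: China Postdoctoral Science Foundation Special Funding Project (201003797); Jiangsu Province Postdoctoral Research Funding Program (0901014B); PLA University of Science and Technology Pre-research Fund (20110604)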
Keywords: data cleaning; approximately duplicate records; variable-step flexible window; dynamic rank adjustment; basic sorted-neighborhood method (SNM)
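
The two ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the rank-sum weighting, the grow-by-one and shrink-by-one window rule, the `difflib`-based field similarity, and every name and default (`rank_sum_weights`, `snm_flexible_window`, `threshold`, `w_min`, `w_max`) are assumptions made for illustration. The paper's rule for adjusting field ranks from record similarity is not described in the abstract and is left out.

```python
from difflib import SequenceMatcher


def rank_sum_weights(ranks):
    """Turn field ranks (1 = most important) into weights that sum to 1.

    Assumption: uses the common rank-sum scheme; the paper's exact
    rank-to-weight conversion is not given in the abstract.
    """
    n = len(ranks)
    scores = {field: n - rank + 1 for field, rank in ranks.items()}
    total = sum(scores.values())
    return {field: score / total for field, score in scores.items()}


def record_similarity(a, b, weights):
    """Weighted sum of per-field string similarities, in [0, 1]."""
    return sum(
        w * SequenceMatcher(None, a[field], b[field]).ratio()
        for field, w in weights.items()
    )


def snm_flexible_window(records, key, weights, threshold=0.85, w_min=2, w_max=6):
    """Sorted-neighborhood pass with a window that grows and shrinks.

    Assumption: the window grows by one after a record matches something
    in its window and shrinks by one otherwise, clamped to [w_min, w_max].
    """
    ordered = sorted(records, key=key)
    window = w_min
    pairs = []
    for i in range(1, len(ordered)):
        found = False
        # A window of size w compares record i with the w - 1 records before it.
        for j in range(max(0, i - (window - 1)), i):
            if record_similarity(ordered[j], ordered[i], weights) >= threshold:
                pairs.append((ordered[j], ordered[i]))
                found = True
        window = min(w_max, window + 1) if found else max(w_min, window - 1)
    return pairs
```

A call such as `snm_flexible_window(records, key=lambda r: r["name"], weights=rank_sum_weights({"name": 1, "city": 2}))` returns the candidate duplicate pairs. Widening the window only where matches cluster is one way to trade missed matches against extra comparisons, which is the trade-off the abstract describes.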
