A New Method for Computing String Similarity by Fusing Multiple Edit Distances (cited by: 41)
Abstract For strings that mix Chinese and Western characters, edit distance is computed by treating each Chinese character as a unit equivalent to one Western character. From the perspective of Chinese input methods, the paper further proposes computing edit distance over the Pinyin codes and the Wubi codes of the Chinese characters. Finally, it gives an algorithm that fuses the three edit distances to compute string similarity. Simulation results show that the method improves the recall of approximately duplicate record detection while still achieving high precision.
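The three distances described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not state how distances are normalized or fused, so the normalization `1 - d / max_len` and the equal-weight average are placeholder assumptions. The names `PINYIN`, `WUBI`, `encode`, and `fused_similarity` are hypothetical, and the tiny code tables (in particular the Wubi entries) are illustrative and should be checked against a real code table.

```python
from typing import Dict, Sequence, Tuple


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Hypothetical, deliberately tiny code tables (assumptions for illustration).
PINYIN: Dict[str, str] = {"张": "zhang", "章": "zhang", "三": "san"}
WUBI: Dict[str, str] = {"张": "xt", "章": "ujj", "三": "dg"}


def encode(s: str, table: Dict[str, str]) -> str:
    """Replace each Chinese character by its code; keep other characters."""
    return "".join(table.get(ch, ch) for ch in s)


def fused_similarity(a: str, b: str,
                     weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)
                     ) -> float:
    """Weighted average of the three similarities (assumed fusion rule)."""
    sims = (
        similarity(a, b),  # each Chinese character counts as one unit
        similarity(encode(a, PINYIN), encode(b, PINYIN)),  # Pinyin codes
        similarity(encode(a, WUBI), encode(b, WUBI)),      # Wubi codes
    )
    return sum(w * s for w, s in zip(weights, sims))


if __name__ == "__main__":
    print(edit_distance("张三A", "章三A"))
    print(fused_similarity("张三", "章三"))
```

Because Python strings are indexed by code point, a Chinese character already counts as one unit in the character-level distance. In this sketch, homophones such as 张 and 章 score higher on the Pinyin view than on the character view, which is the kind of near-duplicate the abstract aims to catch.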
Source Application Research of Computers (《计算机应用研究》), CSCD, Peking University Core Journal, 2010, No. 12, pp. 4523-4525 (3 pages)
Funding China Postdoctoral Science Foundation (20090461425); Jiangsu Province Postdoctoral Research Funding Program (0901014B)
Keywords data cleaning; approximately duplicate records; string matching; string similarity; edit distance

