期刊文献+

基于匹配区域特征的相似字符串匹配过滤算法 被引量:10

A Filter Algorithm for Approximate String Matching Based on Match-Region Features
下载PDF
导出
摘要 相似字符串匹配过滤算法因其适合大库查找而被广泛应用,为通过提高过滤算法的过滤效率加快匹配速度,提出一种基于匹配区域特征的过滤算法.该算法将模式串和文本串分割成固定长度为kq+1的逻辑块,并从各块中提取了2个新的匹配区域特征:q-gram命中的均匀性和q-gram有效命中的区域性.新算法利用这些新特征优化了传统过滤标准,提高了算法的过滤效率;并改进了QUASAR中基于分块策略的过滤区确定方案.实验结果表明,新算法与改进前相比有效地加快了匹配速度,尤其在误差率较小时改进效果更佳. Approximate string matching is a basic problem in computer science. It is widely used in various areas. The aim of this study is to improve the speed of approximate string matching. Filter algorithm for approximate string matching is discussed because it is suitable for large-scale text searching. A novel filter algorithm based on match-region features is presented. Firstly, a q-gram index is used to preproeess text. Secondly, both pattern and text are logically divided into blocks of fixed size kq+1, and then new match-region features are extracted from blocks, and the algorithm optimizes the fundamental q-gram filtration criterion by the new features. Finally, the improved method of choosing filtration-region based on QUASAR's block addressing is used for filtration. The experimental results demonstrate that the algorithm achieves higher matching speed than that of QUASAR's block addressing by way of improving filtration efficiency. In particular, its matching speed is much faster under low error rate. Experiments also reveal the relationship between matching speed and error rate of new algorithm. These results suggest that the algorithm is useful in a system for approximate string matching with low error rate. It is also powerful for long pattern approximate string matching on the condition of fixed edit distance k.
出处 《计算机研究与发展》 EI CSCD 北大核心 2010年第4期663-670,共8页 Journal of Computer Research and Development
基金 国家"九七三"重点基础研究发展计划基金项目(2006CB303000) 国家自然科学基金重点项目(60736016) 国家自然科学基金项目(60573045 60873198 60973113 60973128) 国家"九七三"重点基础研究发展计划基金前期研究专项项目(2009CB326202) 高等学校博士学科点专项科研基金项目(20050532007)~~
关键词 相似字符串匹配 过滤算法 匹配区域特征 过滤效率 q-gram approximate string matching filter algorithm match-region feature filtration efficiency q-gram
  • 相关文献

参考文献23

  • 1邹旭楷.汉字/字符串编辑距离和编辑路径的有效求解技术[J].计算机研究与发展,1996,33(8):574-580. 被引量:5
  • 2Burkhardt S.Filter algorithms for approximate string matching[D].Saarbrücken,Saarland,Germany:Department of Computer Science,Saarland University,2002.
  • 3Navarro G.A guided tour to approximate string matching[J].ACM Computing Surveys,2001,33(1):31-88.
  • 4Wu Sun,Manber U,Myers G.A sub-quadratic algorithm for approximate limited expression matching[J].Algorithmica,1996,15(1):50-67.
  • 5Baeza-Yates R,Navarro G.A faster algorithm for approximate string matching[C]//Proc of the 7th Annual Symp on Combinatorial Pattern Matching.Berlin:Springer,1996:1-23.
  • 6Navarro G,Baeza-Yates R,Sutinen E,et al.Indexing methods for approximate string matching[J].IEEE Data Engineering Bulletin,2001,24(4):19-27.
  • 7Tian Yuanyuan,Tata S,Hankins R A,et al.Practical methods for constructing suffix trees[J].VLDB Journal,2005,14(3):281-299.
  • 8Gonnet G,Baeza-Yates R,Snider T.Information Retrieval:Data Structures and Algorithms[M].Upper Saddle River,New Jersey:Prentice Hall,1992.
  • 9Lok-Lam C,Cheung D W,Siu-Ming Y.Approximate string matching in DNA sequences[C]//Proc of the 8th Int Conf on Database Systems for Advanced Applications.Piscataway,NJ:IEEE,2003:303-310.
  • 10Navarro G,Baeza-Yates R.A practical q-gram index for text retrieval allowing errors[J/OL].CLEI Electronic Journal,1998,1(2):1-16.http://www.clei.cl/cleiej/papers/vli2p3.ps.gz.

二级参考文献1

共引文献4

同被引文献116

引证文献10

二级引证文献28

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部