期刊文献+

基于多重索引模型的大规模词典近似匹配算法 被引量:5

An Approximate Matching Algorithm for Large Scale Lexicons
下载PDF
导出
摘要 编辑器的拼写校正、搜索引擎的查询纠正、光学字符识别的结果检查等领域都用到词典近似匹配算法.传统单索引模式很难在高性能的前提下保证高召回率.词典越大问题越严重.提出了大规模词典近似匹配的多重索引模型,首先将背景词典根据单词长度划分为若干子词典,对各子词典按照一定策略建立unigram,bigram,trigram,quadgram中的一种或若干种索引,当查找用户模式P的近似匹配时,根据模式P检索特定N-gram索引链,从而得到候选近似匹配集合C,对C中每一个单词W,计算P与W的编辑距离即可输出P的所有最终匹配结果R.实验表明,基于多重索引模型的词典近似匹配算法能够大幅度减少候选近似匹配结果的数量,从而提高词典近似匹配的速度. Approximate lexicon matching is widely used for spelling correction of editors, query suggestion of search engines, post-processing of optical character recognizers and other applications. It is quite difficult for traditional single index schemes to obtain high recall and high performance at the same time. When the background lexicon becomes large, things go from worse to worst. A multiple-indices scheme is presented to handle this problem. The background dictionary is partitioned into several sub-dictionaries, each of which shares the same word length. Unigram, bigram, trigram and quadgram indices are constructed for each sub-dictionary if needed. As for the input pattern P, appropriate matching policies are deployed to obtain the set C of candidate matches, the edit distance between P and each element W in C is then computed, and the final approximate matches are then engendered. When a longer pattern P is queried, only indices of longer n-gram will be used to engender the candidate matches. What's more, for each query pattern P, only a very small proportion of the input lexicon will be checked, so that this approximate matching scheme is quite efficient than traditional single index schemes. Experiments show that the number of candidate matches is much less so that the matching speed is much more promising.
出处 《计算机研究与发展》 EI CSCD 北大核心 2008年第10期1776-1781,共6页 Journal of Computer Research and Development
基金 国家“九七三”重点基础研究发展规划基金项目(2004CB318109,2007CB311100) 国家“八六三”高技术研究发展计划基金项目(2006AA010105,2007AA01Z416)~~
关键词 模式匹配 近似匹配 多重索引模型 大规模词典 拼写检查 pattern matching approximate matching multiple indices scheme large scale lexicon spelling correction
  • 相关文献

参考文献11

  • 1Zobel J, Dart P. Finding approximate matches in large lexicons [J]. Software Practice and Experience, 1995, 25 (3): 331-345
  • 2Deorowicz S, Ciura M. Correcting spelling errors by modeling their causes[J]. Intternational Journal of Applied Mathematics and Computer Science, 2005, 15(2) : 275-285
  • 3Wilbur W, Kim W, Xie N. Spelling correction in the PubMed search engine[J]. Information Retrieval, 2006, (9) : 543-564
  • 4Strohmaier C, Ringlstetter C, Schulz K, etal. A visual and interactive tool for optimizing lexical post correction of OCR results [C] //Proc of the IEEE Workshop on Document Image Analysis and Recognition. Los Alamitos: IEEE Computer Society, 2003
  • 5Mihov S, Schulz K. Fast approximate search in large dictionaries [J]. Computational Linguistics, 2004, 30 (4): 451-477
  • 6Bunke B. A fast algorithm for finding the nearest neighbor of a word in a dictionary [C]//Proc of the 2nd Int Conf on Document Analysis and Recognition. Los Alamitos: IEEE Computer Society, 1993:632-637
  • 7Wagner R, Fischer M. The string-to-string correction problem[J]. Journal of theACM, 1974, 21(1): 168-173
  • 8Schulz K, Mihov S. Fast string correction with Levenshtein automata [J]. Intternational Journal of Document Analysis and Recognition, 2002, 5(1): 67-85
  • 9Oflazer K. Error-tolerant finite state recognition with applications to morphological analysis and spelling correction [J]. Computational Linguistics, 1996, 22(1) : 73-89
  • 10Mihov S, Koeva S. Precise and efficient text correction using Levenshtein automata, dynamic Web dictionaries and optimized correction models [C] //Proc of the 1st Int Workshop on Proofing Tools and Language Technologies. Patras, Greece: Patras University, 2003

同被引文献33

  • 1范立新.改进的中文近似字符串匹配算法[J].计算机工程与应用,2006,42(34):172-174. 被引量:8
  • 2LIU F,YIN C,LIU S.Regional networked manufacturing system[J].Chinese Journal of Mechanical Engineering,2000,13(Supp):97-103.
  • 3国民经济行业分类与代码(GB/T 4754-2002)[M].北京:中国标准出版社,2007.
  • 4方志坚,张瑞林,童小素.搜索引擎综合分析[J].计算机工程与设计,2007,28(16):4038-4041. 被引量:18
  • 5GNU Aspell. [EB/OL]. [2011-10-11]. http://aspell.net.
  • 6Schulz K, Mihov S. Fast string correction with Levenshtein au- tomata [J]. International Journal of Document Analysis and Recog- nition, 2002, 5(1): 67-85.
  • 7Wagner R A. The String-to-String Correction Problem [J]. Journal of the ACM, 1974, 21(1): 168-173.
  • 8LEVENSHTEIN IV. Binary codes capable of correcting dele- tions, insertions, and reversals [J]. Soviet Physiscs Doklady, 1966, 10(8): 707-710.
  • 9CHANG Y I,CHEN J R,HSU M T. A Hash Trie filter method for approximate string matching in genomic databases[J].Applied Intelligence,2010,(01):21-38.
  • 10BHUKYA R,SOMAYAJULU D V L N. 2-jump DNA search multiple pattern matching algorithm[J].International Journal of Computer Science Issues,2011,(03):320-329.

引证文献5

二级引证文献10

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部