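An Approximate Matching Algorithm for Large Scale Lexicons

Original title: 基于多重索引模型的大规模词典近似匹配算法 (An Approximate Matching Algorithm for Large-Scale Lexicons Based on a Multiple-Index Model). Cited by: 5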


Abstract: Approximate lexicon matching is widely used in spelling correction for editors, query correction for search engines, post-processing of optical character recognition (OCR) output, and other applications. Traditional single-index schemes struggle to achieve high recall and high performance at the same time, and the problem becomes more severe as the lexicon grows. This paper presents a multiple-indices scheme for approximate matching against large-scale lexicons. The background lexicon is first partitioned by word length into sub-dictionaries, so that all words in a sub-dictionary have the same length. For each sub-dictionary, one or more of the unigram, bigram, trigram, and quadgram indices are built according to a chosen policy. To find the approximate matches of a user pattern P, the relevant n-gram index chains are searched with P to obtain a candidate set C; the edit distance between P and each word W in C is then computed, which yields the final result set R of all approximate matches of P. For longer patterns, only indices over longer n-grams are used to generate candidates. Because only a very small proportion of the lexicon is examined for each query pattern, the scheme is more efficient than traditional single-index schemes. Experiments show that the multiple-indices scheme greatly reduces the number of candidate matches and thereby speeds up approximate lexicon matching.
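The pipeline described in the abstract (partition by length, index each sub-dictionary by n-grams, filter candidates, verify with edit distance) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the length thresholds in `choose_n`, the class name `MultiIndexLexicon`, and the use of the standard q-gram count filter (a word within k edits of P must contain at least |P| - n + 1 - k·n of P's n-gram occurrences) are all choices made here, since the abstract does not specify the paper's actual index-selection or matching policies.

```python
from collections import defaultdict


def ngrams(word, n):
    """Return the set of contiguous character n-grams of word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def edit_distance(a, b):
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def choose_n(length):
    """Hypothetical policy: longer words are indexed by longer n-grams."""
    if length <= 3:
        return 1
    if length <= 6:
        return 2
    if length <= 10:
        return 3
    return 4


class MultiIndexLexicon:
    def __init__(self, words):
        self.by_length = defaultdict(set)                   # length -> sub-dictionary
        self.index = defaultdict(lambda: defaultdict(set))  # length -> gram -> words
        for w in words:
            self.by_length[len(w)].add(w)
            for g in ngrams(w, choose_n(len(w))):
                self.index[len(w)][g].add(w)

    def candidates(self, pattern, k):
        """Candidate set C: words that pass the length and n-gram count filters."""
        found = set()
        for length in range(max(0, len(pattern) - k), len(pattern) + k + 1):
            if length not in self.by_length:
                continue
            n = choose_n(length)
            # Each edit destroys at most n of the pattern's n-gram occurrences.
            need = len(pattern) - n + 1 - k * n
            if need <= 0:
                # The filter gives no guarantee here; keep the whole sub-dictionary.
                found |= self.by_length[length]
                continue
            counts = defaultdict(int)
            for i in range(len(pattern) - n + 1):
                for w in self.index[length].get(pattern[i:i + n], ()):
                    counts[w] += 1
            found |= {w for w, c in counts.items() if c >= need}
        return found

    def search(self, pattern, k=1):
        """Result set R: candidates whose edit distance to pattern is at most k."""
        hits = ((w, edit_distance(pattern, w)) for w in self.candidates(pattern, k))
        return sorted((w, d) for w, d in hits if d <= k)
```

For example, with a lexicon containing `hello`, `hallo`, `help`, `held`, and `jelly`, calling `MultiIndexLexicon(words).search("hello", 1)` returns `[('hallo', 1), ('hello', 0)]`; the other three words pass the n-gram filter but are rejected by the edit-distance check. Falling back to the whole sub-dictionary when the threshold is non-positive keeps the filter lossless in this sketch, at the cost of more verification work for short patterns or large k.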

Authors: 龚才春 (Gong Caichun), 黄玉兰 (Huang Yulan), 许洪波 (Xu Hongbo), 白硕 (Bai Shuo)

Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; Beijing Computing Center

Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the PKU Core Journals list; 2008, No. 10, pp. 1776-1781 (6 pages)

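Funding: National Basic Research Program of China (973 Program) (2004CB318109, 2007CB311100); National High Technology Research and Development Program of China (863 Program) (2006AA010105, 2007AA01Z416)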

Keywords: pattern matching; approximate matching; multiple indices scheme; large scale lexicon; spelling correction

Classification number: TP301.6 [Automation and Computer Technology: Computer System Architecture]


