Index Structures for Supporting Block Edit Distance (cited by: 3)

Abstract: In approximate string matching, the traditional edit distance is a poor measure of similarity for data such as personal names and addresses, whereas the block edit distance captures the similarity between two such strings well. Supporting block edit distance efficiently is therefore important for approximate string query processing. Because computing the block edit distance between two strings is an NP-complete problem, methods are needed that strengthen filtering and reduce the false-positive rate. This paper proposes SHV-Trie, a novel index structure that supports edit distance with move operations. Based on the properties of move operations, the frequencies of the characters in a string are used to derive a lower bound on the distance, and a corresponding query filtering algorithm is given. To address the large space cost of SHV-Trie, the paper further proposes an index structure with an optimized character ordering and a compressed index structure, together with the corresponding query filtering algorithms. Experimental results and analysis on real data sets show that the proposed index structures filter well and improve query efficiency by reducing false positives.
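The character-frequency idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's SHV-Trie or its exact bound, neither of which the abstract spells out. It assumes unit-cost character insertion, deletion and substitution plus block moves, and the function names `frequency_lower_bound` and `filter_candidates` are hypothetical. A block move never changes how often a character occurs, and each of the other operations adds or removes at most one occurrence on each side, so the larger of the two one-sided frequency differences cannot exceed the distance.

```python
from collections import Counter


def frequency_lower_bound(s: str, t: str) -> int:
    """Lower bound on the edit distance with moves between s and t.

    Assumes unit-cost insert, delete, substitute, and block move.
    Moves leave character counts unchanged, so only the other three
    operations can repair a difference in counts.
    """
    cs, ct = Counter(s), Counter(t)
    # Counter subtraction keeps only the positive differences.
    surplus = sum((cs - ct).values())  # occurrences s has beyond t
    deficit = sum((ct - cs).values())  # occurrences t has beyond s
    return max(surplus, deficit)


def filter_candidates(query: str, strings: list[str], k: int) -> list[str]:
    """Keep only strings whose lower bound does not already exceed k."""
    return [s for s in strings if frequency_lower_bound(query, s) <= k]


if __name__ == "__main__":
    names = ["smith john", "jane doe", "john smyth"]
    print(filter_candidates("john smith", names, 1))
```

Strings that survive this filter are only candidates: the bound can be far below the true distance, so each survivor still needs verification with an actual distance computation.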

Authors: Wang Bin (王斌), Guo Qing (郭庆), Li Zhongbo (李中博), Yang Xiaochun (杨晓春)

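Affiliations: College of Information Science and Engineering, Northeastern University; Key Laboratory of Data Engineering and Knowledge Engineering (Renmin University of China), Ministry of Education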

Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD and the Peking University Core Journals list, 2010, No. 1, pp. 191-199 (9 pages)

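Funding: National Natural Science Foundation of China (60828004, 60973018); Program for New Century Excellent Talents in University, Ministry of Education (NCET-06-0290); Open Project of the Key Laboratory of Data Engineering and Knowledge Engineering (Renmin University of China), Ministry of Education (2008002)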

Keywords: approximate string matching; block edit distance; compression index; NP-complete problem

Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]
