Abstract
The performance of regular expression search algorithms is proportional to Lmin, the length of the shortest path from the initial state to the final state of the nondeterministic finite automaton (NFA), and inversely proportional to the size of Pref(RE), the prefix set of the language denoted by the regular expression. In general, Pref(RE) is large, which makes it difficult to locate occurrences of its elements in the target text. This paper proposes a search algorithm for regular expression sets based on the Bloom Filter. Because the query time of a Bloom Filter is independent of the size of the set it represents, the algorithm can quickly locate occurrences of Pref(RE), so that search speed is not affected by the size of Pref(RE). When multiple Bloom Filters are used in parallel, the algorithm can also indirectly increase Lmin. Analysis and experimental results indicate that the algorithm substantially accelerates regular expression search, particularly for sets of regular expressions: when Lmin is long and Pref(RE) is large, search speed improves by a factor of several to several tens. The algorithm is therefore suitable for fast, large-scale search of multiple regular expressions.
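The prefilter idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the `BloomFilter` class, the `search` function, the bit-array size, the hash construction, and the hand-supplied fixed-length prefix sets are all assumptions made for illustration, whereas the paper derives Pref(RE) from the NFA. The sketch only shows that one Bloom Filter query per text window costs the same regardless of how many prefixes are stored, and that candidate positions are then confirmed by a full regular expression match.

```python
import hashlib
import re


class BloomFilter:
    """Hypothetical minimal Bloom filter; sizes are illustrative defaults."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Double hashing: derive num_hashes bit positions from one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


def search(text, patterns, window):
    """patterns: list of (regex, prefixes), where prefixes is a hand-written
    set of length-`window` strings that every match must start with
    (an assumption standing in for Pref(RE))."""
    bloom = BloomFilter()
    by_prefix = {}
    for pattern, prefixes in patterns:
        compiled = re.compile(pattern)
        for prefix in prefixes:
            bloom.add(prefix)
            by_prefix.setdefault(prefix, []).append(compiled)

    hits = []
    for i in range(len(text) - window + 1):
        chunk = text[i:i + window]
        if chunk not in bloom:  # query cost does not grow with the prefix count
            continue
        # The exact lookup discards Bloom filter false positives.
        for compiled in by_prefix.get(chunk, ()):
            match = compiled.match(text, i)
            if match:
                hits.append((i, match.group(0)))
    return hits
```

For example, `search("zzabceee xyyz abd abde", [(r"ab[cd]e+", {"abc", "abd"}), (r"xy+z", {"xyy", "xyz"})], 3)` checks each 3-character window against the filter and runs a regular expression only at the positions that pass.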
Source
《华南理工大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core Journals
2009, No. 4, pp. 37-41 (5 pages)
Journal of South China University of Technology (Natural Science Edition)
Funding
China Postdoctoral Science Foundation (2005037582)
Guangdong-Hong Kong Key Field Breakthrough Project (2005A10307007)