摘要
在串匹配搜索中,字符串常常采用U-不确定串、V-不确定串及其结合的U-V-不确定串.如何识别巨量U-不确定字符串、V-不确定字符串和U-V-不确定字符串,以及两个和两个以上U-V-不确定字符串的交错情况的串匹配,是没有遗漏地检测有害信息的关键问题.本文提出一个快速检测巨量U-不确定字符串、巨量V-不确定字符串和巨量U-V-不确定字符串的多串匹配完全自动机及其快速生成方法,包括两个和两个以上不确定字符串相互交错的情况;并且给出V-不确定字符串的完全自动机的最大并行台数,指出通常正则表达式匹配可能出现相似连接和交错情况的两种遗漏,指出如果没有从整体的角度对U-不确定串中的字符子串集进行两两不相交化及无同源后续奇点化的处理,结果就可能出现错误或者增加状态数目.
Multiple string matching is often completed under the presence of Uor V-uncertain-strings, or a combination of the two. Recognizing large numbers of strings with U-, V-, and U-V-uncertain-strings, including the interleaving of two or more uncertain strings, is the key to the successful detection of harmful information. This paper proposes a complete automaton and its high speed construction algorithm to detect large-scale U-, V-, and U-V-uncertain multiple strings, including two or more uncertain strings interlaced with one another. The maximum of the parallel complete automaton of the V-uncertain string is also reported. Finally, this study reveals that two kinds of pretermissions, a similarly connected pretermission and interlaced string pretermission, may appear in the matching of the regular expressions. The result of this maybe mistake or the number of states in the automaton may be increased, if the intersection of the U-uncertain strings sets, and the "homologous subsequent especial point" in the U-uncertain strings sets, never be eliminated from whole system.
出处
《中国科学:信息科学》
CSCD
2011年第5期552-561,共10页
Scientia Sinica(Informationis)
基金
国家自然科学基金(批准号:60873002)
国家重点基础研究发展计划(批准号:2007CB311100)资助项目