摘要
频繁模式查找对新词识别、网络舆情监测、生物信息序列检测等领域有很高的应用价值。为处理规模远超出内存的语料,提出了一种实用的频繁模式查找算法。先将语料按后缀首字符划分为多个集合,通过逐条扫描集合数据,搜索出最大化最长公共前缀区间(MLCPI)来完成查找。另外在此基础上提出逐层归并算法,实现查找的同时归并子串。由于进行查找时无需将全部数据导入内存,因此资源消耗较少;各集合间频繁模式查找互不干扰,可采用并行处理加快运行速度。使用4.61G纯文本语料进行了试验,结果表明其内存消耗小于30M,查找速度最快达1.08M/s,能高效地进行子串归并。
Frequent patterns finding is useful for some areas, such as new word recognition, internet public opinion monitoring, bio-information series detection, etc. Considering that corpus size is much larger than memory capacity,we put forward a pragmatic algorithm to find frequent patterns. Firstly, corpus was partitioned into multiple sets based on first character of suffix, and then a concept of maximized longest common prefix interval (MLCPI) was introduced, and by means of searching it while scanning data in sets item by item, we accomplished the finding task. Besides, we proposed hierarchical reduction algorithm (HRA) to reduce sub-string during the finding process on that basis. There is no need to import all data into memory, so it will decrease resource consumption. Moreover, it is found that frequent patterns among sets do not interfere with each other, which will improve the speed while processing paralleled. We used 4. 61 gigabytes plain text as experiment data. The results show that the memory usage is lower than 30 megabytes, and the speed is up to 1.08 megabytes per seconds,and it is able to reduce sub-string efficiently.
出处
《计算机科学》
CSCD
北大核心
2012年第3期149-152,169,共5页
Computer Science
基金
国家863计划重点项目(2006AA010109)资助
关键词
频繁模式
重复串
语料划分
子串归并
Frequent pattern,Repeats,Corpus partition,Sub-string reduction