期刊文献+

基于大规模语料划分的频繁模式查找算法 被引量:1

Algorithm of Frequent Patterns Finding Based on Large Scale Corpus Partition
下载PDF
导出
摘要 频繁模式查找对新词识别、网络舆情监测、生物信息序列检测等领域有很高的应用价值。为处理规模远超出内存的语料,提出了一种实用的频繁模式查找算法。先将语料按后缀首字符划分为多个集合,通过逐条扫描集合数据,搜索出最大化最长公共前缀区间(MLCPI)来完成查找。另外在此基础上提出逐层归并算法,实现查找的同时归并子串。由于进行查找时无需将全部数据导入内存,因此资源消耗较少;各集合间频繁模式查找互不干扰,可采用并行处理加快运行速度。使用4.61G纯文本语料进行了试验,结果表明其内存消耗小于30M,查找速度最快达1.08M/s,能高效地进行子串归并。 Frequent patterns finding is useful for some areas, such as new word recognition, internet public opinion monitoring, bio-information series detection, etc. Considering that corpus size is much larger than memory capacity,we put forward a pragmatic algorithm to find frequent patterns. Firstly, corpus was partitioned into multiple sets based on first character of suffix, and then a concept of maximized longest common prefix interval (MLCPI) was introduced, and by means of searching it while scanning data in sets item by item, we accomplished the finding task. Besides, we proposed hierarchical reduction algorithm (HRA) to reduce sub-string during the finding process on that basis. There is no need to import all data into memory, so it will decrease resource consumption. Moreover, it is found that frequent patterns among sets do not interfere with each other, which will improve the speed while processing paralleled. We used 4. 61 gigabytes plain text as experiment data. The results show that the memory usage is lower than 30 megabytes, and the speed is up to 1.08 megabytes per seconds,and it is able to reduce sub-string efficiently.
出处 《计算机科学》 CSCD 北大核心 2012年第3期149-152,169,共5页 Computer Science
基金 国家863计划重点项目(2006AA010109)资助
关键词 频繁模式 重复串 语料划分 子串归并 Frequent pattern,Repeats,Corpus partition,Sub-string reduction
  • 相关文献

参考文献11

  • 1Hunt E, Atkinson M P, W I R. A database index to large biological sequences[C]//Proc. of the 27th VLDB Conf. Roma, Italy: Springer, 2001 : 139-148.
  • 2Ukkonen E. On-line construction of suffix trees[J]. Algorithmica, 1995,14.
  • 3Manber U, Myers G. Suffix arrays.. A new method for on-line string searches. [J]. SIAM J. Comput. , 1993,22 (5) : 935-948.
  • 4Yamamoto M, Churcht K W. Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus[J].Computational Linguistics, 2001,27 (1) : 1-30.
  • 5Larsson N J, SadakaneK. Fast suffix sorting [R]. Sweden: Dept. of Computer Science, Lund University, 1999.
  • 6Juha Karkkainen P S. Linear Work Suffix Array Construction[J]. Journal of the ACM, 2006,53(6) :918-936.
  • 7Klaus-Bernd S, Stoye J. Suffix Tree Construction and Storage with Limited Main Memory [R]. Germany :University of Bielefeld, 2003.
  • 8Chen Z, Fowler IL Fast construction of generalized suffix trees over a very large alphabet[C]//Proc, of Int Conf on Computing and Cobinatorics. 2003.
  • 9龚才春,贺敏,陈海强,许洪波,程学旗.大规模语料的频繁模式快速发现算法[J].通信学报,2007,28(12):161-166. 被引量:4
  • 10吕学强,张乐,黄志丹,胡俊峰.基于散列技术的快速子串归并算法[J].复旦学报(自然科学版),2004,43(5):948-951. 被引量:4

二级参考文献28

  • 1张民,李生,赵铁军.大规模汉语语料库中任意n的n-gram统计算法及知识获取方法[J].情报学报,1997,16(1):28-35. 被引量:4
  • 2Merkel M, Andersson M. Knowledge-lite extraction of multi-word units with language filters and entropy thresholds[A]. Proceedings of 2000 Conference on User-Oriented Content-Based Text and Image Handling[C]. Paris, France:ACM Press, 2000. 737-746.
  • 3He S,Zhu J. An iterative method for extracting Chinese unknown words[J]. Chinese Journal of Electronics,2001,10(4):461-464.
  • 4Nagao M,Mori S. A new method of n-gram statistics for large number of n and automatic extraction of words and phrases from large text data of Japanese[A]. Proceedings from the 15th International Conference on Computational Linguistics[C]. Kyoto: ACL,1994.
  • 5邹刚.面向Internet的中文新词语检测[D].中国科学院计算技术研究所,2004.
  • 6FISCHER J, HEUN V, KRAMER S. Fast frequent string mining using suffix arrays[A]. Proceedings of the 5^th IEEE International Conference on Data Mining[C]. 2005.609-612.
  • 7BEDATHUR S, HARITSA J R. Search-optimized persistent suffix tree storage for biological applications[A]. Proceedings of 12^th IEEE International Conference on High Performance Computing[C]. 2005.
  • 8WEINER P. Linear pattern matching algorithms[A]. Proceedings of the 14^th Annual IEEE Symposium on Switching and Automata Theory[C]. 1973.1-11.
  • 9MCCREIGHT E M. A space-economical suffix tree construction algorithm[J]. Journal of the ACM, 1976, 23(2):262-272.
  • 10UKKONEN E. On-line construction of suffix-trees, algorith mica[J]. 1995, 14:249-260.

共引文献6

同被引文献8

  • 1Gale W A, Church K W. A Program for Aligning Sentences in Bilingual Corporal[J].Computational Linguistics, 1993, 19 ( 1 ) : 75-90.
  • 2Champollion M X. A Robust Parallel Text Sentence Aligner [C]// Proceedings of LREC-2006: Fifth International Conference on Language Resources and Evaluation. 2006:489-492.
  • 3Moore R C. Fast and Accurate Sentence Alignment of Bilingual Corpora [C] // Proceedings of AMTA. Springer-Verlag, 2002 : 135-144.
  • 4Nirenburg S. Two Approaches of Matching in Example-Based Machine Translation[C]//Proe. TMT-93. Kyoto,Japan, 1993.
  • 5Watt A正则表达式入门经典[M].李松峰,李丽,译.北京:清华大学出版社,2008:156-178.
  • 6郭锐,宋继华,廖敏.基于自动句对齐的相似古文句子检索[J].中文信息学报,2008,22(2):87-91. 被引量:15
  • 7于新,吴健,洪锦玲.基于词典的汉藏句子对齐研究与实现[J].中文信息学报,2011,25(4):57-62. 被引量:10
  • 8李素建,张健,黄雄,白硕,刘群.Semantic Computation in a Chinese Question—Answering System[J].Journal of Computer Science & Technology,2002,17(6):933-939. 被引量:30

引证文献1

二级引证文献8

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部