基于大规模语料划分的频繁模式查找算法被引量：1

Algorithm of Frequent Patterns Finding Based on Large Scale Corpus Partition

下载PDF

导出

摘要频繁模式查找对新词识别、网络舆情监测、生物信息序列检测等领域有很高的应用价值。为处理规模远超出内存的语料,提出了一种实用的频繁模式查找算法。先将语料按后缀首字符划分为多个集合,通过逐条扫描集合数据,搜索出最大化最长公共前缀区间(MLCPI)来完成查找。另外在此基础上提出逐层归并算法,实现查找的同时归并子串。由于进行查找时无需将全部数据导入内存,因此资源消耗较少;各集合间频繁模式查找互不干扰,可采用并行处理加快运行速度。使用4.61G纯文本语料进行了试验,结果表明其内存消耗小于30M,查找速度最快达1.08M/s,能高效地进行子串归并。 Frequent patterns finding is useful for some areas, such as new word recognition, internet public opinion monitoring, bio-information series detection, etc. Considering that corpus size is much larger than memory capacity,we put forward a pragmatic algorithm to find frequent patterns. Firstly, corpus was partitioned into multiple sets based on first character of suffix, and then a concept of maximized longest common prefix interval （MLCPI） was introduced, and by means of searching it while scanning data in sets item by item, we accomplished the finding task. Besides, we proposed hierarchical reduction algorithm （HRA） to reduce sub-string during the finding process on that basis. There is no need to import all data into memory, so it will decrease resource consumption. Moreover, it is found that frequent patterns among sets do not interfere with each other, which will improve the speed while processing paralleled. We used 4. 61 gigabytes plain text as experiment data. The results show that the memory usage is lower than 30 megabytes, and the speed is up to 1.08 megabytes per seconds,and it is able to reduce sub-string efficiently.

作者丁溪源黄河燕张海军王树梅

机构地区南京理工大学计算机科学与技术学院中国科学院计算机语言信息工程研究中心北京理工大学计算机科学技术学院

出处《计算机科学》 CSCD 北大核心 2012年第3期149-152,169,共5页 Computer Science

基金国家863计划重点项目(2006AA010109)资助

关键词频繁模式重复串语料划分子串归并 Frequent pattern,Repeats,Corpus partition,Sub-string reduction

分类号 TP391 [自动化与计算机技术—计算机应用技术]

引文网络
相关文献

参考文献11

1Hunt E, Atkinson M P, W I R. A database index to large biological sequences[C]//Proc. of the 27th VLDB Conf. Roma, Italy: Springer, 2001 : 139-148.
2Ukkonen E. On-line construction of suffix trees[J]. Algorithmica, 1995,14.
3Manber U, Myers G. Suffix arrays.. A new method for on-line string searches. [J]. SIAM J. Comput. , 1993,22 (5) : 935-948.
4Yamamoto M, Churcht K W. Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus[J].Computational Linguistics, 2001,27 (1) : 1-30.
5Larsson N J, SadakaneK. Fast suffix sorting [R]. Sweden: Dept. of Computer Science, Lund University, 1999.
6Juha Karkkainen P S. Linear Work Suffix Array Construction[J]. Journal of the ACM, 2006,53(6) :918-936.
7Klaus-Bernd S, Stoye J. Suffix Tree Construction and Storage with Limited Main Memory [R]. Germany :University of Bielefeld, 2003.
8Chen Z, Fowler IL Fast construction of generalized suffix trees over a very large alphabet[C]//Proc, of Int Conf on Computing and Cobinatorics. 2003.
9龚才春,贺敏,陈海强,许洪波,程学旗.大规模语料的频繁模式快速发现算法[J].通信学报,2007,28(12):161-166. 被引量：4
10吕学强,张乐,黄志丹,胡俊峰.基于散列技术的快速子串归并算法[J].复旦学报（自然科学版）,2004,43(5):948-951. 被引量：4

二级参考文献28

1张民,李生,赵铁军.大规模汉语语料库中任意n的n-gram统计算法及知识获取方法[J].情报学报,1997,16(1):28-35. 被引量：4
2Merkel M, Andersson M. Knowledge-lite extraction of multi-word units with language filters and entropy thresholds[A]. Proceedings of 2000 Conference on User-Oriented Content-Based Text and Image Handling[C]. Paris, France:ACM Press, 2000. 737-746.
3He S,Zhu J. An iterative method for extracting Chinese unknown words[J]. Chinese Journal of Electronics,2001,10(4):461-464.
4Nagao M,Mori S. A new method of n-gram statistics for large number of n and automatic extraction of words and phrases from large text data of Japanese[A]. Proceedings from the 15th International Conference on Computational Linguistics[C]. Kyoto: ACL,1994.
5邹刚.面向Internet的中文新词语检测[D].中国科学院计算技术研究所,2004.
6FISCHER J, HEUN V, KRAMER S. Fast frequent string mining using suffix arrays[A]. Proceedings of the 5^th IEEE International Conference on Data Mining[C]. 2005.609-612.
7BEDATHUR S, HARITSA J R. Search-optimized persistent suffix tree storage for biological applications[A]. Proceedings of 12^th IEEE International Conference on High Performance Computing[C]. 2005.
8WEINER P. Linear pattern matching algorithms[A]. Proceedings of the 14^th Annual IEEE Symposium on Switching and Automata Theory[C]. 1973.1-11.
9MCCREIGHT E M. A space-economical suffix tree construction algorithm[J]. Journal of the ACM, 1976, 23(2):262-272.
10UKKONEN E. On-line construction of suffix-trees, algorith mica[J]. 1995, 14:249-260.

共引文献6

1贺敏,王丽宏,杜攀,张瑾,程学旗.基于有意义串聚类的微博热点话题发现方法[J].通信学报,2013,34(S1):256-262. 被引量：12
2周浪,史树敏,冯冲,黄河燕.基于多策略融合的中文术语抽取方法[J].情报学报,2010,29(3):460-467. 被引量：28
3周浪,冯冲,黄河燕,王平尧.一种基于独立性统计的子串归并算法[J].计算机工程与应用,2010,46(24):129-131. 被引量：1
4张海军,刘战东,木妮娜.基于逐层剪枝的中文高频重复模式快速提取算法[J].计算机科学,2014,41(5):270-274.
5侯婷,吕学强,李卓.专利术语抽取的层次过滤方法[J].现代图书情报技术,2015(1):24-30. 被引量：6
6张海军,李勇,闫琪琪.一种基于海量语料的网络热点新词识别方法[J].计算机工程与应用,2015,51(5):208-213. 被引量：6

同被引文献8

1Gale W A, Church K W. A Program for Aligning Sentences in Bilingual Corporal[J].Computational Linguistics, 1993, 19 ( 1 ) : 75-90.
2Champollion M X. A Robust Parallel Text Sentence Aligner [C]// Proceedings of LREC-2006: Fifth International Conference on Language Resources and Evaluation. 2006:489-492.
3Moore R C. Fast and Accurate Sentence Alignment of Bilingual Corpora [C] // Proceedings of AMTA. Springer-Verlag, 2002 : 135-144.
4Nirenburg S. Two Approaches of Matching in Example-Based Machine Translation[C]//Proe. TMT-93. Kyoto,Japan, 1993.
5Watt A正则表达式入门经典[M].李松峰,李丽,译.北京:清华大学出版社,2008:156-178.
6郭锐,宋继华,廖敏.基于自动句对齐的相似古文句子检索[J].中文信息学报,2008,22(2):87-91. 被引量：15
7于新,吴健,洪锦玲.基于词典的汉藏句子对齐研究与实现[J].中文信息学报,2011,25(4):57-62. 被引量：10
8李素建,张健,黄雄,白硕,刘群.Semantic Computation in a Chinese Question—Answering System[J].Journal of Computer Science & Technology,2002,17(6):933-939. 被引量：30

引证文献1

1马创新,陈小荷,曲维光.注疏文献中的注释语句自动分析[J].计算机科学,2012,39(10):220-223. 被引量：8

二级引证文献8

1马创新,陈小荷.基于XML的《论语》与其注疏文献对齐语料库的知识表示[J].图书情报知识,2013,30(1):107-113. 被引量：10
2马创新,陈小荷,曲维光.经典古籍注疏文献的知识网络研究与设计[J].图书情报工作,2013,57(9):124-128. 被引量：7
3孙小双.一种新的基于注释内容的学习者评价方法研究[J].软件导刊.教育技术,2013,12(12):73-75.
4顾磊,赵阳.古籍智能整理研究现状及存在的问题[J].图书馆学研究,2016(9):54-58. 被引量：4
5马创新,陈小荷.基于本体和XML的注疏文献的结构化知识表示[J].图书馆杂志,2017,36(8):62-68. 被引量：6
6邓三鸿,胡昊天,王昊,王东波.古文自动处理研究现状与新时代发展趋势展望[J].科技情报研究,2021,3(1):1-20. 被引量：27
7李惠,陈涛,侯君明,刘丁,朱庆华,刘炜.钩玄提要——古籍目录智能分析工具构建[J].中国图书馆学报,2021,47(4):97-112. 被引量：13
8李明杰,郑翔,关思雨.数字人文在古籍整理领域的应用路径及发展策略[J].中国图书馆学报,2024,50(4):21-38.

1龚才春,贺敏,陈海强,许洪波,程学旗.大规模语料的频繁模式快速发现算法[J].通信学报,2007,28(12):161-166. 被引量：4
2刘克强,沈映梅.《三国演义》中英文平行语料库的研制与创建[J].红河学院学报,2013,11(2):34-39.
3任飞亮,吕学强,吴宏林,姚天顺.基于统计的锚点词候选集的获取[J].计算机工程与应用,2003,39(32):55-57.
4翟素兰,郑诚,乐毅.用改进的权值树进行入侵检测[J].微机发展,2004,14(12):62-64. 被引量：2
5崔宇,田志宏,张宏莉,方滨兴.基于前缀区间集合的IPv6路由查找算法[J].通信学报,2013,34(6):29-37. 被引量：3
6周浪,冯冲,黄河燕,王平尧.一种基于独立性统计的子串归并算法[J].计算机工程与应用,2010,46(24):129-131. 被引量：1
7张海军,刘战东,木妮娜.基于逐层剪枝的中文高频重复模式快速提取算法[J].计算机科学,2014,41(5):270-274.
8赵国锋,李振强,黄小红,马严.基于前缀区间和前缀层次(PRPL)的IPv6路由查找算法[J].高技术通讯,2008,18(4):345-349. 被引量：1
9吕学强,张乐,黄志丹,胡俊峰.基于散列技术的快速子串归并算法[J].复旦学报（自然科学版）,2004,43(5):948-951. 被引量：4
10陈卓,谭志欢.无线传感器网络中基于路径序列检测的安全机制[J].计算机应用,2015,35(3):732-735. 被引量：3

计算机科学

2012年第3期

浏览历史

内容加载中请稍等...

基于大规模语料划分的频繁模式查找算法被引量：1

参考文献11

二级参考文献28

共引文献6

同被引文献8

引证文献1

二级引证文献8

相关作者

相关机构

相关主题

浏览历史

基于大规模语料划分的频繁模式查找算法 被引量：1

参考文献11

二级参考文献28

共引文献6

同被引文献8

引证文献1

二级引证文献8

相关作者

相关机构

相关主题

浏览历史

基于大规模语料划分的频繁模式查找算法被引量：1