摘要
为了从大规模语料中快速提取高频重复模式,以递增n-gram模型为基础,使用散列数据结构提取重复串,并提出了一种基于低频字符和层次剪枝的逐层剪枝算法,用于过滤低频垃圾字串,减少I/O读写次数。在此基础上,应用改进的字串排序算法,使字符串排序可在O(n)时间内完成,从而有效提高重复模式的提取效率。实验表明,该算法是一种有效的重复模式提取算法,其I/O读写次数同语料规模呈线性关系,远小于使用首字符进行语料划分的方法,能快速有效地从规模远大于内存容量的文本语料中提取重复模式,特别适合于大规模语料的高频重复模式提取,对以重复模式为基础的新词识别、术语抽取等具有重要的支撑作用。
To extract high-frequency repeats from large-scale corpus,by using the hash table structure,this paper put forward a hierarchical pruning algorithm based on low-frequency character filtration and Cascade Pruning to filtrate lowfrequency strings and to reduce the times of I/O reading & writing.On this basis,this paper employed the improved string sort algorithm,which can implement string sort in O(n) time complexity,to improve the efficiency of repeat extraction.According to the experimental data,it can be concluded that this algorithm is an effective method of repeat extraction,in which the relationship between the times of I/O reading & writing and the scale of corpus is linear.The algorithm can effectively extract repeats from text corpus whose scale is much larger than that of the computer memory and can better suport the repeat-based applications such as new words identification,term extraction,etc.
出处
《计算机科学》
CSCD
北大核心
2014年第5期270-274,共5页
Computer Science
基金
国家自然科学基金项目(61163045
61263044)
新疆维吾尔自治区高校科研基金(XJEDU2012S29)
新疆师范大学重点学科招标课题(12XSXZ0601)资助
关键词
重复串
散列表
低频字串
逐层剪枝
新词识别
Repeat
Hash table
Low-frequency strings
Hierarchical pruning
New words identification