期刊文献+

基于逐层剪枝的中文高频重复模式快速提取算法

Rapid Algorithm of Chinese High-frequency Repeat Extraction Based on Hierarchical Pruning
下载PDF
导出
摘要 为了从大规模语料中快速提取高频重复模式,以递增n-gram模型为基础,使用散列数据结构提取重复串,并提出了一种基于低频字符和层次剪枝的逐层剪枝算法,用于过滤低频垃圾字串,减少I/O读写次数。在此基础上,应用改进的字串排序算法,使字符串排序可在O(n)时间内完成,从而有效提高重复模式的提取效率。实验表明,该算法是一种有效的重复模式提取算法,其I/O读写次数同语料规模呈线性关系,远小于使用首字符进行语料划分的方法,能快速有效地从规模远大于内存容量的文本语料中提取重复模式,特别适合于大规模语料的高频重复模式提取,对以重复模式为基础的新词识别、术语抽取等具有重要的支撑作用。 To extract high-frequency repeats from large-scale corpus,by using the hash table structure,this paper put forward a hierarchical pruning algorithm based on low-frequency character filtration and Cascade Pruning to filtrate lowfrequency strings and to reduce the times of I/O reading & writing.On this basis,this paper employed the improved string sort algorithm,which can implement string sort in O(n) time complexity,to improve the efficiency of repeat extraction.According to the experimental data,it can be concluded that this algorithm is an effective method of repeat extraction,in which the relationship between the times of I/O reading & writing and the scale of corpus is linear.The algorithm can effectively extract repeats from text corpus whose scale is much larger than that of the computer memory and can better suport the repeat-based applications such as new words identification,term extraction,etc.
出处 《计算机科学》 CSCD 北大核心 2014年第5期270-274,共5页 Computer Science
基金 国家自然科学基金项目(61163045 61263044) 新疆维吾尔自治区高校科研基金(XJEDU2012S29) 新疆师范大学重点学科招标课题(12XSXZ0601)资助
关键词 重复串 散列表 低频字串 逐层剪枝 新词识别 Repeat Hash table Low-frequency strings Hierarchical pruning New words identification
  • 相关文献

参考文献6

二级参考文献77

共引文献358

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部