An Improved Dictionary-Free Chinese Word Segmentation Method Based on Suffix Arrays (一种改进的基于后缀数组的无词典分词方法)
Abstract: This paper improves a dictionary-free Chinese word segmentation algorithm based on suffix arrays. The original algorithm builds a suffix array over the input corpus, sorts it lexicographically, and extracts recurring co-occurrence patterns of Chinese characters as the candidate word set; it then filters the candidates by comparing confidence values to obtain the final set of segmented words. The paper improves how the frequency of each candidate word is counted and greatly reduces the number of pairwise checks for a parent-child relationship between candidate words during filtering. Experiments show that the improved algorithm builds and filters the candidate word set faster, without the help of a dictionary. The method is suited to Chinese information processing applications that are sensitive to term frequency and demand high computation speed.
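Authors: Liu Jingcheng (刘京城), Liu Feng (刘锋)
Affiliation: Anhui University
Source: Computer Technology and Development (《计算机技术与发展》), 2011, No. 11, pp. 49-52 (4 pages)
Funding: Natural Science Research Project of the Anhui Provincial Department of Education (KJ2009A60)
Keywords: automatic word segmentation; dictionary-free word segmentation; suffix array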
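
The pipeline summarized in the abstract (sort the suffixes, count recurring character patterns, then filter candidates by confidence) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the frequency-counting improvement or the confidence formula, so the sketch shows only a plain baseline. The function names, the parameters `max_len`, `min_freq`, and `threshold`, and the confidence definition (parent frequency divided by child frequency) are all assumptions made for illustration.

```python
from itertools import groupby


def build_suffix_array(text):
    """Return the start indices of all suffixes of text in lexicographic order."""
    # Naive construction by sorting the suffix strings; adequate for a sketch.
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_candidates(text, max_len=4, min_freq=2):
    """Count substrings of length 2..max_len that occur at least min_freq times.

    In a sorted suffix array, all suffixes sharing the same n-character prefix
    are adjacent, so each frequency is simply the length of a run.
    """
    sa = build_suffix_array(text)
    freq = {}
    for n in range(2, max_len + 1):
        prefixes = (text[i:i + n] for i in sa if i + n <= len(text))
        for gram, run in groupby(prefixes):
            count = sum(1 for _ in run)
            if count >= min_freq:
                freq[gram] = count
    return freq


def filter_candidates(freq, threshold=0.9):
    """Drop a child (substring) that almost always occurs inside a parent.

    Assumed confidence: freq(parent) / freq(child). This is the plain pairwise
    comparison; the paper reports reducing the number of such checks.
    """
    kept = dict(freq)
    for parent, parent_freq in freq.items():
        for child, child_freq in freq.items():
            is_child = len(child) < len(parent) and child in parent
            if is_child and parent_freq / child_freq >= threshold:
                kept.pop(child, None)
    return kept


if __name__ == "__main__":
    sample = "中文分词方法和中文分词算法"
    print(filter_candidates(count_candidates(sample)))
```

On the 13-character sample string, every repeated fragment such as 中文 or 分词 occurs only inside 中文分词, so the filter keeps the single candidate 中文分词 with frequency 2.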