Abstract
This paper improves an automatic, dictionary-free Chinese word segmentation algorithm based on the suffix array. The original algorithm builds a suffix array over the input characters, sorts the suffixes lexicographically, and extracts co-occurrence patterns of Chinese characters to form a candidate word set; the final segmentation set is then obtained by filtering the candidates through confidence comparison. The improved algorithm refines the way candidate word frequencies are counted and greatly reduces the number of pairwise checks for parent-child relationships between candidates during filtering. Experiments show that the improved algorithm builds and filters the candidate word set more quickly without any dictionary, making it well suited to Chinese information processing tasks that are sensitive to word frequency and demand high computational speed.
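To make the candidate-extraction step concrete, the following is a minimal sketch, not the authors' implementation: it builds a lexicographically sorted suffix array over a small corpus, counts the frequency of each character n-gram by binary search over the sorted suffixes, and keeps frequent n-grams as candidate words. The names and thresholds (build_suffix_array, pattern_frequency, MAX_LEN, MIN_FREQ) are illustrative assumptions; the paper's confidence comparison and parent-child filtering are only noted in comments and omitted here.

```python
# Sketch of suffix-array-based candidate word extraction for
# dictionary-free Chinese segmentation (assumed parameters, not the
# original algorithm's exact procedure).

from bisect import bisect_left, bisect_right

MAX_LEN = 4    # assumed upper bound on candidate word length
MIN_FREQ = 2   # assumed frequency threshold for keeping a candidate


def build_suffix_array(text: str) -> list[int]:
    """Return suffix start positions sorted lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def pattern_frequency(text: str, sa: list[int], pattern: str) -> int:
    """Count occurrences of `pattern` by binary search on the suffix array."""
    # Prefixes of the sorted suffixes are themselves in sorted order,
    # so the suffixes starting with `pattern` form one contiguous range.
    prefixes = [text[i:i + len(pattern)] for i in sa]
    return bisect_right(prefixes, pattern) - bisect_left(prefixes, pattern)


def extract_candidates(text: str) -> dict[str, int]:
    """Collect frequent character n-grams (length 2..MAX_LEN) as candidates."""
    sa = build_suffix_array(text)
    candidates: dict[str, int] = {}
    for length in range(2, MAX_LEN + 1):
        for start in range(len(text) - length + 1):
            word = text[start:start + length]
            if word not in candidates:
                freq = pattern_frequency(text, sa, word)
                if freq >= MIN_FREQ:
                    candidates[word] = freq
    # The paper further filters candidates by confidence comparison and by
    # checking parent-child (substring) relationships; that step is omitted.
    return candidates


if __name__ == "__main__":
    corpus = "中文分词是中文信息处理的基础，中文分词方法很多。"
    for word, freq in sorted(extract_candidates(corpus).items(),
                             key=lambda kv: -kv[1]):
        print(word, freq)
```

The paper's reported speedup targets exactly the two hot spots visible in this sketch: how often frequencies are recomputed and how many candidate pairs must be compared during filtering.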
Source
《计算机技术与发展》
2011, No. 11, pp. 49-52 (4 pages)
Computer Technology and Development
Funding
Natural Science Research Project of the Anhui Provincial Department of Education (KJ2009A60)
Keywords
automatic word segmentation
dictionary-free word segmentation
suffix array