Abstract
Methods for Chinese word segmentation fall into two broad classes, rule-based and statistics-based. The former typically relies on a word graph, recasting segmentation as an optimal-path problem, and its result is usually not unique; the latter builds a statistical model from a corpus, which is computationally more expensive but more accurate. This paper introduces the word graph and the N-gram model, and combines the two to implement a Chinese word segmentation method. The method takes the maximum-probability path in the word graph as the segmentation of a Chinese sentence; this involves counting bigram frequencies over a corpus, and a segmentation dictionary with a multilevel hash structure is designed. Experimental data show that the method performs automatic segmentation effectively.
There are two main approaches to Chinese word segmentation, rule-based and statistics-based: the former usually uses a word graph and the latter a statistical model. Word graphs and N-grams are introduced, and a Chinese word segmentation system is constructed based on both. The system regards the maximum-probability path in the word graph as the segmentation result for a Chinese sentence; bigram frequencies are counted over a corpus, and a word dictionary with a multilevel hash structure is designed. The experimental data show that it can segment Chinese efficiently.
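The abstract's core idea, taking the maximum-probability path through the word graph under a bigram model, can be sketched with dynamic programming over segmentation end positions. A minimal illustration follows; the dictionary entries, bigram counts, and the 4-character word-length cap are all invented for this example, not taken from the paper's corpus or system.

```python
import math

# Illustrative bigram and unigram counts (hypothetical, not from the paper's
# corpus). "<s>" marks the sentence start.
BIGRAM = {("<s>", "研究"): 4, ("<s>", "研究生"): 2, ("研究", "生命"): 1,
          ("生命", "的"): 3, ("的", "起源"): 3}
UNIGRAM = {"<s>": 6, "研究": 5, "研究生": 2, "生命": 3, "命": 1, "的": 6, "起源": 3}
WORDS = set(UNIGRAM) - {"<s>"}  # the segmentation dictionary

def bigram_logp(prev, word):
    # Add-one smoothing so unseen bigrams keep a small non-zero probability.
    return math.log((BIGRAM.get((prev, word), 0) + 1) /
                    (UNIGRAM.get(prev, 0) + len(WORDS)))

def segment(sentence):
    n = len(sentence)
    # best[i] = (log-probability, word list) of the best path covering sentence[:i];
    # each dictionary word spanning positions j..i is an edge of the word graph.
    best = {0: (0.0, [])}
    for i in range(1, n + 1):
        for j in range(max(0, i - 4), i):  # assume words are at most 4 chars
            w = sentence[j:i]
            if w in WORDS and j in best:
                prev = best[j][1][-1] if best[j][1] else "<s>"
                score = best[j][0] + bigram_logp(prev, w)
                if i not in best or score > best[i][0]:
                    best[i] = (score, best[j][1] + [w])
    return best.get(n, (float("-inf"), []))[1]

print(segment("研究生命的起源"))  # → ['研究', '生命', '的', '起源']
```

On this toy data the bigram scores resolve the classic 研究生/研究 ambiguity in favour of 研究 | 生命, which is the kind of disambiguation the maximum-probability path is meant to provide; the real system additionally backs the dictionary lookup with a multilevel hash structure for speed.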
Source
《计算机工程与设计》
CSCD
Peking University Core Journals (北大核心)
2008, No. 24, pp. 6370-6372 (3 pages)
Computer Engineering and Design
Keywords
Chinese word segmentation
word graph
bigram
maximum probability
best path