Abstract
This paper presents a fast algorithm for automatic Chinese word segmentation. It builds on the statistical observation that 75% of Chinese words are two-character words, and introduces the concepts of the two-character word root and the two-character word family, the latter being the set of all Chinese words that share the same first two characters. Words of three or more characters are stored compactly in these families and handled together by variable-length maximum matching, so the scan for long words is confined to a family with a very small vocabulary. This not only speeds up segmentation but also removes the need to preset a maximum word length, which traditional maximum matching algorithms require. The paper also proposes a method that uses two-character word families to detect crossing ambiguities quickly. The resulting segmentation algorithm is concise and fast.
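The abstract gives only the outline of the method, so the following Python sketch is an illustration under stated assumptions rather than the paper's implementation. It groups a toy lexicon by the first two characters of each word, matches the longest family member at each position, and flags a possible crossing ambiguity when a word from another family starts inside the matched word and extends beyond it. The names `build_families`, `segment`, `crossing_candidates`, and the sample `LEXICON` are hypothetical, and the ambiguity check is one plausible reading of "detection by word family", not the published procedure.

```python
from collections import defaultdict

def build_families(lexicon):
    """Group words of two or more characters by their first two characters."""
    families = defaultdict(set)
    for word in lexicon:
        if len(word) >= 2:
            families[word[:2]].add(word)
    # Longest words first, so the first hit is the maximum match.
    return {root: sorted(words, key=len, reverse=True)
            for root, words in families.items()}

def segment(text, families):
    """Variable-length maximum matching restricted to one family per position."""
    result = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: emit a single character
        for word in families.get(text[i:i + 2], ()):
            if text.startswith(word, i):
                match = word
                break
        result.append(match)
        i += len(match)
    return result

def crossing_candidates(text, start, word, families):
    """Assumed check: words that begin inside `word` and run past its end."""
    end = start + len(word)
    hits = []
    for k in range(start + 1, end):
        for rival in families.get(text[k:k + 2], ()):
            if text.startswith(rival, k) and k + len(rival) > end:
                hits.append(rival)
    return hits

# Toy lexicon, invented for illustration only.
LEXICON = ["中国", "中国人", "中国人民", "中国人民银行",
           "人民", "银行", "研究", "研究生", "生命", "起源"]
FAMILIES = build_families(LEXICON)

print(segment("中国人民银行", FAMILIES))
print(segment("研究生命起源", FAMILIES))
print(crossing_candidates("研究生命起源", 0, "研究生", FAMILIES))
```

Because every candidate at a position comes from a single family, no global maximum word length has to be configured: the longest word in the family bounds the scan. In the second sentence the matcher takes 研究生, and the check reports 生命 as an overlapping candidate, which is the kind of case a disambiguation step would then need to resolve.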
Source
Journal of the China Society for Scientific and Technical Information (《情报学报》)
CSSCI
PKU Core Journal (北大核心)
1998, No. 5, pp. 352-357 (6 pages)
Keywords
natural language processing
Chinese
word segmentation algorithm
automatic word segmentation
two-character word family
segmentation ambiguity