Abstract
Automatic word segmentation is a fundamental task in Chinese information processing. This paper first surveys the basic concepts and applications of Chinese word segmentation, together with its main approaches. It then introduces a zero-order Markov model for automatic Chinese word segmentation, constructed from word occurrence probabilities under the maximum-likelihood principle, and analyzes the EM (Expectation-Maximization) algorithm in detail. Finally, a simulation system for Chinese text processing based on this model is presented.
Word segmentation is a basic task of Chinese information processing. In this paper we present a simple probabilistic model of Chinese text based on the occurrence probabilities of words, which can be seen as a zeroth-order hidden Markov model (HMM). We then investigate how the EM algorithm can discover the words and their probabilities from a corpus of unsegmented text, without using a dictionary. The last part presents a simulation system for processing Chinese text.
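The approach described in the abstract can be illustrated with a minimal sketch: treat every substring up to a maximum length as a candidate word with uniform initial probability, then alternate an E-step (forward-backward sums over all possible segmentations to get expected word counts) and an M-step (re-estimating word probabilities by relative frequency). This is an illustrative reconstruction of a unigram-EM segmenter, not the paper's actual implementation; the function name, `max_len`, and iteration count are assumptions. Latin letters stand in for Chinese characters.

```python
from collections import defaultdict

def em_segment(corpus, max_len=3, iters=10):
    """Unsupervised unigram (zero-order) word segmentation via EM."""
    # Initialise: every substring up to max_len is a candidate word
    # with uniform probability (a common EM starting point).
    vocab = set()
    for text in corpus:
        for i in range(len(text)):
            for j in range(i + 1, min(len(text), i + max_len) + 1):
                vocab.add(text[i:j])
    p = {w: 1.0 / len(vocab) for w in vocab}

    for _ in range(iters):
        counts = defaultdict(float)
        for text in corpus:
            n = len(text)
            # E-step: forward sums alpha[i] = total probability of
            # all segmentations of the prefix text[:i].
            alpha = [0.0] * (n + 1)
            alpha[0] = 1.0
            for i in range(1, n + 1):
                for j in range(max(0, i - max_len), i):
                    alpha[i] += alpha[j] * p.get(text[j:i], 0.0)
            # Backward sums beta[i] for the suffix text[i:].
            beta = [0.0] * (n + 1)
            beta[n] = 1.0
            for i in range(n - 1, -1, -1):
                for j in range(i + 1, min(n, i + max_len) + 1):
                    beta[i] += p.get(text[i:j], 0.0) * beta[j]
            z = alpha[n]  # total likelihood of the sentence
            if z == 0.0:
                continue
            # Posterior expected count of each candidate word span.
            for i in range(n):
                for j in range(i + 1, min(n, i + max_len) + 1):
                    w = text[i:j]
                    counts[w] += alpha[i] * p.get(w, 0.0) * beta[j] / z
        # M-step: re-estimate word probabilities by relative frequency.
        total = sum(counts.values())
        p = {w: c / total for w, c in counts.items() if c > 0}
    return p
```

On a toy corpus such as `["abab", "ab", "abab"]`, EM concentrates probability mass on the recurring unit `"ab"`, which is the effect the paper exploits to discover words without a dictionary.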
Source
Journal of System Simulation (《系统仿真学报》), 2002, No. 5, pp. 544-546, 550 (4 pages)
Indexed in: CAS, CSCD
Funding
National Natural Science Foundation of China (Grant No. 69975024)
National Natural Science Foundation of China, Key Project (Grant No. 69931040)