Abstract
Building on an analysis of the general model of Chinese word segmentation, this paper introduces the concepts of word form probability, word integration coefficient, and word form lattice, and proposes a word-form-based model for segmenting Chinese text. A two-pass segmentation algorithm is implemented that combines backward dynamic programming with forward stack decoding. Because it incorporates word form probability and the word integration coefficient, the model reflects the statistical regularities of word formation and, to some extent, the longest-word-first segmentation principle. Preliminary tests show that the method reaches a segmentation accuracy of 99.6% and a disambiguation rate of 93.44%.
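As a rough illustration of the lattice idea named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical toy lexicon (`WORD_PROB`) with made-up probabilities, scores each path by the sum of log word form probabilities, and runs a backward dynamic programming pass followed by a simple forward read-out of the best path. The paper's forward stack decoding and its integration coefficient are not modeled here; the names `build_lattice`, `segment`, `MAX_WORD_LEN`, and `UNKNOWN_LOG_PROB` are illustrative assumptions.

```python
import math

# Hypothetical toy lexicon: word form -> probability (illustrative values only).
WORD_PROB = {
    "研究": 0.02,
    "研究生": 0.01,
    "生命": 0.02,
    "生": 0.005,
    "命": 0.005,
    "的": 0.05,
    "起源": 0.01,
}
MAX_WORD_LEN = 4  # assumed upper bound on candidate word length
UNKNOWN_LOG_PROB = math.log(1e-8)  # assumed penalty for an out-of-lexicon character


def build_lattice(text):
    """Return lattice[i] = list of (end, log_prob) for candidate words starting at i."""
    lattice = []
    for i in range(len(text)):
        edges = []
        for j in range(i + 1, min(len(text), i + MAX_WORD_LEN) + 1):
            word = text[i:j]
            if word in WORD_PROB:
                edges.append((j, math.log(WORD_PROB[word])))
        if not edges:
            # Fall back to a single unknown character so every position has an edge.
            edges.append((i + 1, UNKNOWN_LOG_PROB))
        lattice.append(edges)
    return lattice


def segment(text):
    """Backward DP over the lattice, then a forward pass that reads off the best path."""
    n = len(text)
    lattice = build_lattice(text)
    best = [float("-inf")] * (n + 1)  # best[i]: best log score from position i to the end
    next_pos = [n] * (n + 1)          # next_pos[i]: end of the best word starting at i
    best[n] = 0.0
    for i in range(n - 1, -1, -1):    # backward scan
        for end, logp in lattice[i]:
            score = logp + best[end]
            if score > best[i]:
                best[i] = score
                next_pos[i] = end
    words, i = [], 0
    while i < n:                      # forward scan along the stored pointers
        words.append(text[i:next_pos[i]])
        i = next_pos[i]
    return words
```

With the toy values above, `segment("研究生命的起源")` prefers 研究 / 生命 / 的 / 起源 over the competing path 研究生 / 命 / 的 / 起源, because the product of the word probabilities along the first path is larger.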
Source
《情报学报》
CSSCI
PKU Core Journal
1999, No. 3, pp. 235-240 (6 pages)
Journal of the China Society for Scientific and Technical Information
Funding
Supported by the National 863 Program
Keywords
Chinese word segmentation
Word form probability
Word integration coefficient
Word form lattice
Information processing