摘要
中文分词是中文信息处理的重要内容之一。在基于最大匹配和歧义检测的粗分方法获取中文粗分结果集上,根据隐马尔可夫模型标注词性,通过Viterbi算法对每个中文分词的粗分进行词性标注。通过定义最优分词粗分的评估函数对每个粗分的词性标注进行粗分评估,获取最优的粗分为最终分词。通过实验对比,证明基于粗分和词性标注的中文分词方法具有良好的分词效果。
The segmentation of Chinese words from text documents is one of important contents of Chinese information processing. After every segmentation of Chinese words is obtained by the Chinese word rough segmentation by maximum match and ambiguity detection algorithms, each word in every rough segmentation is tagged by Viterbi algorithm according to HMM model of part-of-speech tagging. Each rough segmentation is estimated by the definition of optimal estimation function of part-of-speech tagging, and then the best one is selected as the optimal segmentation. The segmentation presented is better than others by the comparison of experiments.
出处
《计算机工程与应用》
CSCD
北大核心
2015年第6期204-207,265,共5页
Computer Engineering and Applications
基金
国家高新技术研究发展计划(No.2009AA062802)
国家自然科学基金(No.60473125)
中国石油(CNPC)石油科技中青年创新基金(No.05E7013)
国家重大专项子课题(No.G5800-08-ZS-WX)
关键词
分词
词性标注
隐马尔可夫模型
VITERBI算法
word segmentation
part-of-speech tagging
Hidden Markov Model(HMM)
Viterbi algorithm