Abstract
Casting word segmentation as a sequence tagging problem and solving it with a Conditional Random Fields (CRFs) tagger has been widely adopted in recent years. CRFs nevertheless produce some wrong tags. To reduce these tagging errors, this paper proposes a Chinese word segmentation method based on the marginal probabilities generated by CRFs. First, candidate words with high marginal probabilities are extracted from the tagging results. Next, candidate words with low marginal probabilities are recombined into sub-strings. Finally, a reward mechanism built on Forward Maximum Matching (FMM) is introduced to correct the recombined sub-strings. In closed-track tests on the Simplified Chinese corpora SXU and NCC from the Fourth SIGHAN Chinese Word Segmentation Bakeoff, the method achieves F-scores of 96.41% and 94.30%, respectively.
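The pipeline summarized in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the abstract does not specify the reward mechanism, so it is omitted here, and plain forward maximum matching is used in its place. The function names, the `threshold` value, the `max_len` window, and the toy lexicon are all hypothetical. The input is assumed to be a list of (candidate word, marginal probability) pairs already produced by some CRF tagger.

```python
def fmm_segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position, greedily take the
    longest lexicon word; fall back to a single character."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest window first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def resegment_low_confidence(candidates, lexicon, threshold=0.9, max_len=4):
    """Keep candidates whose marginal probability reaches the (hypothetical)
    threshold; merge runs of adjacent low-probability candidates into one
    sub-string and re-segment that sub-string with plain FMM.

    candidates: list of (word, marginal_probability) pairs.
    """
    result = []
    buffer = ""  # accumulates adjacent low-confidence candidates
    for word, prob in candidates:
        if prob >= threshold:
            if buffer:
                result.extend(fmm_segment(buffer, lexicon, max_len))
                buffer = ""
            result.append(word)
        else:
            buffer += word
    if buffer:
        result.extend(fmm_segment(buffer, lexicon, max_len))
    return result


if __name__ == "__main__":
    toy_lexicon = {"喜欢", "分词"}  # hypothetical lexicon
    tagged = [("我", 0.99), ("喜", 0.5), ("欢分", 0.4), ("词", 0.95)]
    print(resegment_low_confidence(tagged, toy_lexicon))
```

In this toy run the two low-probability candidates are merged into one sub-string and re-segmented against the lexicon, while the high-probability candidates pass through unchanged. A fuller version would replace the plain FMM call with a scoring step that rewards dictionary matches, which is where the paper's reward mechanism would presumably sit.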
Source
《中文信息学报》
CSCD
PKU Core (Peking University Core Journals list)
2009, Issue 5, pp. 3-8 (6 pages)
Journal of Chinese Information Processing
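Funding
National High Technology Research and Development Program of China (863 Program) (2006AA012140)
National Natural Science Foundation of China (60673039)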
Keywords
computer application
Chinese information processing
Chinese word segmentation
Conditional Random Fields (CRFs)
marginal probability
Forward Maximum Matching (FMM)
global feature