基于CRFs边缘概率的中文分词 (cited by: 19)

Chinese Word Segmentation Based on the Marginal Probabilities Generated by CRFs
Abstract: Casting word segmentation as a sequence tagging problem and solving it with a CRF tagger has become a widely used approach in recent years. CRFs nevertheless produce some wrong tags. To reduce these tagging errors, this paper proposes a segmentation method based on the marginal probabilities generated by CRFs. First, candidate words with high marginal probabilities are extracted from the tagging results. Next, candidate words with low marginal probabilities are recombined. Finally, a reward mechanism built on Forward Maximum Matching (FMM) is used to correct the substrings produced by the recombination step. In closed-track tests on the Simplified Chinese corpora SXU and NCC from the fourth SIGHAN Bakeoff, the method achieves F1 scores of 96.41% and 94.30%, respectively.
Source: Journal of Chinese Information Processing (《中文信息学报》), CSCD, PKU Core, 2009, No. 5, pp. 3-8 (6 pages)
Funding: National 863 High-Tech Program of China (2006AA012140); National Natural Science Foundation of China (60673039)
Keywords: computer application; Chinese information processing; Chinese word segmentation; Conditional Random Fields (CRFs); marginal probability; Forward Maximum Matching (FMM); global feature
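
The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how a candidate word's marginal probability is computed, what threshold separates "high" from "low", or how the FMM reward mechanism works. The sketch therefore assumes that B/M/E/S tags and per-character marginals are already available from some CRF toolkit, scores a word by the minimum marginal of its characters, uses a hypothetical threshold of 0.9, and substitutes plain forward maximum matching over a hypothetical lexicon for the reward mechanism.

```python
from typing import List, Sequence, Set, Tuple


def tags_to_words(chars: Sequence[str], tags: Sequence[str],
                  probs: Sequence[float]) -> List[Tuple[str, float]]:
    """Group characters into candidate words using B/M/E/S tags.

    Assumption: a word's confidence is the minimum marginal probability
    of its characters. The paper may define it differently.
    """
    words, buf, conf = [], "", 1.0
    for ch, tag, p in zip(chars, tags, probs):
        buf += ch
        conf = min(conf, p)
        if tag in ("E", "S"):          # a word ends here
            words.append((buf, conf))
            buf, conf = "", 1.0
    if buf:                            # dangling B/M at the end of the input
        words.append((buf, conf))
    return words


def fmm(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Plain forward maximum matching: take the longest lexicon word
    starting at each position, falling back to a single character."""
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in lexicon:
                out.append(piece)
                i += n
                break
    return out


def resegment(chars: Sequence[str], tags: Sequence[str],
              probs: Sequence[float], lexicon: Set[str],
              threshold: float = 0.9) -> List[str]:
    """Keep high-confidence words; merge runs of adjacent low-confidence
    words into one substring and re-segment that substring with FMM."""
    result, pending = [], ""
    for word, conf in tags_to_words(chars, tags, probs):
        if conf >= threshold:
            if pending:                # flush the low-confidence run first
                result.extend(fmm(pending, lexicon))
                pending = ""
            result.append(word)
        else:
            pending += word
    if pending:
        result.extend(fmm(pending, lexicon))
    return result


# Hypothetical tagger output for "研究生命起源" with a wrong word boundary.
example = resegment(
    "研究生命起源",
    ["B", "M", "E", "S", "B", "E"],
    [0.95, 0.60, 0.55, 0.50, 0.98, 0.97],
    {"研究", "生命", "起源"},
)
print(example)
```

In this made-up example the tagger proposes 研究生 / 命 / 起源. The first two candidates fall below the threshold, so they are merged into 研究生命 and re-segmented by FMM, while 起源 is kept as is.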
References (16)

  • 1. Nianwen Xue. Chinese Word Segmentation as Character Tagging [J]. Computational Linguistics and Chinese Language Processing, 2003, 8(1): 29-48.
  • 2. Hai Zhao, Chang-Ning Huang and Mu Li. An Improved Chinese Word Segmentation System with Conditional Random Field [C]// Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing. Sydney, Australia, 2006: 108-117.
  • 3. John Lafferty, Andrew McCallum and Fernando Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data [C]// Proc. of ICML-18. Williams College, USA, 2001: 282-289.
  • 4. Fuchun Peng, Fangfang Feng and Andrew McCallum. Chinese Segmentation and New Word Detection using Conditional Random Fields [C]// COLING 2004. Geneva, Switzerland, 2004: 562-568.
  • 5. 赵海 (Hai Zhao), 揭春雨 (Chunyu Kit). 基于有效子串标注的中文分词 (Chinese Word Segmentation Based on Effective Substring Tagging) [J]. Journal of Chinese Information Processing, 2007, 21(5): 8-13. (cited by: 26)
  • 6. Ruiqiang Zhang, Genichiro Kikui and Eiichiro Sumita. Subword-based Tagging by Conditional Random Fields for Chinese Word Segmentation [C]// HLT/NAACL-2006. New York, USA, 2006: 193-196.
  • 7. Yanxin Shi, Mengqiu Wang. A Dual-layer CRFs Based Joint Decoding Method for Cascaded Segmentation and Labeling Tasks [C]// Proc. of International Joint Conferences on Artificial Intelligence. Hyderabad, India, 2007: 1707-1712.
  • 8. Zhou Jun-sheng, Dai Xin-yu, Ni Rui-yu and Chen Jiajun. A Hybrid Approach to Chinese Word Segmentation around CRFs [C]// Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing. Jeju Island, Korea, 2005: 196-199.
  • 9. Dong Song and Anoop Sarkar. Voting between Dictionary-based and Subword Tagging Models for Chinese Word Segmentation [C]// Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing. Sydney, Australia, 2006: 126-129.
  • 10. Ruiqiang Zhang, Genichiro Kikui and Eiichiro Sumita. Subword-based tagging for confidence-dependent Chinese word segmentation [C]// Proc. of the COLING/ACL on Main conference poster sessions. Sydney, Australia, 2006: 961-968.

Second-level references (5)

Co-citing documents (25)

Co-cited documents (126)

Citing documents (19)

Second-level citing documents (159)
