
Research on the Applicability of Parameter f in Character-based Tagging Approach of Chinese Word Segmentation
Cited by: 3
Abstract: Chinese word segmentation is the first step in Chinese information processing, and its errors are amplified stage by stage in the accuracy and speed of every later step. Improving segmentation accuracy and processing speed has therefore become a research focus in recent years. This paper performs Chinese word segmentation with a conditional random field (CRF) model and quantitatively analyzes the training parameter f of the CRF toolkit to determine how much reducing the number of features affects segmentation accuracy and model size. Closed tests are run on two corpora from the Bakeoff-2005 international Chinese word segmentation evaluation, those of Peking University (PKU) and Microsoft Research Asia (MSRA), and the effect of increasing f (from 1 to 10 or from 1 to 20) is compared across different feature templates. The results show that as f increases, segmentation accuracy varies in proportion to the size of the generated model, and the drop in F-measure is far smaller than the reduction in model size.
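The feature reduction studied in the abstract can be illustrated with a small sketch. It assumes that f acts as a frequency cutoff (a feature is kept only if it occurs at least f times in the training data), which appears to correspond to the `-f` option of `crf_learn` in CRF++. The four-tag B/M/E/S scheme, the two toy templates (`U0`, `U1`), the toy corpus, and the helper names `bmes_tags`, `count_features`, and `apply_cutoff` are all hypothetical choices made for illustration; they are not the paper's templates or data.

```python
from collections import Counter


def bmes_tags(word):
    """Return position tags for the characters of one word.

    Assumed tag set: B(egin), M(iddle), E(nd), S(ingle-character word).
    """
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


def count_features(corpus):
    """Count how often each expanded feature string occurs.

    Toy templates (hypothetical):
      U0 = current character
      U1 = previous character + current character
    """
    counts = Counter()
    for sentence in corpus:
        chars = [ch for word in sentence for ch in word]
        for i, ch in enumerate(chars):
            prev = chars[i - 1] if i > 0 else "<BOS>"
            counts["U0:" + ch] += 1
            counts["U1:" + prev + "/" + ch] += 1
    return counts


def apply_cutoff(counts, f):
    """Keep only the features that occur at least f times."""
    return {feat for feat, n in counts.items() if n >= f}


# Toy pre-segmented corpus: each sentence is a list of words.
corpus = [
    ["我", "爱", "北京"],
    ["我", "爱", "上海"],
    ["北京", "欢迎", "你"],
]

counts = count_features(corpus)
for f in (1, 2, 3):
    kept = apply_cutoff(counts, f)
    print(f"f={f}: {len(kept)} features kept")
```

On this toy corpus the number of surviving features falls quickly as f grows, which is the mechanism by which a larger f shrinks the trained model. Whether accuracy degrades as slowly as the abstract reports depends on the real corpora and templates, which this sketch does not reproduce.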
Source: Journal of Zhengzhou University (Engineering Science), 2011, No. 4, pp. 103-106 (4 pages). Indexed in: CAS, PKU Core Journals.
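Funding: National Natural Science Foundation of China (60875081); Young Backbone Teachers Program for Higher Education Institutions, Education Department of Henan Province (2009GGJS-108)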
Keywords: Chinese word segmentation; character tagging; parameter f (threshold); model size; CRF++ toolkit


