Abstract
Chinese word segmentation is the first step in Chinese information processing, and its errors are amplified stage by stage in the accuracy and processing speed of later steps. Improving the accuracy and speed of segmentation has therefore become a research focus in recent years. This paper uses a conditional random field (CRF) model for Chinese word segmentation and quantitatively analyzes the training parameter f of the CRF toolkit to determine how far reducing the number of features affects segmentation accuracy and model size. Closed tests are run on the Peking University (PKU) and Microsoft Research Asia (MSRA) corpora provided by the international Chinese word segmentation evaluation Bakeoff-2005, and the effect of increasing f (from 1 to 10 or from 1 to 20) on segmentation performance is compared across different feature templates. The results show that, as f increases, segmentation accuracy and the size of the generated model change in the same direction, and that the decrease in F-score is far smaller than the reduction in the size of the trained model.
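The idea of pruning features by a frequency threshold f can be illustrated with a small sketch. It is not the paper's experimental setup: the B/M/E/S tag set, the unigram and bigram feature template, the toy corpus, and all function names are assumptions made for illustration. In CRF++, the corresponding option is `crf_learn -f NUM`, which keeps only features that occur at least NUM times; the code below merely mimics that counting step on character features.

```python
from collections import Counter


def tag_characters(words):
    """Convert a segmented sentence into (character, tag) pairs.

    Assumed tag set: B (begin), M (middle), E (end), S (single-character word).
    """
    pairs = []
    for word in words:
        if len(word) == 1:
            pairs.append((word, "S"))
        else:
            pairs.append((word[0], "B"))
            pairs.extend((ch, "M") for ch in word[1:-1])
            pairs.append((word[-1], "E"))
    return pairs


def extract_features(chars):
    """Hypothetical template: current character plus current/next bigram."""
    feats = []
    for i, ch in enumerate(chars):
        feats.append(("U0", ch))
        if i + 1 < len(chars):
            feats.append(("B01", ch + chars[i + 1]))
    return feats


def count_features(corpus):
    """Count how often each feature string occurs in the training corpus."""
    counts = Counter()
    for words in corpus:
        chars = [ch for ch, _ in tag_characters(words)]
        counts.update(extract_features(chars))
    return counts


def surviving_features(counts, f):
    """Keep only features seen at least f times (the frequency cutoff)."""
    return {feat for feat, n in counts.items() if n >= f}


# Toy corpus (illustrative only, not Bakeoff-2005 data).
corpus = [
    ["我", "喜欢", "北京"],
    ["我", "喜欢", "上海"],
    ["北京", "大学"],
]
counts = count_features(corpus)
sizes = {f: len(surviving_features(counts, f)) for f in (1, 2, 3)}
print(sizes)
```

Raising f removes rare features first, so the feature set (and with it the model) shrinks quickly, while frequent features that carry most of the predictive weight are retained. This is consistent with the abstract's finding that the F-score drops much less than the model size.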
Source
Journal of Zhengzhou University (Engineering Science)
CAS
Peking University Core Journals
2011, No. 4, pp. 103-106 (4 pages)
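Funding
National Natural Science Foundation of China (60875081)
Funding Program for Young Backbone Teachers in Higher Education Institutions, Henan Provincial Department of Education (2009GGJS-108)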
Keywords
Chinese word segmentation
character tagging
parameter f (f threshold)
model size
CRF++ toolkit