
Bi-Direction Matching Chinese Word Segmentation Based on N-gram Statistical Model

Cited by: 12
Abstract: Basic words express the core information in Chinese text more directly than compound words and are better suited to downstream text mining. On this basis, a bi-directional matching Chinese word segmentation method based on the N-gram model is proposed. Using the word-frequency information in the training corpus, an iterative segmentation method for compound words is given to resolve the long-word ambiguity that arises in maximum matching segmentation, and the optimal segmentation sequence is then selected with an N-gram language model. In addition, because the precision P is strongly affected by word length and is therefore not a robust evaluation metric, the total number of characters in correctly segmented words is taken into account and a new evaluation metric, Pn, is proposed, which avoids the influence of word length on the evaluation of segmentation precision. Experiments on two corpora from the international Chinese natural language processing bakeoff organized by SIGHAN show that, compared with the traditional N-gram Chinese word segmentation method, the proposed method improves precision P, recall R, Pn, and F1 while maintaining segmentation efficiency.
Authors: FENG Li-zhou (凤丽洲); YANG Gui-jun (杨贵军); XU Xue (徐雪); XU Yu-hui (徐玉慧). Affiliations: School of Statistics, Tianjin University of Finance and Economics, Tianjin 300222, China; School of Science, Tianjin University of Commerce, Tianjin 300134, China; China United Network Communication Group Co., Ltd., Qingdao Branch, Qingdao 266000, China.
Source: Journal of Applied Statistics and Management (《数理统计与管理》), indexed in CSSCI and the Peking University Core Journals list, 2020, No. 4, pp. 633-643 (11 pages).
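Funding: National Social Science Fund of China, Youth Project (18CTJ008); Tianjin Natural Science Foundation, Youth Project (18JCQNJC69600); National Natural Science Foundation of China, General Program (11471239); National Statistical Science Research Program, Key Projects (2017LZ25, 2017LZ05); National Statistical Science Research, General Project (2018LY50); Tianjin Social Science Planning, Key Project (TJTJ19-001).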
Keywords: N-gram model; segmentation ambiguity; evaluation index; bi-direction matching
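
Illustration: the abstract gives no implementation details, so the following is only a minimal sketch of the general idea of bi-directional maximum matching followed by N-gram selection, not the authors' algorithm. It assumes a bigram model with add-one smoothing, a maximum word length of 4, and a three-sentence toy corpus; all function names, the corpus, and the smoothing choice are hypothetical. It does not cover the paper's iterative compound-word segmentation or the Pn metric, whose definitions are not given here.

```python
import math
from collections import Counter


def train(sentences):
    """Count unigrams and bigrams from pre-segmented sentences (toy model)."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        prev = "<s>"  # sentence-start marker
        unigrams[prev] += 1
        for w in words:
            unigrams[w] += 1
            bigrams[(prev, w)] += 1
            prev = w
    return unigrams, bigrams


def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right longest match; unknown characters become single-character words."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab, max_len=4):
    """Greedy right-to-left longest match, returned in reading order."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def bigram_log_prob(words, unigrams, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability (smoothing is an assumption)."""
    score, prev = 0.0, "<s>"
    for w in words:
        score += math.log((bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size))
        prev = w
    return score


def segment(text, vocab, unigrams, bigrams):
    """Keep whichever of the two candidate segmentations the bigram model prefers."""
    candidates = [forward_max_match(text, vocab), backward_max_match(text, vocab)]
    vocab_size = len(vocab) + 1
    return max(candidates,
               key=lambda ws: bigram_log_prob(ws, unigrams, bigrams, vocab_size))


# Hypothetical toy training corpus (already segmented).
corpus = [
    ["研究", "生命", "的", "起源"],
    ["研究生", "的", "生活"],
    ["生命", "的", "意义"],
]
unigrams, bigrams = train(corpus)
vocab = set(unigrams) - {"<s>"}

print(segment("研究生命的起源", vocab, unigrams, bigrams))
# → ['研究', '生命', '的', '起源']
```

On this input the forward pass greedily takes the long word 研究生 and leaves a stray 命, while the backward pass yields 研究 / 生命; the bigram score prefers the latter. This is the kind of long-word ambiguity the abstract refers to.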