
Research on new word recognition based on rules and N-Gram algorithm (cited by: 5)
Abstract: Current word segmentation tools often break text into many single-character fragments, so the segmented output can differ considerably from the original meaning. In addition, because new words are formed with a high degree of freedom, current segmentation methods cannot effectively identify new words that appear on the Internet. Building on the ICTCLAS2016 word segmentation system, a fragment library is constructed using rules derived from the structure of new words. Bi-gram and Tri-gram patterns are used to extract candidate strings from the fragment library, and left and right adjacency entropy is then used to expand and filter these candidates. On this basis, a new word recognition method based on rules and the N-Gram algorithm is proposed. The results show that the accuracy, recall rate and F value of word segmentation all improve when the method is used. The experimental results show that the method can effectively construct the candidate new-word set and improve Chinese word segmentation.
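The abstract does not give the extraction or filtering rules in detail, so the following is only a minimal sketch of the general idea: count Bi-gram and Tri-gram strings over fragment runs, then keep candidates whose left and right adjacency entropy in the raw text are both high. The function names, the `min_count` and `min_entropy` thresholds, and the "minimum of the two entropies" rule are assumptions made for illustration, not the authors' settings, and the expansion step mentioned in the abstract is not modeled.

```python
import math
from collections import Counter


def extract_candidates(fragments, sizes=(2, 3), min_count=2):
    """Count Bi-gram and Tri-gram strings inside each fragment run.

    `fragments` is a hypothetical fragment library: a list of strings,
    each one a run of adjacent single-character segmentation fragments.
    """
    counts = Counter()
    for run in fragments:
        for n in sizes:
            for i in range(len(run) - n + 1):
                counts[run[i:i + n]] += 1
    return {w: c for w, c in counts.items() if c >= min_count}


def entropy(counter):
    """Shannon entropy (natural log) of a Counter of neighbouring characters."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log(c / total) for c in counter.values())


def neighbor_entropies(candidate, corpus):
    """Return (left, right) adjacency entropy of `candidate` in `corpus`."""
    left, right = Counter(), Counter()
    start = corpus.find(candidate)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1
        end = start + len(candidate)
        if end < len(corpus):
            right[corpus[end]] += 1
        start = corpus.find(candidate, start + 1)
    return entropy(left), entropy(right)


def select_new_words(fragments, corpus, min_count=2, min_entropy=0.5):
    """Keep candidates whose smaller adjacency entropy reaches the threshold.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    kept = []
    for word in extract_candidates(fragments, min_count=min_count):
        left, right = neighbor_entropies(word, corpus)
        if min(left, right) >= min_entropy:
            kept.append(word)
    return sorted(kept)
```

A string such as "利给" that is always preceded by the same character has a left adjacency entropy of zero and is dropped, whereas a complete word that occurs in varied contexts has high entropy on both sides and is kept. This is the intuition behind using adjacency entropy as a boundary test.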
Authors: JIANG Ruxia; HUANG Shuiyuan; DUAN Longzhen; LUO Lijuan (School of Information Engineering, Nanchang University, Nanchang 330031, China)
Source: Modern Electronics Technique (《现代电子技术》), Peking University Core Journal, 2019, No. 4, pp. 166-170 (5 pages)
Funding: National Natural Science Foundation of China (61070139); National Natural Science Foundation of China (81460769)
Keywords: new word recognition; N-Gram algorithm; word formation rule; Chinese word segmentation; fragment library; recall rate
