Application of Unsupervised Word Segmentation Algorithm in New Word Recognition (无监督分词算法在新词识别中的应用)
Cited by: 2
Abstract: In new word recognition, pre-segmenting text with a general-purpose word segmentation tool gives poor accuracy in some domains, because the tool is limited by its training corpus. This paper proposes an improved method to address this problem. The method first performs unsupervised pre-segmentation based on an N-gram language model, then uses word frequency, mutual information, and branch entropy as the main features for new word discovery. It also applies named entity recognition to filter the discovered results and, once the candidate words are obtained, uses grid search to find the optimal hyperparameter combination. Experiments use corpora from four different domains. Under a single shared hyperparameter setting, the accuracy of the top 10% of new words reaches 88.3%, 80.5%, 85.9%, and 91.9%, respectively. The results show that this unsupervised segmentation method is suitable for new word recognition and adapts well across domains.
Authors: JIANG Tao (姜涛); LU Yang (陆阳); ZHANG Jie (张洁); HONG Jian (洪建)
Affiliations: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China; Engineering Research Center of Safety Critical Industry Measure and Control Technology, Ministry of Education, Hefei 230601, China; Information Center, The First Affiliated Hospital of Anhui Medical University, Hefei 230022, China
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2020, No. 4, pp. 888-892 (5 pages)
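Funding: Supported by the Key Project of the Anhui Provincial Department of Education (SK2018A0154) and a special project of the National Key Research and Development Program of China (2016YFC0801804).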
Keywords: new word recognition; mutual information; branch entropy; N-gram language model; Chinese word segmentation
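The abstract names mutual information and branch entropy as the main features for new word discovery, but this record does not give their exact formulas. The following is a minimal sketch of how these two features are commonly computed over character n-grams, not the authors' implementation. The function names (`char_ngram_counts`, `pmi`, `branch_entropy`), the use of the minimum over all binary splits, and the normalization of every n-gram count by the total character count are assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngram_counts(text, max_n):
    """Count every character n-gram of length 1..max_n in the text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def pmi(candidate, counts, total_chars):
    """Pointwise mutual information of a candidate string.

    Assumption: probabilities are approximated as count / total_chars,
    and the score is the minimum PMI over all binary splits.
    """
    p_whole = counts[candidate] / total_chars
    best = math.inf
    for k in range(1, len(candidate)):
        p_left = counts[candidate[:k]] / total_chars
        p_right = counts[candidate[k:]] / total_chars
        best = min(best, math.log(p_whole / (p_left * p_right)))
    return best


def _entropy(neighbor_counts):
    """Shannon entropy (natural log) of a neighbor-character distribution."""
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in neighbor_counts.values())


def branch_entropy(candidate, text):
    """Minimum of the left and right neighbor entropies of the candidate."""
    left, right = Counter(), Counter()
    start = text.find(candidate)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(candidate)
        if end < len(text):
            right[text[end]] += 1
        # Advance by one character so overlapping occurrences are counted.
        start = text.find(candidate, start + 1)
    return min(_entropy(left), _entropy(right))


if __name__ == "__main__":
    toy_text = "xabyabzab"  # hypothetical toy corpus
    counts = char_ngram_counts(toy_text, max_n=2)
    print("frequency:", counts["ab"])
    print("PMI:", pmi("ab", counts, len(toy_text)))
    print("branch entropy:", branch_entropy("ab", toy_text))
```

In a pipeline like the one the abstract describes, a candidate would be kept only if its frequency, mutual information, and branch entropy all exceed thresholds. Those thresholds are the kind of hyperparameters a grid search could tune, and their values are not stated in this record.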