Improving Chinese Word Segmentation Via Unsupervised Learning (original title: 使用无监督学习改进中文分词)

Cited by: 8
Abstract: Out-of-vocabulary (OOV) words cause Chinese word segmentation (CWS) tools to perform poorly on internet corpora. To address this, an unsupervised-learning-based algorithm for improving CWS is proposed. A baseline segmenter first produces a preliminary segmentation of an unlabeled corpus. A model suited to OOV word discovery is then trained on this preliminary segmentation, without supervision, to obtain distributed word representations (word embeddings). Finally, the embeddings are used to discover OOV words greedily and to correct the segmentation results. A dictionary-based string matching model and a character-tagging machine learning model are compared as baselines on both conventional Chinese corpora and an internet corpus. The experimental results show that the proposed algorithm improves CWS performance, with the improvement especially pronounced on the internet corpus: the F-score improves by up to 1.1% on the PKU corpus, by up to 1.2% on the MSR corpus, and by up to 5% on the internet corpus.
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2017, No. 4, pp. 744-748 (5 pages)
Keywords: Chinese word segmentation (CWS); word embeddings; unsupervised learning; out-of-vocabulary (OOV) words; CWS optimization; internet corpus
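
The abstract outlines a three-step pipeline (baseline segmentation, embedding training, greedy correction) but does not state the exact merge criterion. The following is only a minimal sketch of what the final greedy step might look like, under the assumption that adjacent tokens are merged when the cosine similarity of their embeddings exceeds a threshold. The function names, the threshold value, and the toy vectors are all hypothetical and are not taken from the paper; in practice the vectors would come from an embedding model trained on the baseline segmenter's output.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_merge(tokens, vectors, threshold=0.8):
    """Left-to-right greedy pass over a baseline segmentation.

    A token is appended to the previous output word when the embeddings of
    the two adjacent original tokens are similar enough. This criterion is
    an assumption made for illustration, not the paper's stated rule.
    """
    merged = []
    prev_vec = None  # embedding of the previous original token, if known
    for tok in tokens:
        vec = vectors.get(tok)
        similar = (
            prev_vec is not None
            and vec is not None
            and cosine(prev_vec, vec) >= threshold
        )
        if merged and similar:
            merged[-1] += tok  # treat the pair as one candidate OOV word
        else:
            merged.append(tok)
        prev_vec = vec
    return merged


# Toy data (hypothetical): the baseline segmenter has split "微博" in two.
tokens = ["我", "喜欢", "微", "博"]
vectors = {
    "我": [0.0, 1.0],
    "喜欢": [-1.0, 0.2],
    "微": [1.0, 0.1],
    "博": [0.9, 0.2],
}
print(greedy_merge(tokens, vectors))
```

With these toy vectors only the pair "微" and "博" clears the threshold, so the two fragments are rejoined into a single word while the other tokens are left as the baseline produced them. A real system would also need to guard against over-merging frequent neighbouring words, which this sketch does not attempt.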