
Word Segmentation for Ancient Chinese Texts Based on Nonparametric Bayesian Models and Deep Learning (基于非参数贝叶斯模型和深度学习的古文分词研究)

Cited by: 16
Abstract In ancient Chinese texts, characters are written continuously with no explicit boundaries between words, which hinders modern readers' understanding of the texts and, more broadly, the transmission of the culture they carry. Automatic word segmentation is a foundational task in natural language processing. Mainstream segmentation methods need large amounts of manually segmented training data, which is costly to produce and especially hard to obtain for ancient Chinese, and this limits their applicability. This paper combines nonparametric Bayesian models with BERT (Bidirectional Encoder Representations from Transformers) language modeling and proposes Multi-Stage Iterative Training (MSIT) for unsupervised word segmentation. On the Zuozhuan (《左传》, an ancient Chinese history book) dataset, the unsupervised method achieves an F1 score of 93.28%. With only 500 manually segmented sentences added, a weakly supervised setting, the F1 score reaches 95.55%, higher than the previous best result, which was obtained by supervised training on 6/7 of the dataset (about 36,000 sentences). With that same amount of training data, the method reaches an F1 score of 97.40%, which the authors report as the state of the art. Experiments also indicate that the method outperforms traditional sequence labeling approaches, including a BERT model, and generalizes better. The model and related code are available online.
Authors YU Jingsong (俞敬松); WEI Yi (魏一); ZHANG Yongwei (张永伟); YANG Hao (杨浩) (School of Software and Microelectronics, Peking University, Beijing 100871, China; Institute of Linguistics, Chinese Academy of Social Sciences, Beijing 100732, China; Editorial and Research Center of Confucian Canon, Peking University, Beijing 100871, China)
Source Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list (北大核心), 2020, Issue 6, pp. 1-8 (8 pages)
Funding National Natural Science Foundation of China (61876004)
Keywords word segmentation for ancient Chinese texts; nonparametric Bayesian models; deep learning; unsupervised learning; weakly supervised learning
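
The abstract reports its results as F1 scores for word segmentation. As a rough illustration of how such a score is commonly computed, the sketch below compares predicted and gold words by their character spans. It is an assumption about the metric, not the authors' evaluation script: the function names `word_spans` and `segmentation_f1` are hypothetical, and the example sentence with its two segmentations is a toy case, not taken from the paper's dataset.

```python
def word_spans(words):
    """Map a list of words to the set of (start, end) character spans they cover."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(gold_sentences, pred_sentences):
    """Return word-level (precision, recall, F1) over paired segmented sentences.

    A predicted word counts as correct only if both of its boundaries
    match a gold word exactly.
    """
    true_pos = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        if "".join(gold) != "".join(pred):
            raise ValueError("gold and predicted sentences differ in characters")
        gold_spans = word_spans(gold)
        pred_spans = word_spans(pred)
        true_pos += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example (illustrative segmentations, not the paper's annotations).
gold = [["鄭伯", "克", "段", "于", "鄢"]]
pred = [["鄭", "伯", "克", "段", "于", "鄢"]]
precision, recall, f1 = segmentation_f1(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

In this toy case the prediction splits one two-character gold word in two, so four of six predicted words and four of five gold words match. Scoring spans rather than individual boundary tags means an over-segmented word is penalized in both precision and recall.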
