Abstract
In ancient Chinese texts, characters are written continuously, with no explicit marks separating words, which poses considerable obstacles to understanding classical texts and, by extension, to cultural transmission. Automatic word segmentation is a fundamental task in natural language processing, but mainstream approaches require large amounts of manually segmented training data, which are time-consuming and labor-intensive to build and are especially scarce for ancient Chinese. This paper combines non-parametric Bayesian models with BERT (Bidirectional Encoder Representations from Transformers) for ancient Chinese word segmentation, proposing a Multi-Stage Iterative Training (MSIT) procedure. On the Zuozhuan (an ancient Chinese historical chronicle) dataset, the fully unsupervised method achieves an F1 score of 93.28%. With only 500 manually segmented sentences used for weakly supervised training, the F1 score rises to 95.55%, surpassing the previous best result obtained by supervised training on 6/7 of the corpus (about 36,000 sentences). With the same amount of training data as previous work, the method reaches an F1 score of 97.40%, the state-of-the-art result. The method also outperforms conventional sequence labeling approaches, including a supervised BERT tagger, shows good generalization ability in the experiments, and the model and code are publicly available.
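To make the "non-parametric Bayesian" ingredient concrete, below is a minimal, purely illustrative sketch of a Dirichlet-process unigram segmenter with boundary-wise Gibbs sampling, in the general style of Goldwater et al. (2009). It is not the paper's model: all names (DPSegmenter, alpha, p_end) and the toy input strings are invented for illustration, the sampler omits several terms of the full model (e.g., utterance-boundary factors), and how MSIT couples such a segmenter with BERT over multiple stages is detailed only in the paper itself.

```python
import random
from collections import Counter

class DPSegmenter:
    """Illustrative Gibbs sampler for a Dirichlet-process unigram word
    segmentation model (Goldwater-style). Not the paper's exact model."""

    def __init__(self, sentences, alpha=20.0, p_end=0.5, seed=0):
        self.sents = sentences                 # unsegmented character strings
        self.alpha = alpha                     # DP concentration parameter
        self.p_end = p_end                     # geometric word-length parameter of the base measure
        self.chars = sorted({c for s in sentences for c in s})
        self.rng = random.Random(seed)
        # bounds[i][j] == True means a word boundary after character j of sentence i
        self.bounds = [[self.rng.random() < 0.5 for _ in range(len(s) - 1)]
                       for s in sentences]
        self.counts, self.total = Counter(), 0
        for i in range(len(self.sents)):
            for w in self.words(i):
                self.counts[w] += 1
                self.total += 1

    def words(self, i):
        """Read off the current segmentation of sentence i."""
        s, out, start = self.sents[i], [], 0
        for j, b in enumerate(self.bounds[i]):
            if b:
                out.append(s[start:j + 1])
                start = j + 1
        out.append(s[start:])
        return out

    def base(self, w):
        # Base measure P0(w): geometric length times uniform characters.
        return (self.p_end * (1 - self.p_end) ** (len(w) - 1)
                * (1.0 / len(self.chars)) ** len(w))

    def prob(self, w, counts, total):
        # CRP predictive probability of word w given the other word tokens.
        return (counts[w] + self.alpha * self.base(w)) / (total + self.alpha)

    def gibbs_site(self, i, j):
        """Resample the boundary after character j of sentence i."""
        s, b = self.sents[i], self.bounds[i]
        left = max([k + 1 for k in range(j) if b[k]], default=0)
        right = min([k for k in range(j + 1, len(b)) if b[k]], default=len(b)) + 1
        w1, w2, w12 = s[left:j + 1], s[j + 1:right], s[left:right]
        # Remove the word tokens currently spanning this site from the counts.
        for w in ([w1, w2] if b[j] else [w12]):
            self.counts[w] -= 1
            self.total -= 1
        p_split = (self.prob(w1, self.counts, self.total)
                   * self.prob(w2, self.counts + Counter([w1]), self.total + 1))
        p_merge = self.prob(w12, self.counts, self.total)
        b[j] = self.rng.random() < p_split / (p_split + p_merge)
        for w in ([w1, w2] if b[j] else [w12]):
            self.counts[w] += 1
            self.total += 1

    def run(self, iters=50):
        # Return the segmentation from the final sample (no annealing/averaging).
        for _ in range(iters):
            for i in range(len(self.sents)):
                for j in range(len(self.bounds[i])):
                    self.gibbs_site(i, j)
        return [self.words(i) for i in range(len(self.sents))]

# Toy usage with two invented example strings; real use would feed a whole corpus.
seg = DPSegmenter(["郑伯克段于鄢", "公入而赋"], alpha=20.0)
print(seg.run(iters=100))
```

In an MSIT-like setup one would, roughly speaking, use the segmentations produced by such a sampler as pseudo-labels for further training rounds; the exact staging and the role of BERT are as described in the paper, not in this sketch.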
Authors
YU Jingsong (俞敬松), WEI Yi (魏一), ZHANG Yongwei (张永伟), YANG Hao (杨浩)
School of Software and Microelectronics, Peking University, Beijing 100871, China; Institute of Linguistics, Chinese Academy of Social Sciences, Beijing 100732, China; Editorial and Research Center of Confucian Canon, Peking University, Beijing 100871, China
Source
Journal of Chinese Information Processing (《中文信息学报》), 2020, No. 6, pp. 1-8
Indexed in CSCD and the Peking University Core Journals list (北大核心)
Funding
National Natural Science Foundation of China (61876004)
Keywords
word segmentation for ancient Chinese texts
nonparametric Bayesian models
deep learning
unsupervised learning
weakly supervised learning