
Deep Learning Based Automatic Sentence Segmentation and Punctuation Model for Massive Classical Chinese Literature

Cited by: 10
Abstract [Objective] To support the organization and use of digitized ancient Chinese texts, this study defines an annotation scheme and builds a cascaded deep learning model that automatically segments sentences and adds punctuation to ancient Chinese, with the aim of advancing research in the humanities and social sciences. [Methods] A large-scale corpus of classical texts was built from "Siku Quanshu". Automatic sentence segmentation and punctuation were both treated as sequence labeling problems and solved in a cascade. A BERT-LSTM-CRF model first produces sentence segmentation results for unsegmented text. These results are then fed as an additional feature into a multi-feature LSTM-CRF model, which is trained iteratively and outputs the final punctuation labels. The trained models were deployed as an application platform built on the Django framework. [Results] On the large-scale corpus, across the four Siku divisions (Jing, Shi, Zi, and Ji), the proposed method achieved harmonic mean (F) values of 86.41% for automatic sentence segmentation and 90.84% for punctuation. [Limitations] The handling of the punctuation system needs to be refined. [Conclusions] The models used markedly improve performance on both tasks, and the resulting platform is a practical example of digital humanities engineering.
Authors Wang Qian; Wang Dongbo; Li Bin; Xu Chao (College of Information Management, Nanjing Agricultural University, Nanjing 210095, China; Research Center for Correlation of Domain Knowledge, Nanjing Agricultural University, Nanjing 210095, China; College of Literature, Nanjing Normal University, Nanjing 210097, China)
Source Data Analysis and Knowledge Discovery (indexed in CSSCI, CSCD, and the Peking University Core Journals list), 2021, No. 3, pp. 25-34 (10 pages)
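Funding This work is one of the research outputs of the National Natural Science Foundation of China General Program (Grant No. 71673143) and the National Social Science Fund of China Major Program (Grant No. 15ZDB127).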
Keywords Automatic Sentence Segmentation; Digital Humanities; BERT; Ancient Chinese
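
The abstract frames both tasks as sequence labeling and passes the segmentation output to the punctuation model as an extra feature. The sketch below illustrates that data flow only. It is not the authors' implementation: the tag names ("E" for a character that ends a segment, "O" otherwise), the punctuation set, and the helper names `to_labels`, `stage2_features`, and `f1` are all assumptions made for illustration, and no neural model is involved.

```python
# Illustrative only: the tag names and the punctuation set are assumptions,
# not the label scheme used in the paper.
PUNCT = set(",。、;:?!")


def to_labels(punctuated):
    """Turn punctuated text into per-character segmentation and punctuation labels."""
    chars, seg, punct = [], [], []
    for ch in punctuated:
        if ch in PUNCT:
            if chars:  # attach the mark to the preceding character
                seg[-1] = "E"
                punct[-1] = ch
        else:
            chars.append(ch)
            seg.append("O")
            punct.append("O")
    return chars, seg, punct


def stage2_features(chars, seg):
    """Pair each character with its stage-1 segmentation tag as an extra feature."""
    return list(zip(chars, seg))


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


chars, seg, punct = to_labels("学而时习之,不亦说乎?")
features = stage2_features(chars, seg)
```

Here `seg` plays the role of the stage-1 training target and `punct` the stage-2 target. In a real cascade the second model would see predicted segmentation tags rather than gold ones, and `f1` shows the harmonic mean behind the F values quoted in the abstract.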