
SikuBERT and SikuRoBERTa: Construction and Application of Pre-trained Models of Siku Quanshu in Orientation to Digital Humanities

Cited by: 39
Abstract: Digital humanities research needs the support of large-scale corpora and high-performance natural language processing tools for classical Chinese. Pre-trained language models for English and modern Chinese have improved text-mining accuracy in their respective fields, and the rise of digital humanities research urgently calls for pre-trained models for the automatic processing of classical Chinese texts. Using the proofread, high-quality full-text corpus of Siku Quanshu as the unsupervised training set, this paper builds the SikuBERT and SikuRoBERTa pre-trained language models for intelligent processing of classical Chinese on the BERT model framework. The experiments then design four validation tasks on the Zuo Zhuan corpus: automatic word segmentation, sentence segmentation and punctuation, part-of-speech tagging, and named entity recognition. SikuBERT and SikuRoBERTa are compared against three baseline models (BERT-base, RoBERTa, and GuwenBERT). The results show that SikuBERT and SikuRoBERTa outperform the baseline pre-trained models on all four downstream validation tasks, indicating that the proposed models have strong generalization ability and a strong ability to learn the lexical, syntactic, and contextual features of classical Chinese. Based on SikuRoBERTa, the model that performed best on the validation tasks, the paper further builds the "SikuBERT intelligent processing platform for classical books". The platform provides online services including automatic processing, retrieval, and automatic translation of classical texts. These services can help scholars in philosophy, literature, history, and other fields who lack a professional background in data mining and deep learning to carry out efficient, multi-dimensional, in-depth, and fine-grained knowledge mining and analysis of classical texts in an intuitive, visual way.
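To make the word segmentation validation task more concrete, the sketch below shows one common way such a task is framed and scored: each character receives a B/M/E/S tag (begin, middle, end, single), and a predicted segmentation is scored by word-level F1 against a gold segmentation. The abstract does not state which tagging scheme or metric the authors used, so the BMES scheme, the F1 computation, the function names, and the example segmentation are all assumptions made for illustration.

```python
from typing import List, Set, Tuple


def words_to_bmes(words: List[str]) -> List[Tuple[str, str]]:
    """Map a segmented sentence to per-character B/M/E/S tags (assumed scheme)."""
    tagged: List[Tuple[str, str]] = []
    for word in words:
        if len(word) == 1:
            tagged.append((word, "S"))  # single-character word
        else:
            tagged.append((word[0], "B"))  # first character of the word
            tagged.extend((ch, "M") for ch in word[1:-1])  # interior characters
            tagged.append((word[-1], "E"))  # last character of the word
    return tagged


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Return the set of (start, end) character offsets covered by each word."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_f1(gold: List[str], pred: List[str]) -> float:
    """Word-level F1: a predicted word counts only if both boundaries match."""
    gold_spans, pred_spans = word_spans(gold), word_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Illustrative segmentation of a Zuo Zhuan heading; not taken from the paper's data.
gold = ["鄭伯", "克", "段", "于", "鄢"]
pred = ["鄭", "伯", "克", "段", "于", "鄢"]
print(words_to_bmes(gold))
print(round(segmentation_f1(gold, pred), 4))
```

In a setup like this, a pre-trained encoder such as SikuBERT would supply the per-character representations from which the tags are predicted; the code above covers only the label encoding and the scoring.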
Authors: WANG Dongbo; LIU Chang; ZHU Zihe; LIU Jiangfeng; HU Haotian; SHEN Si; LI Bin
Source: Library Tribune (《图书馆论坛》), indexed in CSSCI and the Peking University Core Journals list, 2022, No. 6, pp. 30-43 (14 pages)
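Funding: Research output of the National Social Science Fund of China Major Project "Construction and Application of a Cross-Lingual Knowledge Base of Ancient Chinese Classics" (Project No. 21&ZD331) and the Jiangsu Provincial Social Science Fund Youth Project "Knowledge Acquisition and Analysis of Pre-Qin Figures from the Perspective of Humanities Computing" (Project No. 19TQC003).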
Keywords: digital humanities; Siku Quanshu; pre-trained models; deep learning