Research on SikuBERT-enhanced Entity Recognition of Historical Records from the Perspective of Digital Humanities
Abstract: Classical texts are treasures of traditional Chinese civilization. Using natural language processing to mine them in depth and advancing the digitization of ancient Chinese books is of great significance for promoting the study of history, strengthening cultural confidence, and spreading civilization. Named entity recognition (NER) is a foundational step in natural language processing. Based on the pre-trained models BERT-base, RoBERTa, GuwenBERT, SikuBERT, and SikuRoBERTa, this paper uses the "First Four Histories" and Zuo Zhuan as its research corpus and constructs NER tasks for person names, place names, and time expressions. The experimental results show that SikuBERT and SikuRoBERTa outperform the baseline models on unpunctuated and small-scale corpora; that linguistic style and corpus size have some influence on model performance; and that the BERT model is better suited to large-scale corpus tasks. The experiments verify the feasibility of using BERT models pre-trained on the traditional Chinese corpus of Siku Quanshu for NER on classical texts under the pre-training and fine-tuning paradigm. An NER software tool for classical texts based on SikuBERT was also built, providing a reference for further mining and use of classical texts.
Authors: 刘江峰 (LIU Jiangfeng); 冯钰童 (FENG Yutong); 王东波 (WANG Dongbo); 胡昊天 (HU Haotian); 张逸勤 (ZHANG Yiqin)
Source: Library Tribune (《图书馆论坛》), CSSCI, Peking University Core Journal, 2022, No. 10, pp. 61-72 (12 pages)
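Funding: Research output of the National Social Science Fund of China Major Project "Research on the Construction and Application of a Cross-lingual Knowledge Base of Ancient Chinese Classics" (Grant No. 21&ZD331).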
Keywords: humanities computing; SikuBERT; pre-trained models; historical records; entity recognition
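Illustration: The abstract describes recognizing person names, place names, and time expressions in classical texts. The abstract does not state how entities are labeled, so the following is only a minimal sketch, assuming a character-level BIO tagging scheme, which is one common setup for this kind of task. The tag names `PER`, `LOC`, and `TIME`, the function `decode_bio`, and the example labels are all hypothetical and are not taken from the paper or its dataset. The sketch shows how per-character tags, such as those a fine-tuned token-classification model might emit, could be grouped into entity spans.

```python
from typing import List, Tuple

# Hypothetical tag inventory (an assumption, not from the paper):
# B-PER / I-PER for person names, B-LOC / I-LOC for place names,
# B-TIME / I-TIME for time expressions, and O for everything else.

Span = Tuple[str, str, int, int]  # (entity_text, entity_type, start, end)


def decode_bio(chars: str, tags: List[str]) -> List[Span]:
    """Group per-character BIO tags into entity spans.

    `end` is exclusive, so chars[start:end] is the entity text.
    Stray I- tags with no preceding B- tag are ignored.
    """
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")

    spans: List[Span] = []
    start = None
    etype = None
    # A trailing "O" sentinel flushes an entity that ends at the last character.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((chars[start:i], etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


# Illustrative sentence; the labels below are hand-written assumptions.
sentence = "鄭伯克段于鄢"
labels = ["B-PER", "I-PER", "O", "B-PER", "O", "B-LOC"]

entities = decode_bio(sentence, labels)
for text, kind, begin, end in entities:
    print(kind, text, begin, end)
```

Working at the character level avoids the need for word segmentation, which is convenient for unpunctuated classical text, but whether the authors used this scheme is not stated in the abstract.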