Research on SikuBERT-Enhanced Entity Recognition of Historical Records from the Perspective of Digital Humanities

Original title: 数字人文视域下SikuBERT增强的史籍实体识别研究

Cited by: 14

Abstract: Classical texts are treasures of China's traditional civilization. Using natural language processing to mine them in depth, and advancing the digitization of ancient Chinese books, is important for promoting the study of history, strengthening cultural confidence, and spreading civilization. Named entity recognition (NER) is a foundational step in natural language processing. Building on the pre-trained models BERT-base, RoBERTa, GuwenBERT, SikuBERT, and SikuRoBERTa, this paper takes the "First Four Histories" (前四史) and Zuo Zhuan (《左传》) as its research corpus and constructs NER tasks for person names, place names, and time expressions. The experiments show that SikuBERT and SikuRoBERTa outperform the baseline models on unpunctuated and small-scale corpora; that linguistic style and corpus size have some effect on model performance; and that the BERT model is better suited to large-scale corpus tasks. The experiments confirm that BERT models pre-trained on the traditional-character Siku Quanshu (《四库全书》) corpus are feasible for NER on classical texts under the pre-training and fine-tuning paradigm. A SikuBERT-based NER software tool for classical texts was also built, providing a reference for further text mining and use of classical texts.
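The NER tasks described in the abstract are typically framed as per-character sequence labeling before a pre-trained model is fine-tuned. The following is a minimal sketch of that data-preparation step, not the authors' pipeline: the abstract does not state the tagging scheme, so the BIO scheme, the label names `PER` and `LOC`, the function names, and the sample sentence are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical span format: (start, end_exclusive, entity_type).
# The abstract names the entity types (person, place, time) but not the
# tag set, so "PER" and "LOC" below are illustrative labels only.
Span = Tuple[int, int, str]


def spans_to_bio(text: str, spans: List[Span]) -> List[str]:
    """Assign one BIO tag per character; characters outside any span get 'O'."""
    tags = ["O"] * len(text)
    for start, end, etype in spans:
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span out of range: {(start, end, etype)}")
        if any(t != "O" for t in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, etype)}")
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Recover entity spans from a BIO tag sequence (inverse of spans_to_bio)."""
    spans: List[Span] = []
    start: Optional[int] = None
    etype: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


# Illustrative unpunctuated sentence, annotated by hand for this sketch
# (not taken from the paper's dataset).
sample_text = "晉侯伐秦"
sample_spans = [(0, 2, "PER"), (3, 4, "LOC")]
sample_tags = spans_to_bio(sample_text, sample_spans)
print(list(zip(sample_text, sample_tags)))
```

Labeling at the character level sidesteps word segmentation, which is convenient for unpunctuated classical text; the resulting tag sequence is the kind of target a token-classification head would be fine-tuned to predict.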

Authors: LIU Jiangfeng (刘江峰); FENG Yutong (冯钰童); WANG Dongbo (王东波); HU Haotian (胡昊天); ZHANG Yiqin (张逸勤)

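Affiliations: College of Information Management, Nanjing Agricultural University (南京农业大学信息管理学院); School of Information Management, Nanjing University (南京大学信息管理学院)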

Source: Library Tribune (《图书馆论坛》), indexed in CSSCI and the Peking University Core Journals list (北大核心), 2022, No. 10, pp. 61-72 (12 pages)

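Funding: Research output of the National Social Science Fund of China Major Project "Research on the Construction and Application of a Cross-Language Knowledge Base of Ancient Chinese Classics" (中国古代典籍跨语言知识库构建及应用研究; Project No. 21&ZD331).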

Keywords: humanities computing; SikuBERT; pre-trained models; historical records; entity recognition

Classification (Chinese Library Classification): G250.7 [Cultural Sciences: Library Science]; G255.1 [Cultural Sciences: Library Science]