Research on information extraction methods for historical classics under the threshold of digital humanities (cited by: 4)
Abstract: Digital humanities aims to apply modern computing and network technology to traditional humanities research. Classical Chinese historical classics are an important basis for historical research and study, but because they are written in classical Chinese, which differs considerably from modern vernacular Chinese in both grammar and word meaning, they are hard to read and understand. To address this problem, a method based on a pre-trained model is proposed for extracting knowledge such as entities and relations from historical classics, so that the rich information contained in these texts can be obtained effectively. The model first replaces BERT's original pre-training tasks with multi-level pre-training tasks to capture semantic information more fully. In addition, structures such as convolutional layers and sentence-level aggregation are added on top of the BERT model to further optimize the generated word representations. Then, to address the scarcity of annotated classical Chinese data, a crowdsourcing system for annotating historical classics was built. It was used to obtain high-quality, large-scale entity and relation data and to construct a classical Chinese knowledge extraction dataset, which serves both to evaluate the model and to fine-tune it. Experiments on the constructed dataset and on the GulianNER dataset demonstrate the effectiveness of the proposed model.
Authors: HAN Lifan (韩立帆); JI Zijing (季紫荆); CHEN Zirui (陈子睿); WANG Xin (王鑫) (College of Intelligence and Computing, Tianjin University, Tianjin 300350, China; Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin 300350, China)
Source: Big Data Research (《大数据》), 2022, No. 6, pp. 26-39 (14 pages)
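Funding: Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (No. 2020AAA0108504); National Natural Science Foundation of China (No. 61972275).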
Keywords: historical classics; pre-trained model; information extraction; crowdsourcing mechanism