Abstract
The Siku classification system has had a far-reaching influence. To address the difficulty of identifying the correct category of extant ancient books and to provide tools for research in digital humanities, this paper builds an automatic classification model for classical texts covering the 14 categories of the "Zi" (Masters) part of the Siku Quanshu, on the basis of SikuBERT and SikuRoBERTa, pre-trained language models for natural language processing of classical Chinese. The proposed models are compared with the BERT, BERT-wwm, RoBERTa, and RoBERTa-wwm baselines, and both outperform them: the SikuBERT model achieves an overall classification F-score of 90.39%, with an F-score of 98.83% on books in the astronomy and mathematics category. In the automatic category recognition task, the prediction accuracy of SikuRoBERTa reaches 95.30%. The automatic Siku classification system based on the SikuBERT and SikuRoBERTa pre-trained language models can assign classical texts to their "Zi"-part categories, and the classification tool constructed here provides a new way to classify classical texts efficiently and automatically.
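For readers who want a concrete picture of the fine-tuning setup described in the abstract, the following is a minimal sketch, not the authors' released code: it fine-tunes a SikuBERT-style model for 14-way classification with Hugging Face Transformers. The model id "SIKU-BERT/sikubert", the toy training examples, and the hyperparameters are assumptions made for illustration only.

```python
# Minimal sketch of fine-tuning SikuBERT for 14-way "Zi"-part classification.
# Assumptions: model id "SIKU-BERT/sikubert" (swap in "SIKU-BERT/sikuroberta"
# for the RoBERTa variant), toy placeholder data, and illustrative hyperparameters.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import Dataset

MODEL_ID = "SIKU-BERT/sikubert"   # assumed Hugging Face model id
NUM_LABELS = 14                   # the 14 sub-categories of the "Zi" part

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=NUM_LABELS)

# Toy placeholder data: each example is a passage of classical Chinese
# paired with an integer category id (0..13).
train_ds = Dataset.from_dict({
    "text": ["……天文算法類示例文本……", "……醫家類示例文本……"],
    "label": [0, 1],
})

def tokenize(batch):
    # Truncate/pad to BERT's 512-token limit; longer works would first be
    # split into passage-level chunks.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=512)

train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(output_dir="siku-clf",
                         num_train_epochs=3,
                         per_device_train_batch_size=16,
                         learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```

At inference time, the fine-tuned model's argmax over the 14 output logits gives the predicted "Zi"-part category for a passage; evaluation against held-out labels would yield per-category precision, recall, and F-scores of the kind reported in the abstract.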
Authors
胡昊天
张逸勤
邓三鸿
王东波
冯敏萱
刘浏
李斌
HU Haotian; ZHANG Yiqin; DENG Sanhong; WANG Dongbo; FENG Minxuan; LIU Liu; LI Bin
Source
《图书馆论坛》
CSSCI
Peking University Core Journals (北大核心)
2022, No. 12, pp. 138-148 (11 pages)
Library Tribune
Funding
Supported by the Major Project of the National Social Science Fund of China, "Research on the Construction and Application of a Cross-lingual Knowledge Base of Ancient Chinese Classics" (Grant No. 21&ZD331).