Intelligent Completion of Ancient Texts Based on Pre-trained Language Models

Abstract: [Objective] This paper proposes a new method for completing ancient texts based on pre-trained language models. It uses representations obtained from pre-trained language models at different semantic levels, and from models for simplified and traditional Chinese characters, to construct a mixture-of-experts system and a simplified-traditional Chinese fusion model. [Methods] We designed a mixture-of-experts model for transmitted texts and a simplified-traditional Chinese fusion model for excavated texts, integrating and exploiting the models' capabilities in each scenario to further improve completion performance. [Results] On self-constructed datasets of transmitted texts and excavated texts, the models achieved completion accuracies of 70.14% and 57.13%, respectively. [Limitations] The study approaches the task only from a natural language processing perspective. Future work could use multimodal techniques that combine computer vision with natural language processing, integrating image information and semantic information, which may yield better results. [Conclusions] Validated on the constructed datasets of transmitted and excavated texts, the proposed models achieve relatively high accuracy and offer a competitive approach to completing ancient texts.
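The abstract does not specify how the mixture-of-experts system combines its experts, so the following is only a generic illustration of the idea, not the authors' architecture. It assumes each expert (for example, a language model working at a different semantic level) outputs a probability distribution over candidate characters for one missing position, and that a softmax gate weights those distributions. The function names, the gate scores, and the toy probabilities are all hypothetical.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw gate scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_probs: List[Dict[str, float]],
                gate_scores: List[float]) -> Dict[str, float]:
    """Weighted sum of the experts' distributions over candidate characters."""
    weights = softmax(gate_scores)
    mixed: Dict[str, float] = {}
    for weight, probs in zip(weights, expert_probs):
        for char, p in probs.items():
            mixed[char] = mixed.get(char, 0.0) + weight * p
    return mixed


def predict(expert_probs: List[Dict[str, float]],
            gate_scores: List[float]) -> str:
    """Return the candidate character with the highest mixed probability."""
    mixed = mix_experts(expert_probs, gate_scores)
    return max(mixed, key=mixed.get)


# Hypothetical toy distributions for one missing character.
expert_a = {"之": 0.6, "也": 0.3, "者": 0.1}  # assumed character-level expert
expert_b = {"之": 0.2, "也": 0.7, "者": 0.1}  # assumed word-level expert

# Equal gate scores give each expert the same weight.
print(predict([expert_a, expert_b], [0.0, 0.0]))
# A higher score for the first expert shifts the decision toward it.
print(predict([expert_a, expert_b], [2.0, 0.0]))
```

In a trained system the gate scores would themselves be produced by a learned network conditioned on the input context; here they are fixed numbers so that the weighting step is easy to follow.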

Authors: Li Jiajun (李嘉俊); Ming Can (明灿); Guo Zhihao (郭志浩); Qian Tieyun (钱铁云); Peng Zhiyong (彭智勇); Wang Xiaoguang (王晓光); Li Xuhui (李旭晖); Li Jing (李静)

Affiliations: School of Computer Science, Wuhan University, Wuhan 430072, China; Intellectual Computing Laboratory for Cultural Heritage, Wuhan University, Wuhan 430072, China; School of Information Management, Wuhan University, Wuhan 430072, China; School of History, Wuhan University, Wuhan 430072, China

Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》; indexed in EI, CSCD, and the Peking University Core Journals list), 2024, Issue 5, pp. 59-67 (9 pages)

Funding: This work is one of the research outcomes of a Major Project of the National Social Science Fund of China (Grant No. 21&ZD334).

Keywords: Digitization of Ancient Books; Pre-trained Language Models; Mixture-of-Experts Systems

Classification codes: G350 [Cultural Science: Information Science]; TP391 [Automation and Computer Technology: Computer Application Technology]