Abstract
[Objective] This paper proposes a new method for ancient text completion based on pre-trained language models. It uses representations obtained from pre-trained language models at different semantic levels and for simplified and traditional Chinese characters to build a mixture-of-experts system and a simplified-traditional Chinese fusion model. [Methods] We designed a mixture-of-experts model for transmitted texts and a simplified-traditional Chinese fusion model for excavated texts, integrating and exploiting the models' capabilities in each scenario to improve completion performance. [Results] On self-constructed datasets of transmitted texts and excavated texts, the models achieved completion accuracies of 70.14% and 57.13%, respectively. [Limitations] The work relies only on natural language processing. Future work could use multimodal techniques that combine computer vision with natural language processing, integrating image information and semantic information, which may yield better results. [Conclusions] Validated on the constructed datasets of transmitted and excavated texts, the proposed models achieve relatively high accuracy and offer a competitive approach to ancient text completion.
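As a rough illustration of the mixture-of-experts idea named in the abstract, the sketch below combines the candidate-character distributions of several experts using softmax gate weights and picks the highest-scoring character. This is not the authors' implementation: the function names (`softmax`, `mix_experts`, `predict`), the dictionary-based distributions, and the gate scores are all hypothetical, and the abstract does not specify how the gating is actually computed.

```python
import math


def softmax(scores):
    """Turn raw gate scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_probs, gate_scores):
    """Weighted sum of per-expert distributions over candidate characters.

    expert_probs: list of dicts mapping candidate character -> probability
                  (assumed output of each expert; hypothetical format).
    gate_scores:  one raw score per expert (assumed to come from a gate).
    """
    weights = softmax(gate_scores)
    candidates = set()
    for probs in expert_probs:
        candidates.update(probs)
    return {
        char: sum(w * probs.get(char, 0.0) for w, probs in zip(weights, expert_probs))
        for char in candidates
    }


def predict(expert_probs, gate_scores):
    """Return the candidate character with the highest mixed probability."""
    mixed = mix_experts(expert_probs, gate_scores)
    return max(mixed, key=mixed.get)


# Two toy experts that disagree about the missing character.
experts = [{"之": 0.6, "也": 0.4}, {"之": 0.2, "也": 0.8}]
print(predict(experts, [0.0, 0.0]))  # equal gate weights
print(predict(experts, [5.0, 0.0]))  # gate strongly favours the first expert
```

With equal gate scores the two experts are averaged; raising one score shifts the prediction toward that expert's preferred character.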
Authors
李嘉俊
明灿
郭志浩
钱铁云
彭智勇
王晓光
李旭晖
李静
Li Jiajun; Ming Can; Guo Zhihao; Qian Tieyun; Peng Zhiyong; Wang Xiaoguang; Li Xuhui; Li Jing (School of Computer Science, Wuhan University, Wuhan 430072, China; Intellectual Computing Laboratory for Cultural Heritage, Wuhan University, Wuhan 430072, China; School of Information Management, Wuhan University, Wuhan 430072, China; School of History, Wuhan University, Wuhan 430072, China)
Source
《数据分析与知识发现》
EI
CSCD
PKU Core Journal
2024, Issue 5, pp. 59-67 (9 pages)
Data Analysis and Knowledge Discovery
Funding
This work is one of the research outputs of a Major Project of the National Social Science Fund of China (Grant No. 21&ZD334).
Keywords
Digitization of Ancient Books
Pre-trained Language Models
Mixture-of-Experts Systems