
Topic Classification of Ancient Texts Based on SWPF2vec and DJ-TextRCNN
Abstract Topic classification methods for ancient book texts that rely mainly on cataloging and rule matching suffer from low efficiency, heavy dependence on expert knowledge, a single basis for classification, and difficulty in automating the process. To address these problems, this study draws on both the content and the character-level features of ancient texts, attempts to derive topics that meet researchers' needs from the content of ancient books, and aims to promote the transformation of the digital humanities research paradigm. First, following the way the Eastern Han dictionary Shuowen Jiezi (Analytical Dictionary of Characters) analyzes characters, a new four-dimensional feature dataset of "pronunciation (shuo) - original text (wen) - structure (jie) - glyph (zi)" is constructed from a previously annotated corpus of ancient books. Second, a four-dimensional feature vector extraction model (speaking, word, pattern, and font to vector; SWPF2vec) is designed and combined with a pre-trained model to obtain fine-grained feature representations of ancient texts. Third, a topic classification model for ancient texts (dianji-recurrent convolutional neural networks for text classification; DJ-TextRCNN) is built by fusing convolutional neural networks, recurrent neural networks, and a multi-head attention mechanism. Finally, the four-dimensional semantic features are integrated to achieve multidimensional, deep, and fine-grained semantic mining of ancient texts. On the ancient text topic classification task, DJ-TextRCNN achieves the best accuracy under each of the feature-dimension settings, reaching 76.23% with the four-dimensional "shuo, wen, jie, zi" features, a preliminary realization of accurate topic classification for ancient book texts.
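The four-view idea in the abstract (pronunciation, original text, structure, glyph) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the three-character lookup tables, the function names, and the use of one-hot vectors with mean pooling. The paper's SWPF2vec uses learned vectors combined with a pre-trained model, which this toy deliberately does not reproduce.

```python
# Toy lookup tables (assumed for illustration; not the paper's dataset).
PRONUNCIATION = {"明": "ming", "好": "hao", "思": "si"}
STRUCTURE = {"明": "left-right", "好": "left-right", "思": "top-bottom"}
GLYPH = {"明": "日月", "好": "女子", "思": "田心"}  # component strings


def build_vocab(values):
    """Map each distinct value to a stable index."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def one_hot(index, size):
    """Return a list of `size` zeros with a 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def make_encoder(chars):
    """Build a function that fuses the four views of a character."""
    views = [
        {c: PRONUNCIATION[c] for c in chars},  # speaking
        {c: c for c in chars},                 # word (original text)
        {c: STRUCTURE[c] for c in chars},      # pattern
        {c: GLYPH[c] for c in chars},          # font
    ]
    vocabs = [build_vocab(view.values()) for view in views]

    def encode_char(c):
        # Concatenate one one-hot segment per view.
        vec = []
        for view, vocab in zip(views, vocabs):
            vec.extend(one_hot(vocab[view[c]], len(vocab)))
        return vec

    return encode_char


def encode_text(text, encode_char):
    """Average the fused character vectors into one text vector."""
    rows = [encode_char(c) for c in text]
    return [sum(col) / len(rows) for col in zip(*rows)]


encode_char = make_encoder("明好思")
text_vector = encode_text("明思", encode_char)
```

Concatenation keeps each view in its own segment of the vector, so characters that share a structure (here 明 and 好) agree on that segment even when their pronunciation and glyph segments differ. A downstream classifier such as the DJ-TextRCNN described above would consume learned counterparts of these fused vectors.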
Authors Wu Shuai (武帅); Yang Xiuzhang (杨秀璋); He Lin (何琳); Gong Zuoquan (公佐权) (College of Information Management, Nanjing Agricultural University, Nanjing 211800; Guizhou Big Data Academy, Guizhou University, Guiyang 550025; School of Cyber Science and Engineering, Wuhan University, Wuhan 430030; School of Information, Guizhou University of Finance and Economics, Guiyang 550025)
Source Journal of the China Society for Scientific and Technical Information (《情报学报》; indexed in CSCD and the Peking University Core Journals list), 2024, No. 5, pp. 601-615 (15 pages)
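Funding National Social Science Fund of China Major Project "Construction of a Knowledge Base of the Pre-Qin Philosophers' Classics and Dictionary Compilation" (22&ZD262); Guizhou Provincial Department of Science and Technology Basic Research Project "Rescue-Oriented Collation of Shui Ethnic Documents and Endangered Shui Script Based on Big Data and Image Recognition" (黔科合基础[2020]1Y279).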
Keywords multi-dimensional feature fusion; ancient texts; topic classification; SWPF2vec; DJ-TextRCNN