Abstract
From a digital humanities perspective, classical Chinese poetry holds great value but is difficult to analyze at scale. Methods for automatically constructing a knowledge base of classical poetry make it possible to study the poems from a macro perspective and to mine that value. First, based on the concept of the "object image" (物象), the study attempts to extract every objective thing in a poem that may carry emotion, which reduces the complexity of analysis enough to build an automated pipeline. Second, a RoBERTa-BiLSTM-CRF model is built with deep learning methods to extract object images from a poetry corpus. Third, 《全唐诗》 (Quan Tangshi, the Complete Tang Poems) and a selection of Song dynasty poetry are used to verify the feasibility and generalizability of the model. Finally, an object image database of 《全唐诗》 is constructed and the distribution of its object images is given a preliminary analysis. After training on an automatically annotated 《全唐诗》 corpus, the model reaches F1 scores of 89.6%, 93.3% and 93.6% for common nouns, time nouns and place names, respectively. Transferred to a Song dynasty corpus not used in training, the model extracts 4.5 object images per poem and can discover out-of-vocabulary words, which indicates good generalizability and extensibility.
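The abstract reports its results as per-category F1 scores and as an extraction density (object images per poem). As a rough illustration of how such figures are commonly computed for a sequence-labeling model, the sketch below decodes BIO tags into spans and derives both metrics. It is not the authors' code: the BIO scheme, the tag names `n`, `t` and `ns` (common noun, time noun, place name), and all function names are assumptions made for this example.

```python
from typing import List, Tuple

# A span is (start index, end index exclusive, label). Hypothetical format.
Span = Tuple[int, int, str]


def decode_bio(tags: List[str]) -> List[Span]:
    """Turn a per-character BIO tag sequence into labeled spans."""
    spans: List[Span] = []
    start, label = None, None
    # A trailing "O" sentinel flushes a span that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # A stray "I-" tag with no open span is ignored.
    return spans


def span_f1(gold: List[Span], pred: List[Span], label: str) -> float:
    """Exact-match F1 for one label: a span counts only if both
    boundaries and the label agree."""
    g = {s for s in gold if s[2] == label}
    p = {s for s in pred if s[2] == label}
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


def extraction_density(poems_tags: List[List[str]]) -> float:
    """Average number of extracted spans per poem."""
    total = sum(len(decode_bio(tags)) for tags in poems_tags)
    return total / len(poems_tags)


# Toy input: two "poems" tagged character by character (made-up tags).
poem_a = ["B-n", "I-n", "B-n", "O", "O"]
poem_b = ["O", "B-t", "I-t"]

print(decode_bio(poem_a))
print(extraction_density([poem_a, poem_b]))
print(span_f1(decode_bio(poem_a), [(0, 2, "n")], "n"))
```

Exact-match scoring is the stricter of the usual choices: a prediction that overlaps a gold span but misses a boundary earns no credit. The source does not say which matching criterion the reported F1 scores use.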
Authors
Liu Maolin (刘懋霖)
Zhao Meng (赵萌)
Wang Hao (王昊)
Affiliations: School of Information Management, Nanjing University; Jiangsu Key Laboratory of Data Engineering and Knowledge Service
Source
Library Journal (《图书馆杂志》)
PKU Core Journal (北大核心)
2024, Issue 1, pp. 96-108 (13 pages)
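Funding
This work is one of the research outputs of the following projects:
National Natural Science Foundation of China, General Program: "Semantic Parsing and Humanities Computing of China's Intangible Cultural Heritage Texts Driven by Linked Data" (Grant No. 72074108)
Nanjing University, Fundamental Research Funds for the Central Universities: "Semantic Analysis and Knowledge Graph Research on Local Gazetteer Texts for Humanities Computing" (Grant No. 010814370113)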
Keywords
Digital humanities
Ancient poetry
Object image
Deep learning