A Data-driven Approach to Studying Changing Vocabularies in Historical Newspaper Collections (Chinese title: 基于数据驱动方法的历史报纸词汇变化研究)
Abstract: Nation and nationhood are among the most frequently studied concepts in the field of intellectual history. At the same time, the word 'nation' and its historical usage are very vague. The aim of this article is to develop a data-driven method that uses dependency parsing and neural word embeddings to clarify some of the vagueness in the evolution of this concept. To this end, we propose the following two-step method. First, using linguistic processing, we create a large set of words pertaining to the topic of nation. Second, we train diachronic word embeddings and use them to quantify the strength of the semantic similarity between these words, thereby creating meaningful clusters, which are then aligned diachronically. To illustrate the robustness of the method across languages, time spans, and large datasets, we apply it to the entirety of five historical newspaper archives in Dutch, Swedish, Finnish, and English. To our knowledge, there have so far been no large-scale comparative studies of this kind that purport to grasp long-term developments in as many as four different languages in a data-driven way. A particular strength of the method we describe in this article is that, by design, it is not limited to the study of nationhood: it extends to other research questions and is reusable in different contexts.
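Authors: Simon Hengchen; Ruben Ros; Jani Marjanen; Mikko Tolonen; Fang Huakang (translator)
Source: Digital Humanities Research (《数字人文研究》), 2022, No. 4, pp. 74-92 (19 pages)
Funding: Supported by the European Union's Horizon 2020 research and innovation programme under grant 770299 (NewsEye), with computing resources provided by CSC - IT Center for Science Ltd.; funded by the Swedish Research Council through a project on computational lexical semantic change detection (2019-2022, dnr 2018-01184)
Keywords: digital humanities; data-driven; historical newspapers; vocabulary change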