
Research on Deep Learning Based Scientific Dataset Retrieval Method (cited by: 9)
Abstract [Purpose/significance] This study aims to support the data-driven research paradigm, promote the sharing and reuse of scientific data, and improve the effectiveness of dataset retrieval in data repositories and discovery platforms. [Method/process] The scientific dataset retrieval process is divided into two main stages: candidate dataset retrieval and candidate dataset reranking. In the first stage, the BM25 model is combined with a SimCSE-based dense retrieval model to obtain potentially relevant datasets. In the second stage, a BERT ranking model scores the relevance of each candidate dataset, and the ranking of the results is optimized accordingly. [Result/conclusion] Metadata for about 100,000 datasets were collected from 13 humanities and social sciences scientific data repositories in China and abroad for retrieval experiments and evaluation. The results show that the proposed retrieval model performs best: its NDCG@10 score is 23.6 and 11.7 percentage points higher than the baseline vector space model and BM25 model, respectively. Analysis of each model's retrieval results shows that the proposed model has stronger semantic retrieval capability than the baseline models. The model weight settings are also analyzed, which can serve as a reference for parameter settings in practical applications. [Limitations] The model was validated only on English-language humanities and social sciences datasets.
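The two-stage process and the NDCG@10 metric described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the authors' implementation: the abstract does not say how BM25 and SimCSE scores are combined, so the sketch uses min-max normalization and a hypothetical weight `alpha`; the dense vectors stand in for SimCSE embeddings; and `score_fn` stands in for a BERT relevance scorer. The function names are likewise hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized doc for a tokenized query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            # Non-negative IDF variant (as used in Lucene-style BM25).
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def min_max(xs):
    """Rescale scores to [0, 1] so sparse and dense scores are comparable."""
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_candidates(query_tokens, query_vec, docs_tokens, doc_vecs,
                      alpha=0.5, top_k=100):
    """Stage 1 (assumed fusion): weighted sum of normalized BM25 and dense scores."""
    sparse = min_max(bm25_scores(query_tokens, docs_tokens))
    dense = min_max([cosine(query_vec, v) for v in doc_vecs])
    fused = [alpha * s + (1 - alpha) * d for s, d in zip(sparse, dense)]
    order = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return order[:top_k]  # indices of candidate datasets

def rerank(candidates, score_fn):
    """Stage 2: reorder candidates by a relevance scorer (placeholder for BERT)."""
    return sorted(candidates, key=score_fn, reverse=True)

def ndcg_at_k(ranked_rels, k=10):
    """NDCG@k with linear gain; the ideal ranking is built from the given list only."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(ranked_rels[:k]))
    ideal = sorted(ranked_rels, reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

In this sketch, `alpha` plays the role of the kind of model weight the abstract says was analyzed; setting it to 1.0 reduces stage 1 to plain BM25, and 0.0 to dense retrieval only.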
Source: Information Studies: Theory & Application (《情报理论与实践》), indexed in CSSCI and the Peking University Core Journals list, 2022, No. 7, pp. 49-56 (8 pages)
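Funding: An outcome of the National Social Science Fund of China Key Project "Research on Key Issues and Platform Construction for Unified Discovery of Open Scientific Datasets" (project no. 20ATQ007).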
Keywords: information retrieval; dataset search; scientific data; neural network; learning to rank; BERT; SimCSE
