Abstract
[Purpose/significance] This study aims to support the data-driven research paradigm, promote the sharing and reuse of scientific data, and improve the effectiveness of dataset retrieval in data repositories and discovery platforms. [Method/process] The scientific dataset retrieval process is divided into two main stages: candidate dataset retrieval and candidate dataset reranking. In the first stage, the BM25 model is combined with a SimCSE-based dense retrieval model to obtain potentially relevant datasets. In the second stage, a BERT-based ranking model scores the relevance of each candidate dataset, and the result ranking is optimized according to these scores. [Result/conclusion] Metadata for about 100,000 datasets were collected from 13 humanities and social sciences scientific data repositories in China and abroad for retrieval experiments and evaluation. The results show that the proposed model performs best: its NDCG@10 score is 23.6 and 11.7 percentage points higher than the baseline vector space model and BM25 model, respectively. Analysis of the retrieval results of each model shows that the proposed model has stronger semantic retrieval capability than the baseline models. In addition, the model weight settings are analyzed, which can serve as a reference for parameter settings in practical applications. [Limitations] The model was validated only on English-language humanities and social sciences datasets.
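The two-stage process described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names, the min-max normalization, the linear weighted sum with a default weight of 0.5, and the linear-gain form of NDCG are hypothetical choices. The sparse and dense scores are assumed to be precomputed (for example by BM25 and a SimCSE encoder), and `rerank_score` stands in for a BERT-based relevance scorer.

```python
import math
from typing import Callable, Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so sparse and dense scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(sparse: Dict[str, float], dense: Dict[str, float],
         weight: float = 0.5) -> Dict[str, float]:
    """Stage 1 (assumed form): weighted sum of normalized sparse and dense scores."""
    s, d = min_max(sparse), min_max(dense)
    return {doc: weight * s.get(doc, 0.0) + (1.0 - weight) * d.get(doc, 0.0)
            for doc in set(s) | set(d)}


def retrieve_and_rerank(sparse: Dict[str, float], dense: Dict[str, float],
                        rerank_score: Callable[[str], float],
                        weight: float = 0.5, top_n: int = 100) -> List[str]:
    """Keep the top_n fused candidates, then reorder them by the reranker score."""
    fused = fuse(sparse, dense, weight)
    # Ties are broken by document id so the ordering is deterministic.
    candidates = sorted(fused, key=lambda doc: (-fused[doc], doc))[:top_n]
    # Stage 2: rerank_score is a placeholder for a BERT-based ranking model.
    return sorted(candidates, key=lambda doc: (-rerank_score(doc), doc))


def ndcg_at_k(ranking: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """NDCG@k with linear gain; documents without a judgment count as 0."""
    dcg = sum(relevance.get(doc, 0.0) / math.log2(i + 2)
              for i, doc in enumerate(ranking[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Sweeping `weight` between 0 and 1 and comparing `ndcg_at_k` on a set of judged queries is one simple way to examine weight settings of the kind the abstract mentions.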
Source
Information Studies: Theory & Application (《情报理论与实践》)
CSSCI
PKU Core Journal
2022, Issue 7, pp. 49-56 (8 pages)
Funding
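This work is an outcome of the National Social Science Fund of China Key Project "Research on Key Issues and Platform Construction for Unified Discovery of Open Scientific Datasets" (Grant No. 20ATQ007).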