Information Retrieval Method Based on Multi-Granularity Semantic Fusion
Abstract: Information retrieval (IR) is the process of organizing and processing information with specific techniques and methods to meet users' information needs. In recent years, dense retrieval methods based on pre-trained models have achieved significant success. However, these methods compute the relevance between a query and a document using only vector representations of the whole text and of individual words, ignoring the semantic information carried at the phrase level. To address this issue, an IR method called MSIR (Multi-Scale Information Retrieval) is proposed. The method improves IR performance by fusing semantic information of several granularities from the query and the document. First, semantic units at three granularities (word, phrase, and text) are constructed for the query and the document. Second, a pre-trained model encodes the three kinds of semantic units separately to obtain their semantic representations. Finally, these representations are used to compute the relevance between the query and the document. Comparison experiments were conducted on three classic datasets of different sizes: Corvid-19, TREC2019, and Robust04. Compared with ColBERT (a ranking model based on contextualized late interaction over BERT, Bidirectional Encoder Representations from Transformers), MSIR improves P@10, P@20, NDCG@10, and NDCG@20 on the Robust04 dataset by approximately 8%, and also achieves some improvement on the Corvid-19 and TREC2019 datasets. The experimental results indicate that MSIR can effectively fuse multi-granularity semantic information and thereby improve retrieval accuracy.
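
The three steps in the abstract (build word, phrase, and text units; encode each unit; score relevance from the representations) can be illustrated with a minimal sketch. This is not the authors' implementation, and everything specific in it is an assumption: `embed` is a toy character-trigram hashing function standing in for a pre-trained encoder, phrases are approximated by word bigrams, word-level and phrase-level matching use an averaged ColBERT-style maximum-similarity rule, and the three scores are fused with equal weights. The names `embed`, `phrases`, `max_sim`, and `multi_granularity_score` are hypothetical.

```python
import hashlib
import math
from typing import List, Sequence

DIM = 64  # size of the toy embedding space (assumption)


def embed(unit: str) -> List[float]:
    """Toy stand-in for a pre-trained encoder: hash character trigrams
    of the unit into a fixed-size count vector and L2-normalize it."""
    vec = [0.0] * DIM
    padded = f"  {unit.lower()} "
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        bucket = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; inputs are already unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def words(text: str) -> List[str]:
    """Word-level units: lowercase whitespace tokens."""
    return text.lower().split()


def phrases(text: str, n: int = 2) -> List[str]:
    """Phrase-level units, approximated here by word n-grams (assumption)."""
    ws = words(text)
    if len(ws) < n:
        return [" ".join(ws)] if ws else []
    return [" ".join(ws[i:i + n]) for i in range(len(ws) - n + 1)]


def max_sim(q_units: List[str], d_units: List[str]) -> float:
    """For each query unit take its best-matching document unit,
    then average over query units (late-interaction style)."""
    if not q_units or not d_units:
        return 0.0
    d_vecs = [embed(u) for u in d_units]
    total = 0.0
    for q in q_units:
        q_vec = embed(q)
        total += max(cosine(q_vec, d_vec) for d_vec in d_vecs)
    return total / len(q_units)


def multi_granularity_score(query: str, doc: str,
                            weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Fuse word-, phrase-, and text-level relevance with fixed weights
    (the equal weighting is an assumption, not taken from the paper)."""
    w_word, w_phrase, w_text = weights
    word_score = max_sim(words(query), words(doc))
    phrase_score = max_sim(phrases(query), phrases(doc))
    text_score = cosine(embed(query), embed(doc))
    return w_word * word_score + w_phrase * phrase_score + w_text * text_score


if __name__ == "__main__":
    query = "dense retrieval with pre-trained models"
    docs = [
        "pre-trained models enable dense retrieval of documents",
        "a recipe for slow cooked vegetable soup",
    ]
    # Rank the candidate documents by their fused score, highest first.
    for doc in sorted(docs, key=lambda d: multi_granularity_score(query, d),
                      reverse=True):
        print(f"{multi_granularity_score(query, doc):.3f}  {doc}")
```

In a realistic setting the `embed` function would be replaced by contextual representations from a pre-trained model, and the fusion weights would be tuned or learned; the sketch only shows how scores at three granularities can be combined into one relevance value.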
Authors: ZHAO Zhengyu; LUO Jing; TU Xinhui (School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, Hubei 430065, China; School of Computer Science, Central China Normal University, Wuhan, Hubei 430079, China)
Source: Journal of Computer Applications (《计算机应用》; indexed in CSCD and the Peking University Core Journals list), 2024, No. 6, pp. 1775-1780 (6 pages)
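Funding: Key Project of the State Language Commission of China (ZDI145-22); Humanities and Social Sciences Research Project of the Hubei Provincial Department of Education (18Q028).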
Keywords: semantic fusion; information retrieval (IR); dense retrieval; pre-trained model; text retrieval