Information Retrieval Method Based on Multi-Granularity Semantic Fusion


Abstract: Information Retrieval (IR) is the process of organizing and processing information with specific techniques and methods to meet users' information needs. In recent years, dense retrieval methods based on pre-trained models have achieved significant success. However, these methods use only vector representations of whole texts and individual words to compute the relevance between a query and a document, ignoring semantic information at the phrase level. To address this issue, an IR method called MSIR (Multi-Scale Information Retrieval) is proposed, which improves IR performance by fusing semantic information at several granularities from the query and the document. First, semantic units at three granularities (word, phrase, and text) are constructed from the query and the document. Then, a pre-trained model encodes the three kinds of semantic units separately to obtain their semantic representations. Finally, these representations are used to compute the relevance between the query and the document. Comparison experiments were conducted on three classic datasets of different sizes: Corvid-19, TREC2019, and Robust04. Compared with ColBERT (a ranking model based on Contextualized late interaction over BERT (Bidirectional Encoder Representations from Transformers)), MSIR achieves an improvement of approximately 8% in P@10, P@20, NDCG@10, and NDCG@20 on the Robust04 dataset, and also shows some improvement on the Corvid-19 and TREC2019 datasets. The experimental results indicate that MSIR can effectively fuse multi-granularity semantic information and thereby improve retrieval accuracy.
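The three steps in the abstract (build word, phrase, and text units; encode each; score relevance) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: `toy_encode` is a hash-based stand-in for a pre-trained encoder, phrases are approximated by adjacent word pairs, word-level and phrase-level scores use a ColBERT-style maximum-similarity match, and the three scores are fused by a weighted sum with hypothetical equal weights.

```python
import hashlib
import math

DIM = 16  # toy embedding size (assumption; a real encoder is far larger)


def toy_encode(unit):
    """Stand-in for a pre-trained encoder (assumption): a deterministic,
    L2-normalized pseudo-embedding derived from a SHA-256 hash."""
    digest = hashlib.sha256(unit.lower().encode("utf-8")).digest()
    vec = [b / 255.0 - 0.5 for b in digest[:DIM]]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def units(text):
    """Build semantic units at three granularities.
    Phrases are approximated by adjacent word pairs (assumption)."""
    words = text.lower().split()
    phrases = [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]
    return {"word": words, "phrase": phrases, "text": [" ".join(words)]}


def max_sim(q_units, d_units):
    """For each query unit, take its best-matching document unit,
    then average over query units. Returns 0.0 if either side is empty."""
    if not q_units or not d_units:
        return 0.0
    q_vecs = [toy_encode(u) for u in q_units]
    d_vecs = [toy_encode(u) for u in d_units]
    return sum(max(dot(q, d) for d in d_vecs) for q in q_vecs) / len(q_vecs)


def relevance(query, doc, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Fuse per-granularity scores by a weighted sum
    (hypothetical fusion rule and weights)."""
    q, d = units(query), units(doc)
    scores = [max_sim(q[g], d[g]) for g in ("word", "phrase", "text")]
    return sum(w * s for w, s in zip(weights, scores))


if __name__ == "__main__":
    query = "dense retrieval model"
    for doc in ("dense retrieval model", "a dense retrieval model for search"):
        print(doc, "->", round(relevance(query, doc), 3))
```

With this toy encoder, a document identical to the query scores approximately 1.0, because every unit at every granularity finds an exact match. Swapping `toy_encode` for a real pre-trained encoder would leave the rest of the structure unchanged.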

Authors: ZHAO Zhengyu; LUO Jing; TU Xinhui (School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, Hubei 430065, China; School of Computer Science, Central China Normal University, Wuhan, Hubei 430079, China)

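Affiliations: School of Computer Science and Technology, Wuhan University of Science and Technology; School of Computer Science, Central China Normal University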

Source: Journal of Computer Applications, CSCD, Peking University Core Journal, 2024, No. 6, pp. 1775-1780 (6 pages)

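Funding: Key Project of the State Language Commission (ZDI145-22); Humanities and Social Sciences Research Project of the Hubei Provincial Department of Education (18Q028).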

Keywords: semantic fusion; Information Retrieval (IR); dense retrieval; pre-trained model; text retrieval

Classification number: TP391.3 [Automation and Computer Technology: Computer Application Technology]

