Abstract
Information retrieval is the process of finding information that users need on the Internet or in large document collections; it is typically divided into two stages, recall and ranking. To address the re-ranking of relevant documents in the ranking stage, a retrieval ranking method called PWFT-BERT (Pair-Wise Fine-Tuned Bidirectional Encoder Representations from Transformers) is proposed, which combines learning to rank with pre-trained models. First, a recall algorithm such as BM25 narrows the candidate paper collection to a small set of documents related to the query; PWFT-BERT is then applied to rank the recalled documents. To construct pair-wise training data, a pseudo-negative example generation algorithm is proposed, and the pre-trained model is fine-tuned with a learning-to-rank method to fit the ranking task. Compared with the TF-IDF and BM25 baselines, PWFT-BERT improves retrieval results on the WSDM-DiggSci 2020 dataset by 240% and 74%, respectively, demonstrating the effectiveness of the proposed method.
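As a rough illustration of the recall stage and the pair-wise data construction described above, the following Python sketch shows a BM25 recall step and one simple way to build (query, positive, pseudo-negative) training triples. The toy corpus, the whitespace tokenizer, the `rank_bm25` library, and the sampling rule are illustrative assumptions; the abstract does not specify the paper's actual pseudo-negative generation algorithm.

```python
# Minimal sketch of BM25 recall plus pair-wise triple construction.
# Assumes the rank_bm25 library and a toy whitespace-tokenized corpus.
import random
from rank_bm25 import BM25Okapi

corpus = [
    "bert pre-training of deep bidirectional transformers",
    "learning to rank for information retrieval a survey",
    "okapi bm25 a probabilistic retrieval framework",
    "graph neural networks a review of methods and applications",
]
tokenized = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

def recall(query: str, k: int) -> list[int]:
    """Recall stage: narrow the whole collection to the top-k candidates."""
    scores = bm25.get_scores(query.split())
    return sorted(range(len(corpus)), key=lambda i: -scores[i])[:k]

def make_triple(query: str, positive_idx: int, k: int = 3):
    """Build one pair-wise training triple (query, positive, pseudo-negative).
    Here the pseudo-negative is a highly-ranked recalled document that is
    not the labelled positive -- one common hard-negative heuristic, used
    as a stand-in for the paper's generation algorithm."""
    candidates = [i for i in recall(query, k) if i != positive_idx]
    return query, corpus[positive_idx], corpus[random.choice(candidates)]

print(make_triple("pre-training bidirectional transformers", positive_idx=0))
```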
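The pair-wise fine-tuning step could then look like the sketch below, which scores (query, document) pairs with a BERT cross-encoder and trains with a margin ranking loss so that the positive document outscores the pseudo-negative. The checkpoint name, margin, learning rate, and cross-encoder setup are assumptions, not details confirmed by the abstract.

```python
# Minimal sketch of pair-wise (learning-to-rank) fine-tuning of BERT.
# Model choice, margin, and optimizer settings are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "bert-base-uncased"  # assumption; the paper fine-tunes a BERT-style model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.MarginRankingLoss(margin=1.0)

def score(query: str, doc: str) -> torch.Tensor:
    """Relevance score for one (query, document) pair via the cross-encoder."""
    enc = tok(query, doc, truncation=True, max_length=256, return_tensors="pt")
    return model(**enc).logits.squeeze(-1)

def train_step(query: str, pos_doc: str, neg_doc: str) -> float:
    """One pair-wise update: push score(query, pos) above score(query, neg)."""
    s_pos, s_neg = score(query, pos_doc), score(query, neg_doc)
    loss = loss_fn(s_pos, s_neg, torch.ones_like(s_pos))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# At inference time, the recalled candidates are simply sorted by score():
# ranked = sorted(candidates, key=lambda d: score(query, d).item(), reverse=True)
```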
Authors
SU Ke, HUANG Ruiyang, ZHANG Jianpeng, HU Nan, YU Shiyuan (Zhengzhou University, Zhengzhou 450001, China; Information Engineering University, Zhengzhou 450001, China)
Source
Journal of Information Engineering University
2022, No. 4, pp. 460-466 (7 pages)
Funding
Young Scientists Fund of the National Natural Science Foundation of China (62002384)
China Postdoctoral Science Foundation General Program (2020M683760).
Keywords
natural language processing
information retrieval
learning to rank
pre-trained models
retrieval ranking