Abstract
Traditional pseudo relevance feedback (PRF) methods use the whole document as the basic unit from which expansion terms are extracted. This coarse extraction granularity increases the amount of noise in the expansion source. This paper applies topic analysis techniques to mitigate the low quality of the expansion source. It obtains the semantic information hidden in the content of each document in the pseudo-relevant set and, from that information, extracts the abstract topic content relevant to the user query, which then serves as the basic extraction unit for query expansion. Experiments on the NTCIR-8 Chinese corpus, comparing against traditional PRF methods and topic-model-based PRF methods, show that the proposed method extracts expansion terms that better match the user query. The results also show that performing query expansion at the finer granularity of topic content effectively improves retrieval performance.
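The idea of extracting expansion terms from query-relevant topics, rather than from whole feedback documents, can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes that a topic model has already been fitted on the pseudo-relevant set and that its document-topic and topic-word distributions are available. The function name `expand_query`, the parameters `top_topics` and `top_terms`, the toy data, and the topic scoring rule (topic prevalence times the probability mass the topic assigns to the query terms) are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def expand_query(query_terms, doc_topic, topic_word, top_topics=1, top_terms=3):
    """Pick expansion terms from query-relevant topics (illustrative sketch).

    doc_topic:  list of {topic_id: P(topic | doc)}, one dict per feedback document
    topic_word: {topic_id: {word: P(word | topic)}}
    """
    # Average prevalence of each topic across the pseudo-relevant set.
    prevalence = defaultdict(float)
    for dist in doc_topic:
        for topic, prob in dist.items():
            prevalence[topic] += prob / len(doc_topic)

    # Assumed relevance score: prevalence times the probability mass
    # that the topic assigns to the query terms.
    topic_score = {
        topic: prevalence[topic] * sum(words.get(q, 0.0) for q in query_terms)
        for topic, words in topic_word.items()
    }

    # Keep only the most query-relevant topics as the extraction unit.
    selected = sorted(topic_score, key=topic_score.get, reverse=True)[:top_topics]

    # Weight candidate terms using the selected topics only,
    # skipping words that are already in the query.
    term_score = defaultdict(float)
    for topic in selected:
        for word, prob in topic_word[topic].items():
            if word not in query_terms:
                term_score[word] += topic_score[topic] * prob

    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(term_score.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_terms]]


# Toy distributions (made up for illustration, not taken from the paper).
doc_topic = [{0: 0.8, 1: 0.2}, {0: 0.6, 1: 0.4}]
topic_word = {
    0: {"solar": 0.4, "panel": 0.3, "energy": 0.2, "cell": 0.1},
    1: {"stock": 0.5, "market": 0.4, "energy": 0.1},
}
print(expand_query(["solar", "energy"], doc_topic, topic_word, top_terms=2))
# prints ['panel', 'cell']
```

In this toy case, words from the off-topic finance topic ("stock", "market") never become candidates, even though that topic appears in both feedback documents. This mirrors the noise reduction that the abstract attributes to a finer extraction granularity.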
Authors
闫蓉
高光来
YAN Rong; GAO Guanglai (College of Computer Science, Inner Mongolia University, Hohhot 010021, China)
Source
《计算机科学与探索》
CSCD
Peking University Core Journals list
2017, Issue 5, pp. 814-821 (8 pages)
Journal of Frontiers of Computer Science and Technology
Funding
National Natural Science Foundation of China, No. 61263037
Natural Science Foundation of Inner Mongolia, Nos. 2014BS0604, 2014MS0603
Keywords
topic model
topic content
pseudo relevance feedback (PRF)