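Abstract
To address the problems of the pseudo-relevance feedback model, namely the poor quality of the information in feedback documents and the query drift caused by unsuitable expansion-term selection, a query expansion method based on constrained semi-supervised clustering is proposed. The method manually labels the top k documents of the initial retrieval results, dividing them into relevant and irrelevant documents, and applies a semi-supervised clustering algorithm to the top n documents of the initial results to extract the documents relevant to the query as feedback documents. By learning the relevance between a small number of labeled documents and the query, the method can estimate fairly accurately the relevance of a large number of unknown documents to the query, which improves the quality of the feedback documents and thereby effectively improves the recall and precision of retrieval. Experimental results show that the method achieves better retrieval performance than traditional pseudo-relevance feedback and pseudo-relevance feedback based on unsupervised clustering.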
Because the feedback documents used by the pseudo-relevance feedback model are of poor quality and expansion terms are selected inappropriately, the new query often drifts from the original query. We propose a query expansion method based on constrained semi-supervised clustering. It labels the top k documents of the initial retrieval set in advance, dividing them into relevant and irrelevant documents; it then analyzes the top n documents using a semi-supervised clustering algorithm to find relevant documents to use as feedback documents. By learning from a small number of labeled documents, the algorithm can more accurately estimate the relevance to the query of a large number of documents whose relevance is unknown, thus improving the quality of the feedback information. The experimental results show that the proposed method outperforms both pseudo-relevance feedback and the query-likelihood language model.
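The pipeline described in the abstract (label the top k documents, cluster the top n with those labels as constraints, keep the relevant cluster as feedback documents) can be illustrated with a minimal sketch. The abstract does not name the specific semi-supervised clustering algorithm, so the sketch below assumes a seeded, constrained two-cluster k-means over bag-of-words vectors with cosine similarity; the function names (`tf_vector`, `select_feedback_docs`) and the whitespace tokenizer are hypothetical choices made for illustration only, not the authors' implementation.

```python
import math
from collections import Counter


def tf_vector(text):
    """L2-normalised term-frequency vector (whitespace tokenizer: an assumption)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def select_feedback_docs(docs, labels, iterations=10):
    """Return indices of documents placed in the 'relevant' cluster.

    docs:   the top-n documents of the initial retrieval (list of str)
    labels: {index: True/False} for the manually labelled top-k documents
    Labelled documents seed the two clusters and never change cluster
    (the constraint); unlabelled ones move to the nearer centroid.
    """
    if True not in labels.values() or False not in labels.values():
        raise ValueError("need at least one relevant and one irrelevant label")
    vecs = [tf_vector(d) for d in docs]
    assign = dict(labels)
    for _ in range(iterations):
        c_rel = centroid([vecs[i] for i, r in assign.items() if r])
        c_irr = centroid([vecs[i] for i, r in assign.items() if not r])
        new_assign = dict(labels)
        for i, v in enumerate(vecs):
            if i not in labels:
                # Ties go to the irrelevant cluster (a conservative choice).
                new_assign[i] = cosine(v, c_rel) > cosine(v, c_irr)
        if new_assign == assign:
            break
        assign = new_assign
    return sorted(i for i, r in assign.items() if r)
```

Expansion terms would then be drawn only from the returned documents rather than from all top-ranked documents, which is the step the abstract credits with reducing query drift.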
Source
《中国科技论文》
CAS
Peking University Core Journals (北大核心)
2013, No. 10, pp. 994-997 (4 pages)
China Sciencepaper
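Funding
National Natural Science Foundation of China (61073041, 61073043)
Natural Science Foundation of Heilongjiang Province (F200901)
Specialized Research Fund for the Doctoral Program of Higher Education (20112304110011, 20122304110012)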
Keywords
information retrieval
query expansion
constrained clustering
semi-supervised clustering
pseudo-relevance feedback