Abstract
In federated search environments with multiple real search engines, resource selection algorithms based on the small-document approach achieve low accuracy because the true number of pages indexed by each search engine is difficult to estimate. To address this problem, a topic-model-based resource library description method is proposed, which uses the LDA topic model to obtain the description words of each resource library. On this basis, a new resource selection algorithm is proposed that combines vertical-domain weights and word vectors to compute the relevance between each resource library and the query, and produces the final resource selection result according to the relevance scores. Experimental results show that the proposed topic-model-based resource selection algorithm improves resource selection performance and can be effectively applied in the federated search environment of distributed search engines.
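The abstract does not give the scoring formula, so the following is only a minimal sketch of how a vertical-domain weight and word vectors might be combined into a relevance score. Everything in it is an assumption for illustration: the function names, the toy 2-D embeddings, the averaging of word vectors, and the multiplicative combination of weight and cosine similarity. The description words are assumed to have already been extracted by a topic model such as LDA; that step is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mean_vector(words, embeddings):
    """Average the vectors of the words that have an embedding."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def rank_resources(query_terms, descriptions, vertical_weights, embeddings):
    """Rank resource libraries by (vertical weight * cosine similarity).

    The multiplicative combination is an assumption of this sketch,
    not the formula from the paper.
    """
    q = mean_vector(query_terms, embeddings)
    scores = {}
    for name, words in descriptions.items():
        d = mean_vector(words, embeddings)
        sim = cosine(q, d) if q is not None and d is not None else 0.0
        scores[name] = vertical_weights.get(name, 1.0) * sim
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data: 2-D "word vectors" and two resource libraries
# whose description words are assumed to come from a topic model.
embeddings = {
    "football": [1.0, 0.0], "league": [0.9, 0.1], "goal": [0.8, 0.2],
    "stock": [0.0, 1.0], "market": [0.1, 0.9],
}
descriptions = {
    "sports_engine": ["football", "league"],
    "finance_engine": ["stock", "market"],
}
vertical_weights = {"sports_engine": 1.0, "finance_engine": 1.0}

ranking = rank_resources(["goal"], descriptions, vertical_weights, embeddings)
print(ranking)
```

With equal vertical weights the ranking is driven by the embedding similarity alone. Raising or lowering a weight shifts a resource library's score in proportion.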
Authors
董守斌
谢一帆
袁华
陈建豪
DONG Shou-bin, XIE Yi-fan, YUAN Hua, CHEN Jian-hao (School of Computer Science and Engineering // Computation & Computer Network Laboratory of Guangdong Province, South China University of Technology, Guangzhou 510006, Guangdong, China)
Source
《华南理工大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2017, No. 3, pp. 48-53 (6 pages)
Journal of South China University of Technology(Natural Science Edition)
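Funding
Major Basic Research Cultivation Project of the Natural Science Foundation of Guangdong Province (2015A030308017)
Ministry of Education-China Mobile Research Fund (MCM20150512)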
Keywords
distributed search
resource selection
topic model
vertical domain
word vector