Abstract
To retrieve topic-relevant resources efficiently, this paper studies vertical search engines. First, building on the existing PageRank algorithm, it proposes an improved PageRank algorithm to measure the link similarity of web pages. Second, considering each page individually, it gives a method for computing content-based similarity from the page's URL, title, and body text. Finally, it combines content similarity and link similarity into BLCT, a topic crawling algorithm based on both links and content. Experimental results show that the algorithm significantly improves the average harvest rate and the target recall rate, and that the crawled pages are more relevant to the topic.
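The abstract does not give the BLCT formulas, so the following is only a minimal sketch of the general idea of scoring a page by both content and link evidence. It assumes cosine similarity over term counts for the URL, title, and body, and a simple linear combination with a link score. The field weights, the mixing factor `alpha`, and all function names are hypothetical and are not taken from the paper; the link score is passed in as a plain number rather than computed by the paper's improved PageRank.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between the term-count vectors of two strings."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def content_similarity(page, topic, weights=(0.2, 0.3, 0.5)):
    """Weighted similarity over URL, title, and body.

    The weights are illustrative assumptions, not values from the paper.
    """
    w_url, w_title, w_body = weights
    return (w_url * cosine(page["url"], topic)
            + w_title * cosine(page["title"], topic)
            + w_body * cosine(page["body"], topic))


def combined_score(page, topic, link_score, alpha=0.5):
    """Linear mix of content similarity and a link score in [0, 1].

    alpha is a hypothetical mixing factor; link_score stands in for
    whatever link-based measure the crawler uses.
    """
    return alpha * content_similarity(page, topic) + (1 - alpha) * link_score


if __name__ == "__main__":
    topic = "web crawler"
    page = {
        "url": "http://example.com/web-crawler",
        "title": "Building a focused web crawler",
        "body": "A focused crawler downloads pages relevant to a topic.",
    }
    print(combined_score(page, topic, link_score=0.4))
```

A crawler built along these lines could keep its frontier in a priority queue ordered by `combined_score`, so that pages with stronger content and link evidence are fetched first.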
Source
《计算机应用研究》
CSCD
Peking University Core Journal (北大核心)
2011, Issue 2, pp. 495-497, 528 (4 pages in total)
Application Research of Computers