Topic-Relevance Based Crawler for Geographic Information Web Services (cited by: 12)
Abstract General-purpose search engines handle the retrieval of Geographic Information Web Services (GIServices) poorly. To address this, the paper proposes a service crawler based on topic relevance. First, the topic features of GIServices are represented with the Vector Space Model (VSM), which simplifies their representation and calculation. Second, a feature-term weighting scheme is introduced to compute the relevance between page content and the topic feature vector, and pages unrelated to the topic are filtered out. Third, an improved PageRank algorithm analyzes the importance of each hyperlink from two aspects, its URL and its anchor text, in order to optimize the crawl queue. Experiments show that, compared with general-purpose search engines, the method performs well in both service retrieval efficiency and crawling capability.
Source Geography and Geo-Information Science (《地理与地理信息科学》), CSCD, Peking University Core Journal, 2012, No. 2, pp. 27-30 (4 pages)
Funding National Natural Science Foundation of China (41001216)
Keywords geographic information Web services; service retrieval; crawler; topic relevance
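
The two steps named in the abstract, filtering pages by topic relevance and ordering links by importance, can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract gives neither the weighting formula nor the modified PageRank, so the term weights in `TOPIC_WEIGHTS`, the 0.2 threshold, and the equal URL/anchor mix are all assumptions, and link importance is approximated by a plain keyword-similarity score rather than by PageRank.

```python
import heapq
import math
import re
from collections import Counter

# Hypothetical topic vector (term -> weight). The paper's real feature terms
# and weights are not stated in the abstract.
TOPIC_WEIGHTS = {
    "wms": 1.0, "wfs": 1.0, "getcapabilities": 0.9,
    "ogc": 0.8, "map": 0.4, "service": 0.3,
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(text, topic=TOPIC_WEIGHTS):
    """Cosine similarity between the topic vector and the text's term-frequency vector."""
    counts = Counter(tokenize(text))
    if not counts:
        return 0.0
    dot = sum(weight * counts[term] for term, weight in topic.items())
    norm_text = math.sqrt(sum(c * c for c in counts.values()))
    norm_topic = math.sqrt(sum(w * w for w in topic.values()))
    return dot / (norm_text * norm_topic)


def is_relevant(text, threshold=0.2):
    """Keep a page only if its relevance reaches the (assumed) threshold."""
    return relevance(text) >= threshold


def link_score(url, anchor_text):
    """Stand-in for link importance: equal mix of URL and anchor-text relevance."""
    return 0.5 * relevance(url) + 0.5 * relevance(anchor_text)


def order_links(links):
    """Return URLs from (url, anchor_text) pairs, highest score first."""
    heap = []
    for index, (url, anchor) in enumerate(links):
        # heapq is a min-heap, so the score is negated; index breaks ties.
        heapq.heappush(heap, (-link_score(url, anchor), index, url))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


if __name__ == "__main__":
    print(is_relevant("OGC WMS and WFS service endpoints"))
    print(order_links([
        ("http://example.org/about", "About us"),
        ("http://example.org/wms?request=GetCapabilities", "OGC WMS service"),
    ]))
```

A priority queue keeps the most promising link at the front of the crawl queue, which is the role the abstract assigns to the link-importance analysis. A real crawler would replace `link_score` with the paper's improved PageRank.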