Abstract
General-purpose search engines perform poorly at retrieving Geographic Information Web Services (GIServices). To address this, the paper proposes a service crawler based on topic relevance. First, the topic features of GIServices are defined and represented with the Vector Space Model (VSM). Second, a feature-weight calculation is introduced to measure the relevance between page content and the topic vector, and pages unrelated to the topic are filtered out. Third, an improved PageRank algorithm analyzes the importance of each hyperlink from two aspects, its URL and its anchor text, in order to optimize the crawl queue. Experimental results show that, compared with general-purpose search engines, the method performs well in both service retrieval efficiency and capturing ability.
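The relevance filter described in the abstract can be illustrated with a minimal sketch. It assumes a plain term-frequency page vector and cosine similarity against a weighted topic vector; the abstract does not give the paper's actual weighting formula, so the terms, weights, threshold, and the function names `topic_relevance` and `is_relevant` are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical topic vector (term -> weight). These terms and weights are
# illustrative assumptions, not the feature set defined in the paper.
TOPIC_VECTOR = {
    "wms": 1.0,
    "wfs": 1.0,
    "getcapabilities": 0.8,
    "ogc": 0.6,
    "map": 0.3,
}


def term_frequencies(text):
    """Count lowercase alphanumeric tokens in the page text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def topic_relevance(text, topic=TOPIC_VECTOR):
    """Cosine similarity between the page's term-frequency vector
    (restricted to topic terms) and the weighted topic vector."""
    tf = term_frequencies(text)
    page = {term: tf[term] for term in topic}
    dot = sum(page[term] * weight for term, weight in topic.items())
    norm_page = math.sqrt(sum(v * v for v in page.values()))
    norm_topic = math.sqrt(sum(w * w for w in topic.values()))
    if norm_page == 0 or norm_topic == 0:
        return 0.0  # no topic terms on the page
    return dot / (norm_page * norm_topic)


def is_relevant(text, threshold=0.3):
    """Keep a page only if its relevance reaches the (assumed) threshold."""
    return topic_relevance(text) >= threshold
```

With these assumed weights, a page that mentions several service-specific terms scores close to 1, while a page that only repeats a weak term such as "map" falls below the threshold and would be dropped from the crawl.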
Source
《地理与地理信息科学》
CSCD
PKU Core Journal (北大核心)
2012, No. 2, pp. 27-30 (4 pages)
Geography and Geo-Information Science
Funding
National Natural Science Foundation of China (Grant No. 41001216)
Keywords
geographic information Web services
service retrieval
crawler
topic-relevance