
Design and Implementation of Qualified Spider Based on Heritrix (cited by: 13)
Abstract: The number of web pages on the Internet is growing at a remarkable rate. Faced with so many pages, users often need only the pages of a particular website, or only those hosted in a particular region, and a general-purpose spider cannot meet that need. Starting from this shortcoming of general-purpose spiders, the paper describes the concepts and techniques behind the qualified spider (a spider whose crawl scope is restricted) and, building on the Heritrix framework, implements one that uses IP addresses to restrict crawling to pages on hosts in a given region. Experiments reported at the end of the paper indicate that the qualified spider is reasonable and practical.
Authors: 张敏 (Zhang Min), 孙敏 (Sun Min)
Source: Computer Applications and Software (《计算机应用与软件》), indexed in CSCD and the Peking University Core Journals list, 2013, No. 4, pp. 33-35, 80 (4 pages in total)
Funding: National Natural Science Foundation of China (Grant No. 61170255)
Keywords: qualified spider; Heritrix; IP address; reasonability; practicality