Abstract
The number of webpages on the Internet is growing at a remarkable rate. Faced with so many pages, a user often needs only the pages of a particular website, or only the pages hosted in a certain region, and a general-purpose crawler is of little help in such cases. Addressing this shortcoming of general-purpose crawlers, this paper presents the concepts and techniques of the restricted crawler and, building on the Heritrix framework, implements an IP-address-based restriction so that the crawler fetches only webpages hosted in a given region. Finally, experiments demonstrate that the restricted crawler is both reasonable and practical.
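As a rough illustration of the IP-based restriction described above, the sketch below shows only the core check, independent of Heritrix itself: resolve each candidate host to an IPv4 address and accept it only if that address falls inside a range associated with the target region. The class name RegionIpFilter, the example range 202.112.0.0/16, and the hosts tested in main are all hypothetical; the paper's actual implementation wires an equivalent check into Heritrix's crawl configuration, which this sketch does not attempt to reproduce.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/**
 * Minimal sketch (not the paper's code) of IP-based regional
 * filtering: a host is accepted only if it resolves to an IPv4
 * address inside a configured regional range.
 */
public class RegionIpFilter {
    private final long rangeBase; // base address of the regional range
    private final long rangeMask; // network mask derived from the prefix length

    public RegionIpFilter(String baseAddress, int prefixLength) throws UnknownHostException {
        this.rangeBase = toLong(InetAddress.getByName(baseAddress).getAddress());
        this.rangeMask = prefixLength == 0
                ? 0
                : (-1L << (32 - prefixLength)) & 0xFFFFFFFFL;
    }

    /** Returns true if the host resolves to an address inside the range (IPv4 assumed). */
    public boolean accepts(String host) {
        try {
            long ip = toLong(InetAddress.getByName(host).getAddress());
            return (ip & rangeMask) == (rangeBase & rangeMask);
        } catch (UnknownHostException e) {
            return false; // unresolvable hosts are rejected
        }
    }

    /** Packs a 4-byte IPv4 address into an unsigned 32-bit value held in a long. */
    private static long toLong(byte[] addr) {
        long value = 0;
        for (byte b : addr) {
            value = (value << 8) | (b & 0xFF);
        }
        return value;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical range standing in for "hosts located in the target region".
        RegionIpFilter filter = new RegionIpFilter("202.112.0.0", 16);
        System.out.println(filter.accepts("202.112.1.10")); // true: inside the range
        System.out.println(filter.accepts("8.8.8.8"));      // false: outside the range
    }
}
```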
Source
《计算机应用与软件》 (Computer Applications and Software)
CSCD
Peking University Core Journals (北大核心)
2013, No. 4, pp. 33-35, 80 (4 pages)
Funding
National Natural Science Foundation of China (Grant No. 61170255)