Abstract
This paper compares representative open-source web crawling software: Nutch, Heritrix, WCT, and Web-Harvest. Based on that comparison, it proposes a targeted website harvesting system built on Nutch and examines four key issues in detail: selecting the initial seed sites, managing the crawl process, denoising web page content, and discovering new seed sites.
Source
《现代图书情报技术》
CSSCI
PKU Core Journal (北大核心)
2009, Issue 4, pp. 1-6 (6 pages)
New Technology of Library and Information Service
Funding
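One of the research outputs of the sub-project "Monitoring and Evaluation of Network Scientific and Technical Information" (Project No. 2006BAH03B05) under the National Science and Technology Support Program of the 11th Five-Year Plan.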