Targeted Websites Harvest System Based on Nutch (cited by: 10)

Abstract: This paper compares four representative open-source web crawling tools (Nutch, Heritrix, WCT, and Web-Harvest) and, based on that comparison, proposes a targeted website harvesting system built on Nutch. It examines four key issues in detail: selecting the initial seed sites, managing the crawl process, denoising web page content, and discovering new seed sites.
Authors: Xu Jian (徐健), Zhang Zhixiong (张智雄)
Source: New Technology of Library and Information Service (《现代图书情报技术》), indexed in CSSCI and the Peking University Core Journals list, 2009, No. 4, pp. 1-6 (6 pages)
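Funding: One of the research outputs of the sub-project "Network Science and Technology Information Monitoring and Evaluation" (project No. 2006BAH03B05) under the National Science and Technology Support Program of the 11th Five-Year Plan.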
Keywords: targeted websites harvest system; Nutch; website crawl; web page denoising