
Window Comparison Based Incremental Crawling Approach for Websites

Cited by: 1
Abstract: Deduplication with a Bloom filter is currently one of the more effective techniques for incremental crawling of website information, but its false-positive rate rises as the number of stored elements grows. To address this, the paper designs and implements a window comparison based incremental crawling approach. The approach crawls a bounded length of data in one pass, following the order in which the website presents it, and places the data in a queue in that same order. A comparison window is set at the tail of the queue, and the degree of overlap between the data in the window and the previously crawled data determines whether crawling stops. Experiments show that the approach reduces crawling overhead when incrementally crawling websites whose content is not strictly ordered by time.
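The stopping rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `crawl_incrementally`, the `fetch_batch` callback, the batch and window sizes, the `stop_ratio` threshold, and the use of an exact in-memory set for previously crawled items are all assumptions made for illustration.

```python
from typing import Callable, Hashable, List, Set


def crawl_incrementally(
    fetch_batch: Callable[[int, int], List[Hashable]],  # hypothetical: (offset, count) -> items in display order
    seen: Set[Hashable],        # identifiers of previously crawled items (assumed exact set)
    batch_size: int = 20,       # assumed length of each bounded fetch
    window_size: int = 5,       # assumed size of the comparison window
    stop_ratio: float = 1.0,    # assumed overlap threshold that ends the crawl
) -> List[Hashable]:
    """Return items not yet in `seen`, in the order the site displays them."""
    queue: List[Hashable] = []
    offset = 0
    while True:
        batch = fetch_batch(offset, batch_size)
        if not batch:
            break  # the site has no more data
        queue.extend(batch)
        offset += len(batch)

        # The comparison window is the tail of the data queue.
        window = queue[-window_size:]
        duplicates = sum(1 for item in window if item in seen)
        # Stop once enough of the window has already been crawled.
        if len(window) == window_size and duplicates / window_size >= stop_ratio:
            break

    new_items: List[Hashable] = []
    for item in queue:
        if item not in seen:
            seen.add(item)  # also guards against repeats inside the queue
            new_items.append(item)
    return new_items
```

Because the decision depends on a whole window rather than on the first duplicate encountered, a new item displayed between older ones (as happens on sites not strictly sorted by time) can still be collected before the crawl stops.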
Source: Network New Media Technology (《网络新媒体技术》), 2017, No. 4, pp. 24-27 (4 pages)
Funding: Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDA06040602)
Keywords: incremental crawling; crawling efficiency; hash; Bloom filter