
Improvement of the URL Deduplication Strategy for Distributed Web Crawlers (cited by: 3)

Improvement on Unrepeated Tactics of URL of Distributed Spider
Abstract: Distributed web crawling is an emerging technology that has already been deployed in some large commercial search engine systems. This paper focuses on URL deduplication, a core problem for distributed web crawlers. Starting from memory-based deduplication, it extends and improves the traditional generalized list data structure and proposes a new URL deduplication algorithm based on an in-memory improved generalized list. Compared with traditional deduplication algorithms, the proposed algorithm effectively shortens the time of a single deduplication check while keeping space usage within a feasible range, so that deduplication on the master control server is no longer the bottleneck of the whole system.
Author: Wu Xiaohui (吴小惠)
Source: Journal of Pingdingshan University (《平顶山学院学报》), 2009, No. 5, pp. 116-119 (4 pages)
Keywords: web spider; distributed; URL deduplication (unrepeated URL); generalized list
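
The abstract does not describe the improved generalized list itself, so the following is only a minimal sketch of the memory-based deduplication that the paper takes as its starting point. The class name `UrlSeenTable`, the method `add_if_new`, and the two-level host-to-paths layout are hypothetical and chosen for illustration; they are not the paper's data structure.

```python
from urllib.parse import urlsplit


class UrlSeenTable:
    """Hypothetical two-level in-memory table: host -> set of path+query strings.

    Illustrative assumption only; this is NOT the improved generalized list
    proposed in the paper, whose layout the abstract does not give.
    """

    def __init__(self):
        self._by_host = {}

    def add_if_new(self, url):
        """Record url and return True if it was unseen; return False otherwise."""
        parts = urlsplit(url)
        host = parts.netloc.lower()          # host names are case-insensitive
        path = parts.path or "/"             # treat an empty path as the root
        key = path + ("?" + parts.query if parts.query else "")
        seen = self._by_host.setdefault(host, set())
        if key in seen:
            return False
        seen.add(key)
        return True


table = UrlSeenTable()
results = [
    table.add_if_new(u)
    for u in [
        "http://example.com/a",
        "http://EXAMPLE.com/a",      # same host in a different case: duplicate
        "http://example.com/b?x=1",  # new path and query on a known host
        "http://example.org/a",      # same path on a different host
    ]
]
print(results)
```

Grouping by host keeps each per-host set small, so a single membership check touches only one bucket. This sketch ignores the URL scheme and fragment, which a real crawler might need to distinguish.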