
High performance parallel crawler (高性能并行爬行器)

Cited by: 7
Abstract: A web crawler is an important component of a search engine, responsible for gathering information from the web, yet crawler design is not well documented in the literature. This paper describes in detail the design and implementation of Chao, a high-performance parallel crawler, covering its system architecture, major modules, working process, and two core algorithms: scheduling and URL indexing. The scheduling algorithm applies two rounds of hashing, which not only achieves load balancing but also avoids collisions to some extent. The URL indexing algorithm incorporates tree search, achieving fast lookup while reducing storage requirements.
Affiliation: Beijing University of Technology
Source: Computer Engineering and Design (《计算机工程与设计》), indexed in CSCD and the Peking University Core Journals list, 2006, No. 24, pp. 4762-4766 (5 pages)
Keywords: search engine; information gathering; crawler; parallel; retrieval
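
The abstract names two techniques without giving their details: scheduling by two rounds of hashing, and a tree-based URL index. The following is a minimal sketch of what such techniques might look like, not Chao's actual algorithms. Everything in it is an assumption made for illustration: the names `assign` and `UrlTrie`, hashing by host name, the use of MD5 as a stable hash, and the choice of a trie keyed on URL path segments.

```python
import hashlib
from urllib.parse import urlsplit

_END = object()  # sentinel key marking "a URL ends at this trie node"


def _hash(text: str, salt: str) -> int:
    """Stable integer hash (the built-in hash() of str varies between runs)."""
    digest = hashlib.md5((salt + text).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def assign(url: str, num_crawlers: int, queues_per_crawler: int) -> tuple[int, int]:
    """Hypothetical two-round assignment: host -> crawler, then host -> queue.

    Hashing on the host keeps every URL of one site on the same crawler and
    queue, so two workers never fetch from the same site at the same time.
    """
    host = urlsplit(url).hostname or ""
    crawler = _hash(host, "round1") % num_crawlers
    queue = _hash(host, "round2") % queues_per_crawler
    return crawler, queue


class UrlTrie:
    """Set of seen URLs stored as a trie, so shared prefixes are stored once."""

    def __init__(self) -> None:
        self._root: dict = {}

    @staticmethod
    def _parts(url: str) -> list[str]:
        # Simplification: empty path segments are dropped, so "/a" and "/a/"
        # are treated as the same URL; scheme and fragment are ignored.
        u = urlsplit(url)
        parts = [u.hostname or ""]
        parts.extend(s for s in u.path.split("/") if s)
        if u.query:
            parts.append("?" + u.query)
        return parts

    def add(self, url: str) -> bool:
        """Insert url and return True if it had not been seen before."""
        node = self._root
        for part in self._parts(url):
            node = node.setdefault(part, {})
        if _END in node:
            return False
        node[_END] = True
        return True

    def __contains__(self, url: str) -> bool:
        node = self._root
        for part in self._parts(url):
            node = node.get(part)
            if node is None:
                return False
        return _END in node
```

In this sketch a dispatcher would call `assign(url, n, m)` to route a newly discovered URL, and the receiving crawler would call `UrlTrie.add(url)` to decide whether the URL still needs to be fetched.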

References (13)

  • 1. Junghoo Cho,Hector Garcia-Molina.Parallel crawlers[C].Honolulu:Proceedings of the 11th International World Wide Web Conference,ACM Press,2002.124-135.
  • 2. Sergey Brin,Lawrence Page.The anatomy of a large-scale hypertextual web search engine[J].Computer Networks and ISDN Systems,1998,30:107-117.
  • 3. Allan Heydon,Marc Najork.Mercator:A scalable,extensible web crawler[J].World Wide Web,1999,(2):219-229.
  • 4. Marc Najork,Janet L Wiener.Breadth-first search crawling yields high quality pages[C].Hong Kong:Proceedings of the 10th International World Wide Web Conference,ACM Press,2001.114-118.
  • 5. Paolo Boldi,Bruno Codenotti,Massimo Santini,et al.UbiCrawler:A scalable fully distributed web crawler[J].Software:Practice and Experience,2004,34(8):711-726.
  • 6. George Samaras,Odysseas Papapetrou.Distributed location aware web crawling[C].New York,USA:Proceedings of the 13th International World Wide Web Conference,ACM Press,2004.468-469.
  • 7. Boon Thau Loo,Sailesh Krishnamurthy,Owen Cooper.Distributed web crawling over DHTs[R].UC Berkeley Technical Report UCB//CSD-4-1305,2004.
  • 8. Marc Najork,Allan Heydon.High-performance web crawling[C].Handbook of Massive Data Sets,Kluwer Academic Publishers Inc,2001.25-45.
  • 9. Kasom Koht-arsa,Surasak Sanguanpong.High performance large scale web spider architecture[EB/OL].http://anres.cpe.ku.ac.th/pub/thesis-spider.pdf.
  • 10. Carlos Castillo,Mauricio Marin,Andrea Rodriguez,et al.Scheduling algorithms for web crawling[C].Brazil:WEBMEDIA and LA-WEB,IEEE CS Press,2004.10-17.

Second-level references (9)

  • 1. Cormen TH,Leiserson CE.Introduction to Algorithms.2nd ed.,Cambridge:MIT Press,2001.221-252.
  • 2. Knuth DE.Sorting and Searching,Volume 3 of The Art of Computer Programming.New York:Addison-Wesley,1973.506-549.
  • 3. McKenzie BJ,Harries R,Bell T.Selecting a hashing algorithm.Software Practice and Experience,1990,20(2):208-210.
  • 4. Tong MCF.General hashing [Ph.D.Thesis].Computer Science Department,University of Auckland,1996.
  • 5. Pearson PK.Fast hashing of variable length text strings.Communications of the ACM,1990,33(6):676-678.
  • 6. Berners-Lee T.Universal resource locator.2003.http://www.w3.org/Addressing/URL/Overview.html
  • 7. Yan HF,Wang JY,Li XM,Guo L.Architectural design and evaluation of an efficient Web-crawling system.Journal of System and Software,2002,60(3):185-193.
  • 8. Shaffer CA.Zhang M,Liu XD,Trans.Data Structure and Algorithm Analysis.Beijing:Publishing House of Electronics Industry,1998.211-213(in Chinese).
  • 9. Shaffer CA.Zhang M,Liu XD,Trans.Data Structure and Algorithm Analysis[M].Beijing:Publishing House of Electronics Industry,1998.211-213.

Co-citing documents: 44

Co-cited documents: 78

Citing documents: 7

Second-level citing documents: 103
