Parallel Web Crawler System with Incremental Update (增量更新并行Web爬虫系统) Cited by: 8
Abstract: This paper presents the overall architecture of a parallel Web crawler system and introduces an incremental-update crawling strategy to improve the efficiency of refreshing massive volumes of Web data. Because the crawlers in a cluster differ in capability, a vector measurement technique (a cosine vector model) is also proposed so that each crawler's capacity is fully used; it addresses the problem of matching crawl tasks to crawler capabilities. The crawl task vector and the crawler vector are defined, and parallel crawling algorithms are given on that basis. Practice shows that the system adapts well in task allocation and can incrementally improve the "freshness" of the Web page repository.
Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal, 2009, No. 4, pp. 1117-1119, 1127 (4 pages)
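Funding: National Natural Science Foundation of China (60573108); Shanghai Municipal Education Commission Development Fund (06QZ00207ZZ92); Shanghai Municipal Education Commission Key Scientific Research Innovation Project (08ZZ76); Shanghai Key Discipline Construction Project (s30501)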
Keywords: Web data crawling; parallel crawler; incremental update strategy; cosine vector method
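
The abstract does not give the vector definitions, so the following is only a minimal sketch of how matching crawl tasks to crawlers by cosine similarity might look. The vector dimensions (bandwidth, CPU, storage), the names `cosine_similarity` and `assign_tasks`, and all sample values are assumptions made for illustration, not the paper's definitions or algorithm.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_tasks(
    tasks: Dict[str, Sequence[float]],
    crawlers: Dict[str, Sequence[float]],
) -> Dict[str, str]:
    """Map each task name to the crawler whose capability vector is most similar."""
    if not crawlers:
        raise ValueError("at least one crawler is required")
    return {
        task: max(crawlers, key=lambda name: cosine_similarity(vec, crawlers[name]))
        for task, vec in tasks.items()
    }


# Hypothetical dimensions: (bandwidth, CPU, storage). Values are made up.
crawlers = {"crawler_a": [8.0, 2.0, 1.0], "crawler_b": [1.0, 2.0, 9.0]}
tasks = {"news_site": [9.0, 1.0, 1.0], "archive_site": [1.0, 1.0, 8.0]}
assignment = assign_tasks(tasks, crawlers)
```

With these sample vectors, `news_site` is paired with `crawler_a` and `archive_site` with `crawler_b`. Cosine similarity compares only the direction of the vectors, so a sketch like this matches the shape of a task's demands to a crawler's strengths but does not by itself balance load across crawlers.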