Design and Implementation of a Distributed Topic Crawler (分布式主题爬虫的设计与实现), cited by: 3

RESEARCH AND IMPLEMENTATION OF A DISTRIBUTED TOPIC CRAWLER


Abstract: This paper studies and implements a distributed web crawler system. The architecture consists of two main parts, control nodes and crawl nodes, and the paper describes solutions to the key technical problems of the distributed system. The system uses a two-level hash mapping algorithm for task assignment, which addresses goal-oriented, load-balanced URL allocation. Nodes cooperate through message passing. The paper proposes a genetic algorithm as the search strategy for the topic crawler and gives an improved method for the web page update strategy.
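The abstract names two-level hash mapping for URL allocation but gives no details, so the following is only a minimal sketch of one common way such a scheme can work, not the authors' algorithm. It assumes that level one hashes a URL's host name into a fixed set of virtual buckets and that level two is a bucket-to-node table held by the control node, which can move buckets between crawl nodes to balance load. `NUM_BUCKETS`, `BucketTable`, `rebalance`, and the node names are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit

NUM_BUCKETS = 64  # hypothetical number of virtual buckets (assumption)


def stable_hash(text: str) -> int:
    """Deterministic hash; Python's built-in hash() is salted per process."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def bucket_of(url: str) -> int:
    """Level 1: map the URL's host name to a virtual bucket."""
    host = (urlsplit(url).hostname or "").lower()
    return stable_hash(host) % NUM_BUCKETS


class BucketTable:
    """Level 2: bucket -> crawl node table, assumed to live on the control node."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        # Start with a round-robin assignment of buckets to nodes.
        self.owner = {b: self.nodes[b % len(self.nodes)] for b in range(NUM_BUCKETS)}

    def node_for(self, url: str) -> str:
        """Return the crawl node responsible for this URL."""
        return self.owner[bucket_of(url)]

    def rebalance(self, load):
        """Move one bucket from the busiest node to the idlest node.

        `load` maps node name -> pending URL count. Returns the moved
        bucket number, or None if nothing was moved.
        """
        busiest = max(self.nodes, key=lambda n: load.get(n, 0))
        idlest = min(self.nodes, key=lambda n: load.get(n, 0))
        if busiest == idlest:
            return None
        for bucket, node in self.owner.items():
            if node == busiest:
                self.owner[bucket] = idlest
                return bucket
        return None
```

Because the first level depends only on the host name, all URLs of one site land on the same crawl node, and rebalancing changes only the small second-level table rather than rehashing every URL.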

Authors: 池勇敏 (Chi Yongmin), 郝泳涛 (Hao Yongtao)

Affiliation: CAD Research Center, Tongji University (同济大学CAD研究中心)

Source: 《计算机应用与软件》 (Computer Applications and Software), CSCD, 2010, No. 12, pp. 135-138 (4 pages)

Keywords: topic crawler; distributed; genetic algorithm; search engine

Classification: TP393 [Automation and Computer Technology: Computer Application Technology]

