Abstract
This paper presents the design and implementation of a distributed Web crawler system. The architecture consists of two main parts, control nodes and crawl nodes, and the paper describes solutions to the key technical problems of the distributed system. The system uses a two-level hash mapping algorithm for task assignment, which addresses target-oriented, load-balanced URL allocation, and nodes cooperate through message passing. The paper proposes a genetic algorithm as the search strategy of this topic crawler and gives an improved method for the web page update strategy.
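The abstract does not spell out the two-level hash mapping, so the following is only a minimal sketch of one common reading of the idea, not the paper's algorithm. It assumes the first level hashes a URL's host into a fixed number of buckets and the second level maps each bucket to a crawl node through a table that a control node could edit to rebalance load. The names TwoLevelAssigner, bucket_for, node_for, move_bucket and the bucket count are all hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def stable_hash(text: str) -> int:
    """Deterministic hash; Python's built-in hash() is salted per process."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TwoLevelAssigner:
    """Hypothetical two-level URL-to-node mapping (assumed design)."""

    def __init__(self, nodes, num_buckets=64):
        if not nodes:
            raise ValueError("at least one crawl node is required")
        self.num_buckets = num_buckets
        # Level 2 table: bucket index -> crawl node, filled round-robin.
        self.bucket_to_node = [nodes[i % len(nodes)] for i in range(num_buckets)]

    def bucket_for(self, url: str) -> int:
        # Level 1: hash the host so that all URLs of one site share a bucket.
        host = urlsplit(url).netloc.lower()
        return stable_hash(host) % self.num_buckets

    def node_for(self, url: str) -> str:
        # Level 2: look up which crawl node currently owns the bucket.
        return self.bucket_to_node[self.bucket_for(url)]

    def move_bucket(self, bucket: int, node: str) -> None:
        # Rebalancing hook: reassign one bucket without rehashing any URL.
        self.bucket_to_node[bucket] = node


assigner = TwoLevelAssigner(["node-a", "node-b", "node-c"], num_buckets=8)
print(assigner.node_for("http://example.com/index.html"))
```

Under these assumptions, moving load between crawl nodes only changes entries in the small bucket table; the first-level hash of every URL stays the same.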
Source
《计算机应用与软件》
CSCD
2010, No. 12, pp. 135-138 (4 pages)
Computer Applications and Software
Keywords
Topic crawler
Distributed
Genetic algorithm
Search engine