Abstract
As an emerging technology, the distributed web spider has already been applied in several large commercial search engine systems. This work focuses on URL deduplication, a core problem of the distributed web spider. Building on memory-based deduplication, it extends and improves the traditional generalized-list data structure and proposes a new URL deduplication algorithm based on an improved in-memory generalized list. Compared with traditional deduplication algorithms, the new algorithm keeps space overhead within an acceptable range while effectively reducing the time of a single deduplication check, so that deduplication on the central control server no longer becomes the bottleneck of the whole system.
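The abstract describes storing URLs in an in-memory generalized-list structure so that a single membership check is fast. The paper's exact node layout is not given here; the sketch below illustrates the general idea with a nested-dict (trie-like) structure, where the split into host and path segments and the `UrlDeduplicator` name are assumptions for illustration only.

```python
# Minimal sketch of memory-based URL deduplication using a nested-dict
# structure (one level of nesting per URL segment, in the spirit of a
# generalized list). Shared prefixes are stored once, which keeps both
# lookup time and memory bounded.
from urllib.parse import urlparse

class UrlDeduplicator:
    def __init__(self):
        self.root = {}  # nested dicts: one level per URL segment

    def _segments(self, url):
        parts = urlparse(url)
        # Host labels are reversed so URLs from related hosts share a prefix.
        segs = list(reversed(parts.netloc.split(".")))
        segs += [s for s in parts.path.split("/") if s]
        if parts.query:
            segs.append("?" + parts.query)
        return segs

    def seen_before(self, url):
        """Insert url into the structure; return True if it was already present."""
        node = self.root
        for seg in self._segments(url):
            node = node.setdefault(seg, {})
        if node.get("$end"):       # sentinel key marking a complete URL
            return True
        node["$end"] = True
        return False
```

In a distributed crawler, such a structure would live on the central control server: each crawler submits candidate URLs, and only those for which `seen_before` returns `False` are dispatched for fetching.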
Source
《平顶山学院学报》
2009, No. 5, pp. 116-119 (4 pages)
Journal of Pingdingshan University
Keywords
web spider
distributed
URL deduplication
generalized lists