
An Incremental Parallel Web Information Gathering Method (cited by: 5)

A Parallel System of Incremental Web Information Gathering
Abstract: This paper proposes and implements a structural model for incremental Web information gathering based on multi-threaded parallelism, an approach also known as parallel Web crawling. The model collects Web pages concurrently across threads, which makes the gathering comprehensive, efficient, and flexible. In the implementation, the latest features of the Java language and a purpose-built URL scheduling strategy ensure that the threads download in parallel and that their downloads do not overlap. The page analysis stage extracts URLs and continuously supplies the threads with new download sources. A fingerprint comparison algorithm keeps the threads synchronized during parallel gathering and effectively removes redundant downloads. The system was tested, and the experiments show that it effectively improves information gathering performance.
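The combination described in the abstract (worker threads downloading in parallel, page analysis feeding new URLs back to them, and a fingerprint check keeping their work disjoint) can be illustrated with a minimal Java sketch. This is not the authors' implementation: the class name `ParallelCrawlSketch`, the in-memory `fakeWeb` map that stands in for real HTTP downloads, and the choice of MD5 over the URL string as the fingerprint are all assumptions made for illustration, since the abstract does not specify them.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: parallel crawl with fingerprint-based de-duplication. */
final class ParallelCrawlSketch {

    /** Assumption: stands in for the network; maps a URL to the links on that page. */
    private final Map<String, List<String>> fakeWeb;

    /** Fingerprints of URLs that some thread has already claimed. */
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** URLs that were actually "downloaded", kept for inspection. */
    private final Set<String> downloaded = ConcurrentHashMap.newKeySet();

    /** Number of submitted tasks that have not finished yet. */
    private final AtomicInteger pending = new AtomicInteger();

    /** Released when the last pending task finishes. */
    private final CountDownLatch done = new CountDownLatch(1);

    ParallelCrawlSketch(Map<String, List<String>> fakeWeb) {
        this.fakeWeb = fakeWeb;
    }

    /** Assumption: MD5 of the URL text, rendered as 32 hex characters. */
    static String fingerprint(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(url.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Crawls everything reachable from the seed using a fixed pool of worker threads. */
    Set<String> crawl(String seed, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            schedule(seed, pool);
            done.await(); // returns once no task is pending
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return downloaded;
    }

    /** Schedules a URL only if its fingerprint is new, so no two threads fetch the same page. */
    private void schedule(String url, ExecutorService pool) {
        if (!seen.add(fingerprint(url))) {
            return; // another thread already claimed this URL
        }
        pending.incrementAndGet(); // counted before the parent task can finish
        pool.execute(() -> {
            try {
                downloaded.add(url); // placeholder for the real download
                // "Page analysis": extracted links become new download sources.
                for (String link : fakeWeb.getOrDefault(url, Collections.emptyList())) {
                    schedule(link, pool);
                }
            } finally {
                if (pending.decrementAndGet() == 0) {
                    done.countDown();
                }
            }
        });
    }
}
```

The atomic `seen.add(...)` call is the single synchronization point between threads: whichever thread adds a fingerprint first owns that URL, and every later attempt is dropped as redundant. An incremental crawler would additionally keep fingerprints of page content between runs so that unchanged pages can be skipped; that part is omitted here.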
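Authors: 杨天奇 (Yang Tianqi), 周晔 (Zhou Ye)
Source: Computer Engineering (《计算机工程》), 2006, No. 20, pp. 97-99 (3 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding: Supported by the Natural Science Foundation of Guangdong Province (5006102)
Keywords: Web; information gathering; search engine; parallel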


