Abstract
This paper proposes and implements an architecture for an incremental Web crawler based on multithreaded parallelism, addressing how to crawl selected portions of the Web efficiently (parallel Web crawling). The model downloads Web pages concurrently in multiple threads to achieve comprehensive, efficient, and flexible information gathering. In the implementation, recent features of the Java language and a dedicated URL scheduling strategy ensure that the threads download in parallel without overlapping one another. Page analysis extracts URLs and continuously supplies the threads with new download sources, while a fingerprint-based duplicate detection algorithm keeps the parallel crawl synchronized and removes redundancy. Tests of the system show that it effectively improves information gathering performance.
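The interaction of the three pieces named in the abstract (worker threads, URL scheduling, and fingerprint-based duplicate removal) can be illustrated with a minimal Java sketch. It is not the paper's implementation: the class name `ParallelCrawlerSketch`, the `fetchLinks` function (a stand-in for "download a page and extract its links"), and the choice of an MD5 hash of the URL as the fingerprint are all assumptions made for illustration, and incremental re-crawling is not shown. The sketch uses `java.util.concurrent`; the abstract does not say which Java features the authors relied on.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch; one crawl per instance.
public class ParallelCrawlerSketch {
    // Fingerprints of every URL ever scheduled. add() is atomic, so exactly
    // one thread "wins" each URL and the threads never download the same page.
    private final Set<String> seenFingerprints = ConcurrentHashMap.newKeySet();
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    // Scheduled-but-unfinished tasks, plus 1 for the seeding step itself.
    private final AtomicInteger pending = new AtomicInteger(1);
    private final CountDownLatch done = new CountDownLatch(1);
    private final ExecutorService pool;
    // Assumed stand-in for downloading a page and analysing it for links.
    private final Function<String, List<String>> fetchLinks;

    public ParallelCrawlerSketch(int threads, Function<String, List<String>> fetchLinks) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.fetchLinks = fetchLinks;
    }

    // Assumed fingerprint: hex-encoded MD5 of the URL string.
    static String fingerprint(String url) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // URL scheduling: hand a URL to the pool only if its fingerprint is new.
    private void schedule(String url) {
        if (!seenFingerprints.add(fingerprint(url))) return; // duplicate, skip
        pending.incrementAndGet();
        pool.execute(() -> {
            try {
                visited.add(url);
                // Page analysis feeds newly found links back to the scheduler.
                for (String link : fetchLinks.apply(url)) schedule(link);
            } finally {
                finishOne();
            }
        });
    }

    private void finishOne() {
        if (pending.decrementAndGet() == 0) done.countDown();
    }

    // Crawls from the seeds and blocks until no work is left.
    public Set<String> crawl(List<String> seeds) {
        try {
            for (String seed : seeds) schedule(seed);
            finishOne(); // seeding is complete
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return visited;
    }

    public static void main(String[] args) {
        // A tiny in-memory "Web" so the sketch runs without network access.
        Map<String, List<String>> web = Map.of(
                "a", List.of("b", "c", "a"),
                "b", List.of("c"),
                "c", List.of("a", "d"),
                "d", List.of());
        Set<String> pages = new ParallelCrawlerSketch(
                4, url -> web.getOrDefault(url, List.of())).crawl(List.of("a"));
        System.out.println(pages.size() + " pages crawled: " + pages);
    }
}
```

In this sketch the shared fingerprint set is the only point where threads coordinate, which is one simple way to get the "parallel but non-overlapping" behaviour the abstract describes. A real crawler would replace `fetchLinks` with an HTTP download and an HTML link extractor.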
Source
Computer Engineering (《计算机工程》)
EI
CAS
CSCD
PKU Core (北大核心)
2006, No. 20, pp. 97-99 (3 pages)
Funding
Supported by the Natural Science Foundation of Guangdong Province (Grant No. 5006102)
Keywords
Web
Information gathering
Search engine
Parallel