Abstract
The rapid growth of the World Wide Web poses unprecedented scaling challenges for general-purpose crawlers, and small-scale focused (topic-specific) Web crawlers have consequently become a research focus in recent years. This paper introduces the basic principles of focused Web crawling, along with its main functions and techniques. Based on an analysis of how topic-relevant pages are distributed across the Web, it proposes improving crawler performance by supplying the crawler with a high-quality seed set. The approach saves hardware and network resources and makes the collection easier to update.
Source
Computer and Information Technology (《电脑与信息技术》)
2004, No. 6, pp. 52-55 (4 pages)