Abstract
Several new crawling strategies have been proposed in recent years; a technique common to many of them is focused crawling. A focused crawler analyzes its crawl boundary to find the links most likely to be relevant to the crawl topic, and avoids irrelevant regions of the Web. This yields significant savings in hardware and network resources and helps keep the crawled pages up to date. To achieve such goal-directed crawling, this paper proposes two algorithms: a Web page filtering algorithm based on a multilayer classifier, which experimental results show to be more accurate and substantially faster than ordinary classification algorithms; and a URL ordering algorithm based on Web structure, which fully exploits the structural characteristics of the Web and the distribution characteristics of Web pages.
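The core idea of focused crawling described above, visiting URLs in order of estimated topical relevance and pruning irrelevant regions, can be illustrated with a minimal sketch. This is not the paper's multilayer classifier or its Web-structure-based ordering; the `relevance` scorer (simple term overlap), the `threshold`, and the toy `fetch`/`extract_links` callbacks are all assumptions made for illustration:

```python
import heapq

def relevance(text, topic_terms):
    """Assumed stand-in for a page classifier: fraction of topic terms in the text."""
    words = set(text.lower().split())
    return sum(t in words for t in topic_terms) / len(topic_terms)

def focused_crawl(seed_urls, fetch, extract_links, topic_terms,
                  max_pages=100, threshold=0.5):
    """Crawl a best-first frontier ordered by estimated relevance.

    Pages scoring below `threshold` are filtered out and not expanded,
    so irrelevant regions of the Web are never entered.
    """
    # Python's heapq is a min-heap, so scores are negated to pop best-first.
    frontier = [(-1.0, u) for u in seed_urls]
    heapq.heapify(frontier)
    visited, kept = set(), []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        score = relevance(fetch(url), topic_terms)
        if score >= threshold:                 # topic filter: prune off-topic pages
            kept.append(url)
            for link in extract_links(url):
                if link not in visited:
                    # Child URLs inherit the parent's score as a cheap estimate.
                    heapq.heappush(frontier, (-score, link))
    return kept

# Toy in-memory "Web" standing in for real fetching and link extraction.
pages = {
    "a": "web crawler search topic",
    "b": "cooking recipes",
    "c": "focused crawler topic relevant",
}
links = {"a": ["b", "c"], "b": [], "c": []}
result = focused_crawl(["a"], pages.get, lambda u: links[u],
                       ["crawler", "topic"])
```

Here page "b" scores zero against the topic terms, so it is dropped and its outlinks are never enqueued, which is the resource-saving behavior the abstract describes.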
Source
Computer and Modernization (《计算机与现代化》)
2004, No. 10, pp. 1-5, 14 (6 pages in total)
Funding
National Natural Science Foundation of China (Grant No. 79990580)
National 973 Program (Grant No. G1998030414)
Keywords
URL ordering
focused crawler
multi-layer classification
topic distillation