Abstract
As the Internet spreads and develops, the information resources available on the Web keep growing, and collecting them calls for efficient, intelligent tools. A tool that discovers and collects pages on the WWW is called a web page fetcher, or robot. This paper discusses combining a robot with an automatic text classifier to collect pages in the domain a user requests, that is, pages relevant to a predefined set of topics. The robot follows the links most likely to be relevant and avoids irrelevant regions of the Web. This saves hardware and network resources and makes the robot more efficient.
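The idea of a robot that follows relevant links and skips irrelevant ones can be illustrated with a minimal sketch. The record gives no implementation details, so everything here is an assumption made for illustration: pages are scored by cosine similarity between term-frequency vectors (a simple vector space model) and a topic description, a link inherits the score of the page it was found on, and `web` is a hypothetical in-memory stand-in for the WWW. The names `tf_vector`, `cosine`, `focused_crawl`, `threshold`, and `max_pages` are not taken from the paper.

```python
import heapq
import math
import re
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (a simple vector space model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def focused_crawl(web, seeds, topic_text, threshold=0.2, max_pages=10):
    """Visit pages best-first; expand links only from pages judged relevant.

    `web` is a hypothetical in-memory stand-in for the WWW:
    url -> (page_text, [outgoing urls]).
    """
    topic = tf_vector(topic_text)
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    collected = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        if url not in web:
            continue  # dangling link: nothing to fetch
        text, links = web[url]
        fetched += 1
        score = cosine(tf_vector(text), topic)
        if score < threshold:
            continue  # irrelevant page: do not follow its links
        collected.append((url, round(score, 3)))
        for link in links:
            if link not in seen:
                seen.add(link)
                # Assumption: a link inherits its parent's relevance score.
                heapq.heappush(frontier, (-score, link))
    return collected
```

In this sketch a page scoring below `threshold` is still fetched once, but none of its outgoing links enter the frontier, which is where the saving in network resources would come from. A trained text classifier could replace the single topic vector without changing the crawl loop.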
Source
《计算机工程》
CAS
CSCD
Peking University Core Journals (北大核心)
2003, No. 21, pp. 123-124, 127 (3 pages)
Computer Engineering
Keywords
Internet robot
Automatic text classification
Vector space model