Abstract
This paper addresses how to crawl Weblogs effectively within selected sections of the Web, and proposes an RSS-based Weblog gathering algorithm together with an implementation scheme for a gathering system. Two crawlers are designed: one performs a breadth-first traversal of the Web to obtain the RSS address of each user; the other performs a vertical search of each RSS address, tracking whether new Weblog posts have appeared and loading updated posts into a database incrementally. A model system implementing the algorithm is also designed and built.
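The two-crawler design described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`discover_feeds`, `update_feed`, `open_db`), the `posts` table schema, the injected `fetch` callable, the use of `<link type="application/rss+xml">` tags to find feed addresses, and the choice of the item link as the uniqueness key are all assumptions made for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects outgoing page links and advertised RSS feed links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pages, self.feeds = [], []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        url = urljoin(self.base_url, href)
        if tag == "a":
            self.pages.append(url)
        elif tag == "link" and attributes.get("type") == "application/rss+xml":
            self.feeds.append(url)


def discover_feeds(seed, fetch, max_pages=100):
    """Crawler 1 (assumed form): breadth-first traversal collecting RSS URLs.

    `fetch` is a hypothetical callable mapping a URL to its text, or None.
    """
    queue, seen, feeds = deque([seed]), {seed}, []
    while queue and max_pages > 0:
        url = queue.popleft()  # FIFO order gives breadth-first traversal
        max_pages -= 1
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor(url)
        parser.feed(html)
        for feed_url in parser.feeds:
            if feed_url not in feeds:
                feeds.append(feed_url)
        for page_url in parser.pages:
            if page_url not in seen:
                seen.add(page_url)
                queue.append(page_url)
    return feeds


def open_db(path=":memory:"):
    """Creates the (assumed) table that holds gathered posts."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS posts("
        "link TEXT PRIMARY KEY, feed TEXT, title TEXT)"
    )
    return db


def update_feed(db, feed_url, fetch):
    """Crawler 2 (assumed form): insert only items not yet stored.

    Returns the number of newly added posts for this feed.
    """
    xml_text = fetch(feed_url)
    if xml_text is None:
        return 0
    added = 0
    for item in ET.fromstring(xml_text).iter("item"):
        link = item.findtext("link")
        if not link:
            continue
        known = db.execute(
            "SELECT 1 FROM posts WHERE link = ?", (link,)
        ).fetchone()
        if known is None:  # incremental step: skip posts already gathered
            db.execute(
                "INSERT INTO posts(link, feed, title) VALUES (?, ?, ?)",
                (link, feed_url, item.findtext("title", default="")),
            )
            added += 1
    db.commit()
    return added
```

In this sketch the first crawler would be run occasionally to grow the list of feed addresses, while `update_feed` would be called repeatedly for each known address; calling it twice on an unchanged feed adds nothing the second time.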
Source
《现代图书情报技术》
CSSCI
Peking University Core Journal
2007, No. 11, pp. 45-48 (4 pages)
New Technology of Library and Information Service
Funding
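One of the research outcomes of the 863 Program key project "Research on Key Technologies for Cross-Media Search and Development of Service Products" (Project No. 2006AA010105)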
Keywords
RSS
Weblog
Information gathering