Abstract
This paper briefly introduces focused Web crawling systems and examines their core technologies from five aspects: seed page generation, topic representation, relevance computation strategies, crawling strategies, and strategies for ending the search. It discusses in detail the three modes of generating seed pages (manual, automatic, and hybrid); keyword-based and ontology-based topic representation; a comparison of several heuristic strategies for computing relevance; basic crawling strategies and tunneling techniques; and the various conditions under which a crawl is stopped. While analyzing the algorithms, characteristics, and applications of these techniques and comparing the merits and demerits of focused crawling algorithms, the paper also puts forward suggestions for improvement that address the particular characteristics of focused crawling.
Source
《图书情报工作》
CSSCI
PKU Core Journal (北大核心)
2005, No. 4, pp. 77-80, 70 (5 pages in total)
Library and Information Service
Keywords
Web
search engine
focused crawling
technology