Abstract
The sheer volume of web pages and their rapid growth make it difficult for general-purpose search engines to return satisfactory results for topic- or domain-oriented queries. The topic crawler studied in this paper collects topic-relevant information, greatly reducing the number of pages that must be processed. It evaluates the topic relevance of each web page and crawls pages with higher relevance first. Using a subspace-based semantic analysis technique combined with a Bayesian classifier and a support vector machine, we design and implement an efficient topic crawler. Experiments show that the algorithm achieves good accuracy and efficiency.
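The priority-driven crawling strategy the abstract describes can be illustrated with a minimal best-first frontier: pages are visited in descending order of their estimated topic relevance. This is only a sketch of the general idea; the relevance scores here are hypothetical stand-ins for what the paper computes via subspace-based semantic analysis with a Bayes/SVM classifier.

```python
import heapq

def best_first_order(pages):
    """Return URLs in the order a best-first topic crawler would
    visit them: highest estimated topic relevance first.

    `pages` maps url -> relevance score in [0, 1]. In the paper the
    score would come from the subspace-based semantic analysis and
    Bayes/SVM classification; here the scores are given directly.
    """
    # heapq is a min-heap, so negate scores to pop the best page first.
    frontier = [(-score, url) for url, score in pages.items()]
    heapq.heapify(frontier)
    order = []
    while frontier:
        _neg_score, url = heapq.heappop(frontier)
        order.append(url)
    return order

# Example with hypothetical relevance scores.
print(best_first_order({"a": 0.2, "b": 0.9, "c": 0.5}))  # ['b', 'c', 'a']
```

In a full crawler, newly discovered links would be scored and pushed back onto the frontier, so the queue continuously steers the crawl toward high-relevance regions of the web.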
Source
《计算机工程与科学》
CSCD
PKU Core Journals (北大核心)
2010, No. 9, pp. 145-147, 151 (4 pages)
Computer Engineering & Science
Keywords
topic crawler
subspace
semantic analysis
support vector machine