Abstract
Crawlers are an essential component of search engines: they follow the hyperlinks between Web pages automatically and collect resources of all kinds. To improve the efficiency of collecting resources on a specific topic, text categorization techniques can be used to direct the crawl. This paper applies automatic text categorization based on the Support Vector Machine (SVM) to a chemistry focused crawler: an SVM classifier scores the crawled pages, and the scores guide the crawler toward chemistry-related pages and away from irrelevant ones. Comparison with a non-focused crawler based on breadth-first search and a focused crawler based on keyword matching shows that the focused crawler with the SVM classifier collects chemistry Web resources more efficiently.
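The idea of letting a classifier score steer the crawl can be illustrated with a minimal best-first sketch. It is not the paper's implementation: the toy link graph `TOY_WEB`, the term weights in `WEIGHTS`, the bias, and the rule that a link inherits the score of the page it was found on are all assumptions made for illustration, and `svm_score` is a hand-written linear decision function f(x) = w·x + b standing in for a trained SVM.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# Hypothetical toy "web": url -> (page text, outgoing links). Not from the paper.
TOY_WEB: Dict[str, Tuple[str, List[str]]] = {
    "seed": ("chemistry portal molecule reaction", ["news1", "chem1"]),
    "news1": ("football match report", ["news2"]),
    "chem1": ("organic molecule synthesis reaction", ["chem2"]),
    "news2": ("celebrity gossip", []),
    "chem2": ("catalyst reaction kinetics", []),
}

# Hypothetical weights and bias of a linear decision function,
# standing in for a trained linear SVM.
WEIGHTS = {
    "chemistry": 1.0, "molecule": 1.0, "reaction": 0.8,
    "synthesis": 0.7, "catalyst": 0.9,
    "football": -1.0, "gossip": -1.0,
}
BIAS = -0.5


def svm_score(text: str) -> float:
    """Linear score w.x + b over term counts; positive means on-topic."""
    return sum(WEIGHTS.get(token, 0.0) for token in text.split()) + BIAS


def focused_crawl(
    seed: str,
    fetch: Callable[[str], Tuple[str, List[str]]],
    score: Callable[[str], float],
    max_pages: int,
) -> List[str]:
    """Best-first crawl: always expand the link whose parent page scored highest."""
    counter = 0  # tie-breaker so equal scores are served first-in, first-out
    frontier: List[Tuple[float, int, str]] = [(0.0, counter, seed)]
    seen = {seed}
    visited: List[str] = []
    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        visited.append(url)
        page_score = score(text)
        for link in links:
            if link not in seen:
                seen.add(link)
                counter += 1
                # heapq is a min-heap, so the score is negated.
                heapq.heappush(frontier, (-page_score, counter, link))
    return visited


if __name__ == "__main__":
    print(focused_crawl("seed", TOY_WEB.__getitem__, svm_score, max_pages=4))
```

On this toy graph, a breadth-first crawl limited to four pages would spend its last fetch on `news2`, whereas the score-ordered frontier reaches `chem2` instead, because links found on the low-scoring `news1` page sink to the back of the queue.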
Source
Computers and Applied Chemistry (《计算机与应用化学》)
Indexed in: CAS, CSCD, Peking University Core Journals (北大核心)
2006, Issue 4, pp. 329-332 (4 pages)
Funding
Supported by the National Natural Science Foundation of China (Grant No. 20273076)