Abstract
In focused crawling, a single classification algorithm generalizes poorly when the crawler must fetch and classify Web pages across multiple topics. To address this limitation, the paper proposes a strategy built on a combination of several strong classification algorithms: for the current topic task, the focused crawler evaluates and ranks the classifiers online and selects the best-ranked one to classify Web pages. Classification experiments were carried out over multiple topic crawling tasks, comparing the accuracy of each individual algorithm with the average accuracy of the multi-classifier combination and jointly analysing evaluation metrics such as classification accuracy and classification efficiency. The results indicate that the strategy overcomes, to a certain extent, the domain-specific limitation of a single classifier and applies more generally across topics.
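The selection step described in the abstract can be illustrated with a minimal sketch. It assumes that each classifier is a plain function from page text to an on-topic/off-topic label, and that ranking uses accuracy on a small labeled sample for the current topic. The toy classifiers, the sample data, and all names below are hypothetical and are not taken from the paper, which also considers classification efficiency.

```python
from typing import Callable, Dict, List, Tuple

# A classifier maps page text to True (on-topic) or False (off-topic).
Classifier = Callable[[str], bool]
Sample = Tuple[str, bool]


def accuracy(clf: Classifier, samples: List[Sample]) -> float:
    """Fraction of labeled samples the classifier labels correctly."""
    if not samples:
        return 0.0
    correct = sum(1 for text, label in samples if clf(text) == label)
    return correct / len(samples)


def rank_classifiers(pool: Dict[str, Classifier],
                     samples: List[Sample]) -> List[Tuple[str, float]]:
    """Score every classifier on the current topic and sort best-first."""
    scores = [(name, accuracy(clf, samples)) for name, clf in pool.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


def select_best(pool: Dict[str, Classifier], samples: List[Sample]) -> str:
    """Return the name of the top-ranked classifier for this topic."""
    return rank_classifiers(pool, samples)[0][0]


# Hypothetical stand-ins for trained strong classifiers.
def keyword_clf(text: str) -> bool:
    return "crawler" in text.lower()


def length_clf(text: str) -> bool:
    return len(text.split()) > 5


def always_on_topic(text: str) -> bool:
    return True


POOL: Dict[str, Classifier] = {
    "keyword": keyword_clf,
    "length": length_clf,
    "always": always_on_topic,
}

# Hypothetical labeled pages for one topic task.
SAMPLES: List[Sample] = [
    ("Focused crawler design for topical Web search", True),
    ("A crawler ranks links by topic relevance", True),
    ("Recipe for tomato soup with basil and cream and bread", False),
    ("Football results", False),
]

if __name__ == "__main__":
    for name, score in rank_classifiers(POOL, SAMPLES):
        print(f"{name}: {score:.2f}")
    print("selected:", select_best(POOL, SAMPLES))
```

Re-running the ranking whenever the topic task changes is what lets the crawler switch classifiers per topic. A fuller version would presumably weigh classification time alongside accuracy, as the evaluation in the abstract suggests.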
Source
《图书情报工作》
CSSCI
Peking University Core Journals (北大核心)
2013, Issue 14, pp. 114-120 (7 pages)
Library and Information Service
Keywords
focused crawling
focused crawler
Web page classification
classification algorithm
multi-classifier combination
classification accuracy
classification efficiency