Abstract
The pressing need to retrieve information from the Web quickly and accurately has given rise to domain-specific search engines. In such an engine, the crawler collects information from the Web for a particular domain and is a core component. This paper studies the crawling of domain-specific Chinese web pages. Using the KL distance, it first verifies that the content of web pages and the context surrounding links differ in distribution. On that basis, it proposes a crawling algorithm for domain-specific Chinese web pages that is guided by a Naive Bayes classifier whose features are the link anchor text and its surrounding context, and it designs an algorithm that automatically acquires training data with link class labels. Taking financial web pages as an example, the proposed algorithms are tested both off-line and on-line. The results show that the crawler guided by the Naive Bayes classifier achieves a harvest rate of nearly 90% on domain-specific web pages.
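The two ingredients named in the abstract, a KL distance between word distributions and a Naive Bayes classifier over anchor text and link context, can be illustrated with a minimal sketch. It is not the paper's implementation: the function and class names, the add-alpha smoothing, the `relevant`/`irrelevant` labels, and the toy tokens (standing in for already segmented Chinese words) are all assumptions made for illustration. The harvest rate is taken here as the fraction of fetched pages that are on-topic.

```python
import math
from collections import Counter


def kl_divergence(p_tokens, q_tokens, alpha=1.0):
    """KL(P || Q) between two token lists, with add-alpha smoothing (assumed)."""
    vocab = set(p_tokens) | set(q_tokens)
    p_counts, q_counts = Counter(p_tokens), Counter(q_tokens)
    p_total = len(p_tokens) + alpha * len(vocab)
    q_total = len(q_tokens) + alpha * len(vocab)
    kl = 0.0
    for word in vocab:
        p = (p_counts[word] + alpha) / p_total
        q = (q_counts[word] + alpha) / q_total
        kl += p * math.log(p / q)
    return kl


class NaiveBayesLinkClassifier:
    """Multinomial Naive Bayes over anchor-text and link-context tokens."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_docs = Counter()   # number of training links per label
        self.token_counts = {}        # label -> Counter of token frequencies
        self.vocab = set()

    def fit(self, samples):
        """samples: iterable of (tokens, label) pairs."""
        for tokens, label in samples:
            self.class_docs[label] += 1
            self.token_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def log_posteriors(self, tokens):
        """Unnormalized log posterior per label; unseen tokens are ignored."""
        n_docs = sum(self.class_docs.values())
        scores = {}
        for label, n in self.class_docs.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / n_docs)
            for token in tokens:
                if token in self.vocab:
                    score += math.log((counts[token] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_posteriors(tokens)
        return max(scores, key=scores.get)


def harvest_rate(fetched_labels, target="relevant"):
    """Fraction of fetched pages whose label equals the target label."""
    if not fetched_labels:
        return 0.0
    return sum(1 for label in fetched_labels if label == target) / len(fetched_labels)


# Toy training links: tokens stand in for segmented anchor text plus context.
training_links = [
    (["fund", "net", "value"], "relevant"),
    (["stock", "market", "analysis"], "relevant"),
    (["bank", "interest", "rate"], "relevant"),
    (["celebrity", "gossip", "news"], "irrelevant"),
    (["football", "match", "live"], "irrelevant"),
    (["travel", "sights", "guide"], "irrelevant"),
]
classifier = NaiveBayesLinkClassifier().fit(training_links)
```

In a crawler, a link would be followed (or given higher priority in the frontier) when `classifier.predict(tokens)` returns the on-topic label. How the paper segments the Chinese text and how wide a context window it uses are not stated in the abstract and are left open here.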
Source
《中文信息学报》
CSCD
PKU Core (Peking University Core Journals)
2010, Issue 4, pp. 32-38, 62 (8 pages in total)
Journal of Chinese Information Processing